Running Analysis Pipelines - guide to containerized MPI-parallel workflows
==========================================================================

Introduction
------------

Analysis pipelines for large-scale astronomical data processing can consist of a
variety of specialized software tools and libraries. These tools may have been
developed under operating systems and software environments that differ from the
actual runtime environment, i.e., a high-performance computing (HPC) cluster
like Ramses. Therefore, it may be beneficial to wrap workflows in containers.

Containerization is the packaging of all your analysis code with just the
operating system (OS) libraries and dependencies required to run the code. A
container is essentially a single lightweight executable that runs consistently
on any infrastructure. When the container concept is extended to MPI-parallel
applications, some of this independence from the hosting OS is lost, mainly
because MPI applications depend on low-level system libraries. For example,
efficient MPI communication often relies on low-latency network hardware like
InfiniBand, which requires specific drivers and libraries to be available inside
the container at both build and runtime.

This guide describes the creation of workflow containers for MPI-parallel
applications, with a focus on employing ``mpi4py``, the MPI bindings for Python,
as a high-level programming interface. Details are provided for building an
Apptainer/Singularity image that

* uses the **host's** MPI, according to Apptainer's **Bind model**,
* has ``mpi4py`` compiled within Apptainer,
* leverages InfiniBand via UCX/UCC.

A prerequisite for the material covered below is some general understanding of

* general containerization concepts
* general MPI-parallel concepts
* `Apptainer/Singularity `_ basics
* `Python/mpi4py `_
Container basics and Apptainer/Singularity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In principle, containerization involves three steps:

1. Define the container image (OS, dependencies, environment) via a definition
   file, i.e., what will the container do?
2. Build the container image from the definition file.
3. Instantiate a container from that image and run it on the host system.

Apptainer (formerly Singularity) is a containerization technology designed for
HPC environments. It is better suited to HPC systems than Docker, mainly
because root privileges are not required to run containers.

The first step in containerization is to create a container image. Container
images involve image definition files that specify, among other things, the
base OS image. The ``Dockerfile`` is Docker's default name for such a
definition file. In Apptainer, the Dockerfile counterpart is called a
Singularity Definition File (SDF). SDFs typically have the file extension
``*.def``.

A container image is then created from an SDF, producing a read-only template
containing instructions for spawning a container. The container image
encapsulates the filesystem, environment, and metadata. In Apptainer, images
are mostly stored in a format referred to as SIF (Singularity Image Format),
using the extension ``*.sif``.

Hence, a container is an instance of an image that runs as a process on the
host system. A container is not started the same way as a compiled binary, but
rather through the container engine, which is basically a container manager.
The engine provides the execution environment for container images and
virtualizes the resources for containerized applications. For example, on the
Windows OS, Docker Desktop is a container engine. In Apptainer, the container
engine is part of the ``apptainer`` command-line tool. So a command like
.. code:: bash

   apptainer exec myIMAGE.sif python myscript.py

lets Apptainer instantiate a container from the template ``myIMAGE.sif`` and
execute the Python script ``myscript.py``, where the latter runs inside the
container, i.e., it uses the OS, filesystem, and environment as defined in the
image.

Note that in the following, the words *container* and *apptainer* are used
synonymously in contexts that distinguish between operations that happen
inside or outside (on the host) of a container, i.e., one may read
*inside/outside Apptainer ...*.

MPI and ``mpi4py`` (MPI for Python)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Apptainer currently supports two open-source implementations of MPI: OpenMPI
and MPICH. This document focuses on OpenMPI.

MPI stands for Message Passing Interface and is a common standard for
performing communication across parallel computing architectures, which can be
compute nodes of a single system or span multiple compute platforms. MPI comes
in the form of libraries with a low-level set of commands that enable message
passing, parallel IO, etc. Owing to its low-level nature, MPI is often used in
conjunction with compiled HPC-suitable programming languages like C/C++ or
Fortran.

As opposed to more low-level scientific programming languages, Python is an
ideal candidate for implementing the higher-level parts of compute-intensive
applications. MPI for Python (``mpi4py``) has thus evolved in order to provide
MPI bindings for Python programs. ``mpi4py`` builds on the MPI specification
and provides an object-oriented interface based on the MPI-2 C++ bindings (see
also `MPI for Python `__).

MPI + Apptainer
~~~~~~~~~~~~~~~

According to the `Apptainer+MPI documentation `__, one distinguishes between
two different ways of joining Apptainer with MPI. The first way is referred to
as the *Host-MPI model*, also called the *Hybrid model*, which is useful when
shipping containerized MPI-parallel applications.
This model involves a containerized MPI working in conjunction with the host's
MPI. Therefore, the name *Hybrid* appears more suitable than *Host-MPI*, as
the latter may suggest involvement of only a host MPI (personal opinion of the
author). The second method, referred to as the *Bind model* and explained in
more detail below, involves mounting the host's MPI into Apptainer. Hence, the
containerized application uses the host MPI directly, that is, without a
containerized MPI layer.

Model 1: Hybrid model
^^^^^^^^^^^^^^^^^^^^^

The essence of the Hybrid model is that one executes some containerized MPI
program, which in practice is realized as follows:

::

   $ mpirun -n <NRANKS> apptainer exec <IMAGE> <EXE>

where ``<IMAGE>`` is the Apptainer image (residing on the host). Here,
``mpirun`` is executed on the host and launches the ``apptainer`` command
itself. Hence, each MPI rank is launched as a separate container process; in
other words, the host's process manager daemon (ORTED in OpenMPI) launches an
Apptainer container for each MPI rank. Inside Apptainer, the application
(``<EXE>``, i.e., ``/PATH/TO/MPI-PROGRAM/WITHIN/APPTAINER``) loads
containerized MPI libraries which connect back to ORTED via PMI (Process
Management Interface). This procedure results in MPI communication across
container boundaries. It is the necessary way if directly mounting the host
MPI into Apptainer is not possible due to security policies. The other case,
with less strict policies, leads to the second model, called the *Bind model*.

Model 2: Bind model
^^^^^^^^^^^^^^^^^^^

The essence of the Bind model involves mounting the host's MPI into Apptainer,
again expressed via the ``mpirun`` command:

::

   $ export MPI_DIR=<PATH/TO/HOST/MPI>
   $ mpirun -n <NRANKS> apptainer exec --bind $MPI_DIR <IMAGE> <EXE>

Similar to the hybrid approach, the bind approach starts the MPI application
by calling the MPI launcher (``mpirun``) from the host. However, note that the
major difference is the *bind/mount* of the host MPI through
``--bind $MPI_DIR``.
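On real systems, such a Bind-model invocation typically accumulates many ``--bind`` flags. As a small illustration of how such a command line can be assembled programmatically, here is a hypothetical Python sketch (``build_bind_command`` is an invented helper for this guide, not an actual Apptainer or ``appt_tools`` API):

```python
# Hypothetical sketch: build an "mpirun ... apptainer exec" argument list for
# the Bind model from a mapping of host directories to container mount points.

def build_bind_command(nranks, image, exe, mpi_dir, extra_binds=None):
    """Return the argument list for a Bind-model mpirun invocation."""
    binds = {mpi_dir: mpi_dir}        # host MPI mounted at the same path
    binds.update(extra_binds or {})   # e.g., UCX/UCC libs -> /opt/ucx, ...
    cmd = ["mpirun", "-n", str(nranks), "apptainer", "exec"]
    for host_dir, container_dir in binds.items():
        cmd += ["--bind", f"{host_dir}:{container_dir}:ro"]
    return cmd + [image] + exe.split()

cmd = build_bind_command(
    nranks=2,
    image="centos_stream9.sif",
    exe="python myscript.py",
    mpi_dir="/projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0",
    extra_binds={
        "/projects/sw/eb/arch/zen4/software/UCX/1.14.1-GCCcore-12.3.0/lib": "/opt/ucx",
    },
)
print(" ".join(cmd))
```

Mounting read-only (``:ro``) mirrors the commands shown later in this guide; the host MPI tree is mounted at its original path so that paths baked into the host MPI installation keep working inside the container.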
Figuratively speaking, the host sets the stage by launching ranks, providing
(MPI) libraries, etc., while the container just brings its own environment,
i.e., Python code and other dependencies, not including an MPI layer. This can
make such containers more lightweight, as they do not include any MPI
implementation.

Comparison Hybrid Model vs Bind Model in Apptainer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Hybrid model's advantage of providing a higher degree of (containerized)
autonomy comes with the disadvantage of a compatibility requirement between
container MPI and host MPI. The keyword is ABI (Application Binary Interface)
compatibility. Basically, an ABI is a low-level specification that defines how
compiled programs interact with the system and with each other at the binary
level. ABI compatibility of two (separately compiled) binaries ensures
matching specifications on data type sizes and alignment, function calling
conventions, system call numbers, register usage, stack layout, and exception
handling. Incompatibility may manifest in elusive runtime errors or
segmentation faults despite a seemingly successful compilation process.
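One rough, illustrative spot check (not a full ABI audit) is to compare the shared-library sonames a binary requires against those the host provides; a differing major version (e.g., ``libmpi.so.12`` vs ``libmpi.so.40``) is a strong hint of incompatibility, while a matching soname does not prove full compatibility. The helper below is a hypothetical sketch for this guide:

```python
# Hypothetical sketch: flag required sonames whose major version does not
# match what the host provides. A matching soname does NOT guarantee full
# ABI compatibility -- it only rules out the most obvious mismatches.

def soname_mismatches(required, provided):
    """required/provided: iterables of sonames like 'libmpi.so.40'."""
    def split(soname):
        name, _, version = soname.partition(".so.")
        return name, version
    available = dict(split(s) for s in provided)
    problems = []
    for soname in required:
        name, version = split(soname)
        if name not in available:
            problems.append(f"{soname}: not provided by host")
        elif available[name] != version:
            problems.append(f"{soname}: host has {name}.so.{available[name]}")
    return problems

# Example: a binary built against an old OpenMPI (libmpi.so.12) on a host
# shipping OpenMPI 4.x (libmpi.so.40):
print(soname_mismatches(["libmpi.so.12"], ["libmpi.so.40", "libucs.so.0"]))
# -> ['libmpi.so.12: host has libmpi.so.40']
```

In practice, the required sonames would come from ``readelf -d`` or ``ldd`` output on the container binary, and the provided ones from the host's library directories.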
.. Table for comparison of Hybrid vs Bind model MPI in Apptainer

+---------------------------+----------------------------------------------------------+------------------------------------------+
| Feature                   | **Hybrid Model**                                         | **Bind Model**                           |
+---------------------------+----------------------------------------------------------+------------------------------------------+
| **MPI inside container?** | Containerized MPI installation interacts with host MPI   | Container uses host MPI via bind mount   |
+---------------------------+----------------------------------------------------------+------------------------------------------+
| **MPI launcher location** | Host - ``mpirun`` launches apptainer (containers)        | Host - ``mpirun`` launches apptainer     |
+---------------------------+----------------------------------------------------------+------------------------------------------+
| **MPI process management**| OpenMPI daemon (ORTED) launches containers and connects  | No ORTED inside container —              |
|                           | via Process Management Interface (PMI)                   | host MPI handles everything              |
+---------------------------+----------------------------------------------------------+------------------------------------------+
| **Compatibility needs**   | ⚠️ Container MPI must be ABI-compatible with host MPI    | ✅ Compatibility given through host MPI   |
+---------------------------+----------------------------------------------------------+------------------------------------------+
| **Performance tuning**    | ⚠️ Container MPI must be configured for host hardware,   | ✅ Host MPI already tuned for hardware    |
|                           | e.g., UCX, verbs                                         |                                          |
+---------------------------+----------------------------------------------------------+------------------------------------------+
| **Container size**        | 📦 Larger - includes MPI stack                           | 📦 Smaller - no MPI inside container     |
+---------------------------+----------------------------------------------------------+------------------------------------------+
| **Use case**              | When bind-mounts are restricted, host MPI not accessible,| When bind mounts are allowed, for        |
|                           | for portability and full container isolation             | simplicity and lightweight containers    |
+---------------------------+----------------------------------------------------------+------------------------------------------+

This Howto focuses on the Bind model. As the Bind model's ``mpirun`` template
above shows, this requires two steps:

1) Know where the MPI implementation on the host is installed (``MPI_DIR``).
2) Mount/bind the host MPI into the container in a location where the
   (container's) system will be able to find libraries and binaries.

If one wishes to unlock the full MPI performance of a given HPC system, Step 2
brings up another layer of bind-mounts not yet mentioned, one that applies to
probably the majority of HPC systems. On a hardware level, many HPC systems
use fast low-latency network technology like InfiniBand. It is desirable for
MPI containers to stay on the InfiniBand highway, i.e., not to have to revert
to a slower network alternative. This requires the software translators that
link your MPI programs to the InfiniBand hardware to be available inside a
container. Such translators are UCX and UCC.

What is UCX and UCC?
^^^^^^^^^^^^^^^^^^^^

UCX (Unified Communication X) is a high-performance messaging layer that
accelerates data transfer across HPC systems. It supports transports like
InfiniBand, shared memory, TCP, and GPU interconnects. UCX acts as the
backbone for MPI implementations, accelerating data movement between nodes and
devices.

UCC (Unified Collective Communication) builds on UCX by optimizing collective
operations like broadcast, reduce, and barrier, as these are essential for
scaling parallel applications. UCC ensures that these operations run
efficiently across modern interconnects like InfiniBand and NVLink.
| Together, UCX + UCC allow MPI libraries and higher-level programs (using ``mpi4py``) to fully exploit hardware acceleration for both individual and group communications. From low-level to high-level, the stack thus reads:
| Hardware (i.e., InfiniBand, NVLink) ➡️ UCX (Transports) ➡️ UCC (Collectives) ➡️ MPI Library (e.g., OpenMPI) ➡️ Python (``mpi4py``)

Without going into more detail, the basic rule is: to leverage
high-performance messaging layers, that is, to use **InfiniBand via UCX/UCC**,
an MPI-capable container requires the exposure of UCX/UCC-related libraries at
build and runtime. Here, build time refers to the compilation of ``mpi4py``
inside Apptainer; that is, one cannot use a pre-built ``mpi4py`` that was
compiled without UCX/UCC support. Without UCX/UCC support, our MPI program (or
``mpi4py``) would fall back to slower, less capable transports like TCP-only,
or it may fail to initialize.

The following section describes the steps for building a UCX/UCC-capable
Apptainer image for the Bind-model approach.

Building a sample Bind-model container
--------------------------------------

| Scripts and supporting files are on the Ramses HPC cluster, directory:
| ``/projects/sw/Apptainer/Build_Examples/build_mpi4py/``

Step 1) Build base sandbox from Apptainer definition file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The starting point is the following SDF (Singularity/Apptainer definition
file), to be found under
``/projects/sw/Apptainer/Build_Examples/build_mpi4py/centos_stream9.def``
.. code::

   Bootstrap: docker
   From: rockylinux:9

   %labels
       Author Michael Commer @ ITCC
       Purpose "Python + mpi4py container using host MPI/RDMA stack on RHEL9"

   %environment
       export MPI_PATH=/projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0
       export PATH=${MPI_PATH}/bin:/opt/venv/bin:$PATH
       export LD_LIBRARY_PATH=${MPI_PATH}/lib:/opt/lib:/opt/ucc:/opt/ucx:/opt/gcc:\
   /opt/hwloc:/opt/libfabric:/opt/numactl:/opt/gpfs:/opt/binutils:/opt/zlib:\
   /opt/libxml2:/opt/libpciaccess:$LD_LIBRARY_PATH
       export PYTHONPATH=/opt/venv/lib/python3.9/site-packages:$PYTHONPATH

   %post
       # Install system dependencies
       dnf install -y --allowerasing \
           python3 python3-pip python3-devel \
           gcc gcc-c++ make \
           util-linux git curl ca-certificates \
           && dnf clean all

       # Create mount points for host MPI and libraries
       export MPI_PATH=/projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0
       for d in lib ucc ucx gcc hwloc libfabric numactl gpfs binutils zlib libxml2 libpciaccess; do
           mkdir -p /opt/$d
       done
       mkdir -p /etc/libibverbs.d $MPI_PATH

       # Create and activate Python virtual environment
       python3 -m venv /opt/venv
       source /opt/venv/bin/activate

   %files
       # copy specific MPI-related/RDMA driver libs into the container
       /lib64/libefa.so.1 /opt/lib/libefa.so.1
       /lib64/libibverbs.so.1 /opt/lib/libibverbs.so.1
       /lib64/libm.so.6 /opt/lib/libm.so.6
       /lib64/libnl-3.so.200 /opt/lib/libnl-3.so.200
       /lib64/libnl-route-3.so.200 /opt/lib/libnl-route-3.so.200
       /lib64/libpmi.so.0 /opt/lib/libpmi.so.0
       /lib64/libpmi2.so.0 /opt/lib/libpmi2.so.0
       /lib64/librdmacm.so.1 /opt/lib/librdmacm.so.1
       /lib64/libresolv.so.2 /opt/lib/libresolv.so.2
       /lib64/libuuid.so.1 /opt/lib/libuuid.so.1
       /lib64/libz.so.1 /opt/lib/libz.so.1
       /usr/lib64/libibverbs/libmlx5-rdmav34.so /opt/lib/libmlx5-rdmav34.so
       /usr/lib64/slurm/libslurm_pmi.so /opt/lib/libslurm_pmi.so

Building the sandbox is done as follows:
.. code:: bash

   apptainer -v build --fakeroot --sandbox sandbox_centos_stream9 centos_stream9.def

As a shortcut, you can use the helper-script command ``./build.sh bs``
instead. When done, you will see an unpacked directory tree
``sandbox_centos_stream9/``, referred to as a sandbox (image). Such a sandbox
is useful for development and debugging because you can always re-enter it via
``apptainer shell --writable sandbox_centos_stream9/`` and install/fix things
inside without rebuilding.

Step 2) Compile mpi4py inside sandbox
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now, we want to rebuild ``mpi4py`` against the **host OpenMPI+UCX+UCC** stack.
This step involves

* loading the appropriate MPI module:
  ``module purge && module load mpi/OpenMPI/4.1.5-GCC-12.3.0``
* bind-mounting MPI: ``... --bind ${MPI_PATH}:${MPI_PATH}:ro``
* bind-mounting UCX/UCC and other MPI-related libraries, for example:
  ``... --bind /projects/sw/eb/arch/zen4/software/UCX/1.14.1-GCCcore-12.3.0/lib:/opt/ucx:ro``

You can perform these steps via the helper script ``./build.sh is``, which
essentially performs all necessary bind-mounts:

.. code:: bash

   # Enter sandbox and manually compile mpi4py
   MPI_PATH=/projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0
   FMOD=mpi/OpenMPI/4.1.5-GCC-12.3.0
   module purge && module load $FMOD
   # ...
.. code:: bash

   apptainer shell --writable \
     --bind ${MPI_PATH}:${MPI_PATH}:ro \
     --bind /etc/libibverbs.d:/etc/libibverbs.d:ro \
     --bind /projects/sw/eb/arch/zen4/software/UCX/1.14.1-GCCcore-12.3.0/lib:/opt/ucx:ro \
     --bind /projects/sw/eb/arch/zen4/software/UCC/1.2.0-GCCcore-12.3.0/lib:/opt/ucc:ro \
     --bind /projects/sw/eb/arch/zen4/software/GCCcore/12.3.0/lib/../lib64:/opt/gcc:ro \
     --bind /projects/sw/eb/arch/zen4/software/hwloc/2.9.1-GCCcore-12.3.0/lib:/opt/hwloc:ro \
     --bind /projects/sw/eb/arch/zen4/software/libfabric/1.21.0-GCCcore-12.3.0/lib:/opt/libfabric:ro \
     --bind /projects/sw/eb/arch/zen4/software/numactl/2.0.16-GCCcore-12.3.0/lib:/opt/numactl:ro \
     --bind /usr/lpp/mmfs/lib:/opt/gpfs:ro \
     --bind /projects/sw/eb/arch/zen4/software/binutils/2.40-GCCcore-12.3.0/lib:/opt/binutils:ro \
     --bind /projects/sw/eb/arch/zen4/software/zlib/1.2.13-GCCcore-12.3.0/lib:/opt/zlib:ro \
     --bind /projects/sw/eb/arch/zen4/software/libxml2/2.11.4-GCCcore-12.3.0/lib:/opt/libxml2:ro \
     --bind /projects/sw/eb/arch/zen4/software/libpciaccess/0.17-GCCcore-12.3.0/lib:/opt/libpciaccess:ro \
     sandbox_centos_stream9

Once inside the sandbox (at the prompt ``Apptainer>``), launch the ``mpi4py``
compilation:

.. code:: bash

   Apptainer> LDFLAGS="-L/opt/ucx -L/opt/ucc -lucp -lucs -luct -lucc" pip install --no-binary=mpi4py --no-cache-dir --force-reinstall mpi4py

Building ``mpi4py`` inside Apptainer involves compiling Python bindings that
link against the MPI libraries and their underlying (UCX and UCC)
communication frameworks. In this case, it was found that the compiler and
linker did not find the associated UCX/UCC libraries automatically, despite
adequate settings of ``LD_LIBRARY_PATH``, as ``LD_LIBRARY_PATH`` is not
evaluated at compile time. Hence the additional passing of
``LDFLAGS="-L/opt/ucx ..."`` to ``pip install``. On other systems, and with
other versions of MPI/mpi4py etc., it may become a trial-and-error procedure
to pass the correct linking flags.
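When hunting for the libraries that must be exposed (e.g., to assemble an SDF's ``%files`` list or additional ``--bind`` mounts), the output of ``ldd`` on the built binaries can help. The following Python sketch (an illustrative helper written for this guide, not part of the build scripts) parses ``ldd``-style output into host-path/container-path pairs:

```python
# Illustrative sketch: turn "ldd" output lines into (host_path, container_path)
# pairs, e.g., as candidates for an SDF %files section. The sample text below
# mimics the typical "name => path (addr)" format of ldd output.

def ldd_to_files(ldd_output, container_dir="/opt/lib"):
    pairs = []
    for line in ldd_output.splitlines():
        parts = line.split()
        # resolved libraries look like: libfoo.so.1 => /lib64/libfoo.so.1 (0x...)
        if "=>" in parts and len(parts) >= 3 and parts[2].startswith("/"):
            host_path = parts[2]
            name = parts[0]
            pairs.append((host_path, f"{container_dir}/{name}"))
    return pairs

sample = """\
        linux-vdso.so.1 (0x00007ffd2e5f2000)
        libucs.so.0 => /opt/ucx/libucs.so.0 (0x00007fa8d88da000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f1a2c000000)
"""
for host, cont in ldd_to_files(sample):
    print(host, cont)
```

Entries without a resolved path (``linux-vdso.so.1``, ``not found`` lines) are skipped; any list generated this way should still be reviewed by hand before being baked into a definition file.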
Step 2b, optional) Verify UCX/UCC transport inside Apptainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To make sure the UCX libraries are present and contain the required symbols:

.. code:: bash

   Apptainer> nm -D /opt/ucx/libucs.so | grep ucs_mpool_params_reset

which should produce something like
``00000000000233b0 T ucs_mpool_params_reset``, where the ``T`` marks a defined
symbol. If it is missing, UCX may be outdated, incompatible, or incorrectly
linked. Afterwards, check the UCC linkage to UCX:

.. code:: bash

   Apptainer> ldd /opt/ucc/libucc.so.1 | grep libucs

which should produce something like
``libucs.so.0 => /opt/ucx/libucs.so.0 (0x00007fa8d88da000)``. Finally, you
could try out ``mpi4py``:

::

   Apptainer> python
   # Then run this 2-liner:
   >>> from mpi4py import MPI
   >>> print(f"Hello from rank {MPI.COMM_WORLD.Get_rank()} of {MPI.COMM_WORLD.Get_size()}")

Step 3) Build final image in Singularity image format (\*.sif)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When happy with the performance of the sandbox, you can build a final
immutable image:

.. code:: bash

   apptainer build centos_stream9.sif sandbox_centos_stream9

Again, this step can also be run through the helper script: ``./build.sh bf``.
Generally, keep the sandbox as long as you are still iterating and expect to
configure and debug inside the container. Create the final ``.sif`` image once
the build works and you want something stable and portable.

Step 3b, optional) Perform simple ``mpi4py`` benchmark
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can run an interactive Slurm job with two nodes in order to test
node-to-node communication using the ``mpi4py.bench`` module:

.. code:: bash

   salloc --nodes=2 --ntasks-per-node=1 --job-name=n2xt1 --time=00:30:00
   # when the interactive slurm job has started:
   module load mpi/OpenMPI/4.1.5-GCC-12.3.0
   ./build.sh r   # runs "mpi4py.bench pingpong"

Assigning one task per node forces communication across the two nodes.
The helper-script command ``./build.sh r`` is a shortcut to the command

.. code:: bash

   m=67108868 # max packet size for benchmark test
   appt_tools mpirun -n 2 python -m mpi4py.bench pingpong --max-size $m

where ``appt_tools`` is another shortcut to the above
``mpirun ... apptainer exec --bind ...`` command. The tool ``appt_tools`` is
described further below. In the current case, ``appt_tools`` tries to ease
your life by creating and launching a script ``mpirun.sh`` containing the
following lengthy ``mpirun`` command:

.. code:: bash

   MPI_DIR=/projects/sw/eb/arch/zen4/software/OpenMPI/4.1.5-GCC-12.3.0
   FSIF=/projects/sw/Apptainer/SIF-files/centos_stream9+mpi4py.sif # copy of the final sif we built above
   FEXE="python -m mpi4py.bench pingpong --max-size 67108868" # command to run inside Apptainer
   mpirun -n 2 apptainer exec \
     --bind ${MPI_DIR}:${MPI_DIR}:ro \
     --bind /etc/libibverbs.d:/etc/libibverbs.d:ro \
     --bind /projects/sw/eb/arch/zen4/software/UCX/1.14.1-GCCcore-12.3.0/lib:/opt/ucx:ro \
     --bind /projects/sw/eb/arch/zen4/software/UCC/1.2.0-GCCcore-12.3.0/lib:/opt/ucc:ro \
     --bind /projects/sw/eb/arch/zen4/software/GCCcore/12.3.0/lib/../lib64:/opt/gcc:ro \
     --bind /projects/sw/eb/arch/zen4/software/hwloc/2.9.1-GCCcore-12.3.0/lib:/opt/hwloc:ro \
     --bind /projects/sw/eb/arch/zen4/software/libfabric/1.21.0-GCCcore-12.3.0/lib:/opt/libfabric:ro \
     --bind /projects/sw/eb/arch/zen4/software/numactl/2.0.16-GCCcore-12.3.0/lib:/opt/numactl:ro \
     --bind /projects/sw/eb/arch/zen4/software/binutils/2.40-GCCcore-12.3.0/lib:/opt/binutils:ro \
     --bind /projects/sw/eb/arch/zen4/software/zlib/1.2.13-GCCcore-12.3.0/lib:/opt/zlib:ro \
     --bind /projects/sw/eb/arch/zen4/software/libxml2/2.11.4-GCCcore-12.3.0/lib:/opt/libxml2:ro \
     --bind /projects/sw/eb/arch/zen4/software/libpciaccess/0.17-GCCcore-12.3.0/lib:/opt/libpciaccess:ro \
     --bind /usr/lpp/mmfs/lib:/opt/gpfs:ro \
     $FSIF $FEXE

The pingpong test shows this kind of output:

::

   # MPI PingPong Test
   # Size [B]  Bandwidth [MB/s] | Time Mean [s] ± StdDev [s]     Samples
            1              0.62 | 1.6172415e-06 ± 1.6920e-07       10000
            2              1.23 | 1.6257193e-06 ± 1.9910e-07       10000
            4              2.47 | 1.6217059e-06 ± 2.0745e-07       10000
            8              4.85 | 1.6494426e-06 ± 1.5699e-07       10000
   ...
      1048576          11412.46 | 9.1879942e-05 ± 5.1434e-07        1000
      2097152          11831.73 | 1.7724815e-04 ± 4.8272e-07          10
      4194304          12080.45 | 3.4719765e-04 ± 3.8304e-07          10
      8388608          12209.32 | 6.8706595e-04 ± 3.6534e-07          10
     16777216          12272.32 | 1.3670782e-03 ± 4.2741e-07          10
     33554432          12296.29 | 2.7288256e-03 ± 6.3523e-06          10
     67108864          12322.43 | 5.4460724e-03 ± 7.1724e-07          10

where a convergence of the value *Bandwidth [MB/s]* towards ~12000 MB/s
indicates the expected bandwidth with UCX/UCC transport (this holds for a
modest load on the compute nodes).

A side remark is that one may see the following kind of message upon launching
a containerized MPI-parallel application:

.. code-block:: text

   Open MPI's OFI driver detected multiple equidistant NICs from the current
   process, but had insufficient information to ensure MPI processes fairly
   pick a NIC for use. This may negatively impact performance. A more modern
   PMIx server is necessary to resolve this issue.

   Note: This message is displayed only when the OFI component's verbosity
   level is -79797552 or higher.

OpenMPI uses PMIx to determine process locality, such as which CPU socket or
NUMA domain a process is on. If PMIx cannot provide that info, e.g., due to an
older version or missing integration, OpenMPI's OFI (libfabric) transport
layer fails to make smart NIC choices and falls back to round-robin or default
behavior. Hence, this is only a performance advisory about *seeing* multiple
NICs (e.g., InfiniBand interfaces) that are equally *close* to the process.
There is an internal setting

.. code:: bash

   export FI_PROVIDER=verbs
   export FI_VERBS_IFACE=ib0

carried out by ``appt_tools`` prior to launching ``mpirun``, which should help
avoid the OFI/NIC-related advisory.
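As a sanity check of the pingpong numbers reported above, the bandwidth column follows directly from the size and mean-time columns (bandwidth [MB/s] = size [B] / time [s] / 10\ :sup:`6`); for the largest packet:

```python
# Sanity check on the pingpong output: bandwidth [MB/s] = size [B] / time [s] / 1e6.
# The values are taken from the last row of the benchmark table above.
size_bytes = 67108864
mean_time_s = 5.4460724e-03
bandwidth_mb_s = size_bytes / mean_time_s / 1e6
print(f"{bandwidth_mb_s:.2f} MB/s")  # matches the reported 12322.43 MB/s
```

The same arithmetic applied to the small-message rows shows bandwidth there is latency-dominated (~1.6 µs per message), which is why throughput only saturates for large packets.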
appt_tools - a helper tool for containerized/apptainerized mpirun/sbatch jobs
-----------------------------------------------------------------------------

One of the main purposes of ``appt_tools`` is to provide shortcuts to common
apptainer command lines that involve ``mpirun`` through the Bind model:

::

   $ mpirun -n <NRANKS> apptainer exec --bind "$MPI_DIR" --bind ...

with potentially numerous ``--bind`` entries. While ``appt_tools`` has other
functionalities, here only the mpirun feature is described. This feature is
activated via the ``mpirun`` subcommand:

.. code:: bash

   appt_tools [OPTIONS] mpirun <MPIRUN-ARGUMENTS>

which creates and instantly launches a run script ``mpirun.sh``. You can use
the option ``-o <SCRIPT-NAME>`` to choose a different output script name. If
you prefer not to launch the script right away, use

.. code:: bash

   appt_tools [OPTIONS] mpirunx <MPIRUN-ARGUMENTS>

which will only create ``mpirun.sh``. The helper-script ``build.sh`` contains
two examples:

.. code:: bash

   #
   # appt_tools - mpirun Example 1: mpi4py.bench
   m=67108868 # max packet size for benchmark test
   # writes a bash script mpi4py.bench.sh and launches it
   appt_tools -V -o mpi4py.bench.sh mpirun -n 2 python -m mpi4py.bench pingpong --max-size $m
   #
   # appt_tools - mpirun Example 2: simple MPI ring-communication test
   # writes a bash script mpirun.sh and launches it
   exe=/projects/sw/Apptainer/usr/bin/mpiinitst.4.1.5-GCC-12.3.0.exe
   appt_tools -V mpirun -n 2 $exe

which you can run with ``./build.sh r1`` and ``./build.sh r2``, respectively.

appt_tools runtime parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the output of ``./build.sh r1`` or ``./build.sh r2``, one notes the
settings
.. code-block:: text

   MPI-module: mpi/OpenMPI/4.1.5-GCC-12.3.0 (already-loaded MPI-module mpi/* in parent-shell)
   MPI_DIR:    /projects/.../OpenMPI/4.1.5-GCC-12.3.0 (module prepend-path PATH)
   SIFFILE:    ./centos_stream9+mpi4py.sif (Apptainer/Singularity-image-format file)

``MPI_DIR`` is required to mount the host's MPI into Apptainer and is
evaluated from the setting for ``MPI-module``. ``SIFFILE`` provides the image
file (``<IMAGE>``) for the ``apptainer exec`` command. Different options exist
for providing these values to ``appt_tools``, as outlined in the following.

appt_tools configuration (\*.ini) file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section describes runtime parameters that are needed when launching
*apptainerized* MPI programs. Different ways exist for providing runtime
parameters to ``appt_tools``. All runtime parameters have in common that they
can be set via command-line arguments or configuration (``*.ini``) file
settings. The ``appt_tools`` configuration (``*.ini``) file format is not
exactly the same as the common INI file format, as it has the general format

::

   [VAR] # comment
   value

or

::

   [VAR] # comment
   list of values
   ...

[mod]: MPI-module / MPI_DIR
^^^^^^^^^^^^^^^^^^^^^^^^^^^

| The host's MPI-module and its associated bind/mount directory ``MPI_DIR`` can be provided in different ways. The following 5 ways are evaluated in the listed order until a setting for MPI-module is found.

- Command-line option ``-m`` / ``--module``: You can provide the MPI-module
  via the command line:

  .. code-block:: bash

     appt_tools -m mpi/OpenMPI/4.1.5-GCC-12.3.0 mpirun ...

  This will let ``appt_tools`` extract ``MPI_DIR`` from the PATH information
  of the specified module, here ``mpi/OpenMPI/4.1.5-GCC-12.3.0``.

- ``module load`` line present in **slurm** file: Only applicable for runs
  that involve the ``sbatch``/``sbatchx`` subcommand. See the ``appt_tools``
  documentation, to be invoked via ``appt_tools -doc``.

- Setting ``[mod]`` in the ``appt_tools`` configuration file.
  The configuration file is a file with the file suffix ``ini``. An adequate
  entry would then be

  .. code-block:: text

     [mod] # MPI-module
     mpi/OpenMPI/4.1.5-GCC-12.3.0

- MPI-module already loaded: If the above steps do not produce a setting for
  an MPI-module, the next attempt consists of figuring out whether an adequate
  MPI-module is already loaded in the current environment. This is done
  internally via the ``module list`` command.

- Environment variable ``MOD_MPI``: The last option for providing MPI-module
  information consists of a variable setting like (in bash)

  .. code:: bash

     export MOD_MPI=mpi/OpenMPI/4.1.5-GCC-12.3.0

  If always working with the same MPI-module, setting ``MOD_MPI`` in your
  shell initialization file (e.g., ``~/.bashrc``) might be useful.

[sif]: IMAGE file (\*.sif)
^^^^^^^^^^^^^^^^^^^^^^^^^^

| Setting the Apptainer image file (``<IMAGE>``) can also be done in different ways, which are evaluated in the following order.

- Command-line option ``-s`` / ``--sif``: You can provide an absolute or
  relative file path for ``<IMAGE>``:

  .. code-block:: bash

     appt_tools -s /projects/sw/Apptainer/SIF-files/centos_stream9+mpi4py.sif mpirun ...

- Setting ``[sif]`` in the ``appt_tools`` configuration (``*.ini``) file. An
  adequate entry would be

  .. code-block:: text

     [sif] # SIFFILE
     /projects/sw/Apptainer/SIF-files/centos_stream9+mpi4py.sif

- Find the newest (if multiple files are present) ``*.sif`` file in the
  current directory.

- Environment variable ``APT_SIFFILE``: The last option for providing
  ``SIFFILE`` consists of a variable setting like (in bash)

  .. code-block:: bash

     export APT_SIFFILE=/path/to/your/file.sif

  If repeatedly working with the same ``SIFFILE``, setting ``APT_SIFFILE`` in
  your shell initialization file (e.g., ``~/.bashrc``) might be useful.
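To illustrate the non-standard ``*.ini`` format described above, here is a minimal Python sketch of a parser for it. This is an assumption inferred from the examples shown (a ``[VAR] # comment`` header line followed by one or more value lines); the actual ``appt_tools`` parser may behave differently:

```python
# Minimal sketch of a parser for the appt_tools-style config format shown
# above. This is an illustration inferred from the documented examples; the
# real appt_tools parser may differ.

def parse_appt_ini(text):
    config = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("["):
            # header like "[mod] # MPI-module": keep only the section name
            current = line.split("]", 1)[0].lstrip("[")
            config[current] = []
        elif current is not None:
            config[current].append(line)
    return config

sample = """\
[mod] # MPI-module
mpi/OpenMPI/4.1.5-GCC-12.3.0
[sif] # SIFFILE
/projects/sw/Apptainer/SIF-files/centos_stream9+mpi4py.sif
"""
print(parse_appt_ini(sample))
```

Each section is parsed into a list of values, matching the "list of values" variant of the format; a single-value entry is simply a one-element list.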