PyMAPDL on HPC clusters#

Introduction#

PyMAPDL communicates with MAPDL using the gRPC protocol. This protocol offers the many advantages and features described in the PyMAPDL project. One of these features is that the PyMAPDL and MAPDL processes are not required to run on the same machine. This opens the door to many configurations, depending on whether or not you run both of them on the HPC compute nodes. Additionally, you might be able to interact with them (interactive mode) or not (batch mode).

PyMAPDL takes advantage of HPC clusters to launch MAPDL instances with increased resources. PyMAPDL automatically sets these MAPDL instances to read the scheduler job configuration (which includes machines, number of CPUs, and memory), which allows MAPDL to use all the resources allocated to that job. For more information, see Tight integration between MAPDL and the HPC scheduler.

The following configurations are supported:

  • Batch job submission from the login node

  • Interactive MAPDL instance launched from the login node

Batch job submission from the login node#

Many HPC clusters allow their users to log into a machine using ssh, vnc, rdp, or similar technologies and then submit a job to the cluster from there. This login machine, sometimes known as the head node or entrypoint node, might be a virtual machine (VDI/VM).

In such cases, once the Python virtual environment with PyMAPDL is set up and accessible to all the compute nodes, launching a PyMAPDL job from the login node is as easy as using the sbatch command. When the sbatch command is used, PyMAPDL runs and launches an MAPDL instance on the compute nodes. No changes are needed to a PyMAPDL script to run it on a SLURM cluster.

First, activate the virtual environment in the current terminal.

user@entrypoint-machine:~$ export VENV_PATH=/my/path/to/the/venv
user@entrypoint-machine:~$ source $VENV_PATH/bin/activate

Once the virtual environment is activated, you can launch any Python script that has the proper Python shebang (#!/usr/bin/env python3).

For instance, assume that you want to launch the following main.py Python script:

main.py#
#!/usr/bin/env python3

from ansys.mapdl.core import launch_mapdl

mapdl = launch_mapdl(run_location="/home/ubuntu/tmp/tmp/mapdl", loglevel="debug")

print(mapdl.prep7())
print(f'Number of CPU: {mapdl.get_value("ACTIVE", 0, "NUMCPU")}')

mapdl.exit()

You can run this command in your console:

(venv) user@entrypoint-machine:~$ sbatch main.py

Alternatively, you can remove the shebang from the Python file and wrap the Python call using the sbatch --wrap argument:

(venv) user@entrypoint-machine:~$ sbatch --wrap="python main.py"

Additionally, you can change the number of cores used in your job by setting the PYMAPDL_NPROC environment variable to the desired value.

(venv) user@entrypoint-machine:~$ PYMAPDL_NPROC=4 sbatch main.py

For more applicable environment variables, see Environment variables.

You can also add sbatch options to the command. For instance, to launch a PyMAPDL job that starts a four-core MAPDL instance on a 10-CPU SLURM job, you can run this command:

(venv) user@entrypoint-machine:~$ PYMAPDL_NPROC=4 sbatch --partition=qsmall --nodes=10 --ntasks-per-node=1 main.py

Using a submission script#

If you need to customize your PyMAPDL job further, you can create a SLURM submission script for submitting it. In this case, you must create two files:

  • Python script with the PyMAPDL code

  • Bash script that activates the virtual environment and calls the Python script

main.py#
from ansys.mapdl.core import launch_mapdl

# Number of processors must be lower than the
# number of CPU allocated for the job.
mapdl = launch_mapdl(nproc=10)

mapdl.prep7()
n_proc = mapdl.get_value("ACTIVE", 0, "NUMCPU")
print(f"Number of CPU: {n_proc}")

mapdl.exit()

job.sh#
#!/bin/bash
# Set SLURM options
#SBATCH --job-name=ansys_job            # Job name
#SBATCH --partition=qsmall              # Specify the queue/partition name
#SBATCH --nodes=5                       # Number of nodes
#SBATCH --ntasks-per-node=2             # Number of tasks (cores) per node
#SBATCH --time=04:00:00                 # Set a time limit for the job (optional but recommended)

# Set env vars
export MY_ENV_VAR=VALUE

# Activate Python virtual environment
source /home/user/.venv/bin/activate
# Call Python script
python main.py

To start the simulation, run this command:

user@machine:~$ sbatch job.sh

In this case, the Python virtual environment does not need to be activated before submission since it is activated later in the script.

The expected output of the job follows:

Number of CPU: 10.0

The bash script allows you to customize the environment before running the Python script. For example, it can set environment variables, move files to different directories, and print messages to verify that your configuration is correct.

Interactive MAPDL instance launched from the login node#

Starting the instance#

If you are already logged in to a login node, you can launch an MAPDL instance as a SLURM job and connect to it. To accomplish this, run these commands on your login node.

>>> from ansys.mapdl.core import launch_mapdl
>>> mapdl = launch_mapdl(launch_on_hpc=True)

PyMAPDL submits a job to the scheduler using the appropriate commands. In the case of SLURM, it uses the sbatch command with the --wrap argument to pass the MAPDL launch command line. You can specify other scheduler arguments using the scheduler_options argument as a Python dict:

>>> from ansys.mapdl.core import launch_mapdl
>>> scheduler_options = {"nodes": 10, "ntasks-per-node": 2}
>>> mapdl = launch_mapdl(launch_on_hpc=True, nproc=20, scheduler_options=scheduler_options)

Note

PyMAPDL cannot infer the number of CPUs that you are requesting from the scheduler. Hence, you must specify this value using the nproc argument.

The double dash (--) used in the long form of scheduler options is added automatically if PyMAPDL detects that it is missing and the specified option is longer than one character. For instance, the ntasks-per-node argument is submitted as --ntasks-per-node.

Alternatively, you can pass the scheduler options as a single Python string (str):

>>> from ansys.mapdl.core import launch_mapdl
>>> scheduler_options = "-N 10"
>>> mapdl = launch_mapdl(launch_on_hpc=True, scheduler_options=scheduler_options)

Warning

Because PyMAPDL is already using the --wrap argument, this argument cannot be used again.

The value of each scheduler argument is wrapped in single quotes ('). This might cause parsing issues and can make the job fail after it has been successfully submitted.

PyMAPDL passes all the user's environment variables to the new job and to the MAPDL instance. This is usually convenient because many environment variables are needed to run the job or the MAPDL command. For instance, the license server is normally stored in the ANSYSLMD_LICENSE_FILE environment variable. If you prefer not to pass these environment variables to the job, use the SLURM argument --export to specify the desired environment variables. For more information, see the SLURM documentation.
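
For example, this sketch assumes that the export key of the scheduler_options dictionary is passed through to sbatch as --export, so only the listed variables reach the job (MY_ENV_VAR is only illustrative):

>>> from ansys.mapdl.core import launch_mapdl
>>> # Only pass the license server variable and MY_ENV_VAR to the job
>>> scheduler_options = {"export": "ANSYSLMD_LICENSE_FILE,MY_ENV_VAR"}
>>> mapdl = launch_mapdl(launch_on_hpc=True, nproc=4, scheduler_options=scheduler_options)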

Working with the instance#

Once the Mapdl object has been created, it does not differ from a normal Mapdl instance. You can retrieve the IP address of the MAPDL instance as well as its hostname:

>>> mapdl.ip
'123.45.67.89'
>>> mapdl.hostname
'node0'

You can also retrieve the SLURM job ID:

>>> mapdl.jobid
10001

If you want to check whether the instance has been launched using a scheduler, you can use the mapdl_on_hpc attribute:

>>> mapdl.mapdl_on_hpc
True

Sharing files#

Most HPC clusters share the login node's filesystem with the compute nodes, which means that you do not need any extra work to upload or download files to the MAPDL instance. You only need to copy them to the location where MAPDL is running. You can obtain this location with the directory attribute.
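
For example, the returned path here is hypothetical:

>>> mapdl.directory
'/home/user/mapdl_jobs/job_01'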

If no location is specified in the launch_mapdl() function, a temporary location is selected. It is a good idea to set the run_location argument to a directory that is accessible from all the compute nodes. Normally, anything under /home/user is available to all compute nodes. If you are unsure where you should launch MAPDL, contact your cluster administrator.

Additionally, you can use the upload and download methods to upload and download files to and from the MAPDL instance, respectively. You do not need ssh or a similar connection. However, for large files, you might want to consider alternatives.
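
A minimal sketch of both methods, using hypothetical file names:

>>> mapdl.upload("input_geometry.inp")  # copy a local file to the MAPDL directory
>>> mapdl.download("file.rst")  # copy a file from the MAPDL directory to the local one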

Exiting MAPDL#

Exiting MAPDL, either intentionally or unintentionally, stops the job. This behavior occurs because MAPDL is the main process of the job. Thus, when it finishes, the scheduler considers the job complete.

To exit MAPDL, you can use the exit() method. This method exits MAPDL and sends a signal to the scheduler to cancel the job.

mapdl.exit()

If the Python process running PyMAPDL finishes without errors and you have not called the exit() method, the garbage collector kills the MAPDL instance and its job. This is intended to save resources.

If you prefer that the job not be killed, set the following attribute in the Mapdl class:

mapdl.finish_job_on_exit = False

In this case, you should set a time limit on your job to avoid it running longer than needed.
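
For example, this sketch assumes that the time key of the scheduler_options dictionary is passed through to sbatch as --time, so the scheduler cancels the job after four hours even if the MAPDL instance is still running:

>>> from ansys.mapdl.core import launch_mapdl
>>> scheduler_options = {"time": "04:00:00"}  # job time limit of four hours
>>> mapdl = launch_mapdl(launch_on_hpc=True, nproc=4, scheduler_options=scheduler_options)
>>> mapdl.finish_job_on_exit = False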

Handling crashes on an HPC#

If MAPDL crashes while running on an HPC cluster, the job finishes right away. In this case, PyMAPDL is disconnected from the MAPDL instance. PyMAPDL tries to reconnect to the MAPDL instance up to 5 times, waiting for up to 5 seconds. If unsuccessful, you might get an error like this:

MAPDL server connection terminated unexpectedly while running:
/INQUIRE,,DIRECTORY,,
called by:
_send_command

Suggestions:
MAPDL *might* have died because it executed a not-allowed command or ran out of memory.
Check the MAPDL command output for more details.
Open an issue on GitHub if you need assistance: https://github.com/ansys/pymapdl/issues
Error:
failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)
Full error:
<_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)"
debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-10-24T08:25:04.054559811+00:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)"}"
>

The data of that job is available in the MAPDL working directory, which is given by the directory attribute. For this reason, you should set the run location using the run_location argument.

While handling this exception, PyMAPDL also cancels the job to avoid leaking resources. Therefore, the only option is to start a new instance by launching a new job using the launch_mapdl() function.
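
For example, this sketch relaunches the instance with an explicit run location; the path is hypothetical:

>>> from ansys.mapdl.core import launch_mapdl
>>> mapdl = launch_mapdl(launch_on_hpc=True, nproc=10, run_location="/home/user/mapdl_jobs/job_02")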

Use case on a SLURM cluster#

Assume that a user wants to start a remote MAPDL instance in an HPC cluster and interact with it. The user wants to request 10 nodes with 1 task per node (to avoid clashes between MAPDL instances) and 64 GB of RAM. Because of administration logistics, the user must use the machines in the supercluster01 partition. To have PyMAPDL launch such an instance on SLURM, run the following code:

from ansys.mapdl.core import launch_mapdl
from ansys.mapdl.core.examples import vmfiles

scheduler_options = {
    "nodes": 10,
    "ntasks-per-node": 1,
    "partition": "supercluster01",
    "memory": 64,
}
mapdl = launch_mapdl(launch_on_hpc=True, nproc=10, scheduler_options=scheduler_options)

num_cpu = mapdl.get_value("ACTIVE", 0, "NUMCPU")  # It should be equal to 10

mapdl.clear()  # Not strictly needed.
mapdl.prep7()

# Run an MAPDL script
mapdl.input(vmfiles["vm1"])

# Let's solve again to get the solve printout
mapdl.slashsolu()
output = mapdl.solve()
print(output)

mapdl.exit()  # Kill the MAPDL instance

PyMAPDL automatically sets MAPDL to read the job configuration (including machines, number of CPUs, and memory), which allows MAPDL to use all the resources allocated to that job.

Tight integration between MAPDL and the HPC scheduler#

Since v0.68.5, PyMAPDL can take advantage of the tight integration between the scheduler and MAPDL to read the job configuration and launch an MAPDL instance that can use all the resources allocated to that job. For instance, if a SLURM job has allocated 8 nodes with 4 cores each, then PyMAPDL launches an MAPDL instance that uses 32 cores spanning those 8 nodes.
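
For instance, a script submitted as a SLURM job can rely on this detection and launch MAPDL without specifying any resources. This is a minimal sketch, assuming it runs inside a SLURM allocation:

from ansys.mapdl.core import launch_mapdl

# Inside a SLURM job, PyMAPDL reads the scheduler's job configuration
# (machines, CPUs, and memory) and sets up MAPDL to use all of it.
mapdl = launch_mapdl()

print(f'Number of CPU: {mapdl.get_value("ACTIVE", 0, "NUMCPU")}')

mapdl.exit()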

This behavior can be turned off by setting the PYMAPDL_RUNNING_ON_HPC environment variable to 'false' or by passing the detect_hpc=False argument to the launch_mapdl() function.

Alternatively, you can override these settings by either specifying custom settings in the launch_mapdl() function’s arguments or using specific environment variables. For more information, see Environment variables.
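
For example, this sketch shows both ways of disabling the detection; either one is sufficient:

import os

from ansys.mapdl.core import launch_mapdl

# Option 1: disable detection through the environment variable
# (set it before launching MAPDL).
os.environ["PYMAPDL_RUNNING_ON_HPC"] = "false"

# Option 2: disable detection through the launch_mapdl() argument.
mapdl = launch_mapdl(detect_hpc=False)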