PyMAPDL on HPC clusters#
Introduction#
PyMAPDL communicates with MAPDL using the gRPC protocol.
This protocol offers the many advantages and features described in the PyMAPDL project.
One of these features is that PyMAPDL and MAPDL do not need to run on the same machine.
This possibility opens the door to many configurations, depending
on whether or not you run them both on the HPC compute nodes.
Additionally, you might be able to interact with them (interactive mode) or not (batch mode).
PyMAPDL takes advantage of HPC clusters to launch MAPDL instances with increased resources. PyMAPDL automatically sets these MAPDL instances to read the scheduler job configuration (which includes machines, number of CPUs, and memory), which allows MAPDL to use all the resources allocated to that job. For more information, see Tight integration between MAPDL and the HPC scheduler.
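For example, here is a minimal sketch, assuming the script runs inside an already-allocated SLURM job, in which launch_mapdl() picks up the allocation without any extra arguments:
# Minimal sketch: run inside a SLURM job allocation so that launch_mapdl()
# reads the scheduler configuration and MAPDL uses all allocated cores.
from ansys.mapdl.core import launch_mapdl

mapdl = launch_mapdl()
print(f'Number of CPU: {mapdl.get_value("ACTIVE", 0, "NUMCPU")}')
mapdl.exit()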
The following sections describe the supported configurations.
Batch job submission from the login node#
Many HPC clusters allow their users to log in to a machine using
ssh, vnc, rdp, or similar technologies and then submit a job
to the cluster from there.
This login machine, sometimes known as the head node or entrypoint node,
might be a virtual machine (VDI/VM).
In such cases, once the Python virtual environment with PyMAPDL is set up
and accessible to all the compute nodes, launching a
PyMAPDL job from the login node is as simple as using the sbatch command.
When the sbatch command is used, PyMAPDL runs and launches an MAPDL instance on
the compute nodes.
No changes are needed to a PyMAPDL script to run it on a SLURM cluster.
First, activate the virtual environment in the current terminal:
user@entrypoint-machine:~$ export VENV_PATH=/my/path/to/the/venv
user@entrypoint-machine:~$ source $VENV_PATH/bin/activate
Once the virtual environment is activated, you can launch any Python
script that has the proper Python shebang (#!/usr/bin/env python3).
For instance, assume that you want to launch the following main.py
Python script:
#!/usr/bin/env python3
from ansys.mapdl.core import launch_mapdl
mapdl = launch_mapdl(run_location="/home/ubuntu/tmp/tmp/mapdl", loglevel="debug")
print(mapdl.prep7())
print(f'Number of CPU: {mapdl.get_value("ACTIVE", 0, "NUMCPU")}')
mapdl.exit()
You can run this command in your console:
(venv) user@entrypoint-machine:~$ sbatch main.py
Alternatively, you can remove the shebang from the Python file and call the Python executable explicitly:
(venv) user@entrypoint-machine:~$ sbatch python main.py
Additionally, you can change the number of cores used in your
job by setting the PYMAPDL_NPROC
environment variable to the desired value.
(venv) user@entrypoint-machine:~$ PYMAPDL_NPROC=4 sbatch main.py
For more applicable environment variables, see Environment variables.
You can also add sbatch options to the command.
For instance, to launch a PyMAPDL job that starts a four-core MAPDL instance on a 10-CPU SLURM job, you can run this command:
(venv) user@entrypoint-machine:~$ PYMAPDL_NPROC=4 sbatch --partition=qsmall --nodes=10 --ntasks-per-node=1 main.py
Using a submission script#
If you need to customize your PyMAPDL job further, you can create a SLURM submission script and submit that instead. In this case, you must create two files:
A Python script with the PyMAPDL code
A Bash script that activates the virtual environment and calls the Python script
from ansys.mapdl.core import launch_mapdl
# The number of processors must not exceed the
# number of CPUs allocated to the job.
mapdl = launch_mapdl(nproc=10)
mapdl.prep7()
n_proc = mapdl.get_value("ACTIVE", 0, "NUMCPU")
print(f"Number of CPU: {n_proc}")
mapdl.exit()
#!/bin/bash
# Set SLURM options
#SBATCH --job-name=ansys_job # Job name
#SBATCH --partition=qsmall # Specify the queue/partition name
#SBATCH --nodes=5 # Number of nodes
#SBATCH --ntasks-per-node=2 # Number of tasks (cores) per node
#SBATCH --time=04:00:00 # Set a time limit for the job (optional but recommended)
# Set env vars
export MY_ENV_VAR=VALUE
# Activate Python virtual environment
source /home/user/.venv/bin/activate
# Call Python script
python main.py
To start the simulation, run this command:
user@machine:~$ sbatch job.sh
In this case, the Python virtual environment does not need to be activated before submission since it is activated later in the script.
The expected output of the job follows:
Number of CPU: 10.0
The bash script allows you to customize the environment before running the Python script. For example, it can set environment variables, move files to different directories, and print messages to confirm that your configuration is correct.
Interactive MAPDL instance launched from the login node#
Starting the instance#
If you are already logged in to a login node, you can launch an MAPDL instance as a SLURM job and connect to it. To accomplish this, run these commands on your login node:
>>> from ansys.mapdl.core import launch_mapdl
>>> mapdl = launch_mapdl(launch_on_hpc=True)
PyMAPDL submits a job to the scheduler using the appropriate commands.
In the case of SLURM, it uses the sbatch command with the --wrap argument
to pass the MAPDL command line to start.
Other scheduler arguments can be specified using the scheduler_options
argument as a Python dict:
>>> from ansys.mapdl.core import launch_mapdl
>>> scheduler_options = {"nodes": 10, "ntasks-per-node": 2}
>>> mapdl = launch_mapdl(launch_on_hpc=True, nproc=20, scheduler_options=scheduler_options)
Note
PyMAPDL cannot infer the number of CPUs that you are requesting from the scheduler.
Hence, you must specify this value using the nproc
argument.
The double minus (--) common in the long version of some scheduler arguments
is added automatically if PyMAPDL detects it is missing and the specified
argument is longer than one character.
For instance, the ntasks-per-node argument is submitted as --ntasks-per-node.
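As an illustration, and assuming the dict form shown earlier, the keys would be translated roughly as follows:
# Illustration of the prefixing rule (an assumption about the exact output):
scheduler_options = {
    "nodes": 10,           # longer than one character, submitted as --nodes
    "ntasks-per-node": 2,  # submitted as --ntasks-per-node
}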
Alternatively, a single Python string (str) can be submitted:
>>> from ansys.mapdl.core import launch_mapdl
>>> scheduler_options = "-N 10"
>>> mapdl = launch_mapdl(launch_on_hpc=True, scheduler_options=scheduler_options)
Warning
Because PyMAPDL is already using the --wrap
argument, this argument
cannot be used again.
The values of each scheduler argument are wrapped in single quotes ('). This might cause parsing issues that make the job fail after a successful submission.
PyMAPDL passes all the environment variables of the
user to the new job and to the MAPDL instance.
This is usually convenient because many environment variables are
needed to run the job or the MAPDL command.
For instance, the license server is normally stored in the ANSYSLMD_LICENSE_FILE
environment variable.
If you prefer not to pass these environment variables to the job, use the SLURM
--export argument to specify the desired environment variables.
For more information, see the SLURM documentation.
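For example, here is a hedged sketch that passes --export through scheduler_options; the variable names are only examples:
# Sketch: export only selected environment variables to the job (the names
# here are examples; the license server variable is usually required).
from ansys.mapdl.core import launch_mapdl

scheduler_options = {"export": "ANSYSLMD_LICENSE_FILE,MY_ENV_VAR"}
mapdl = launch_mapdl(launch_on_hpc=True, nproc=4, scheduler_options=scheduler_options)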
Working with the instance#
Once the Mapdl
object has been created,
it does not differ from a normal Mapdl
instance.
You can retrieve the IP of the MAPDL instance as well as its hostname:
>>> mapdl.ip
'123.45.67.89'
>>> mapdl.hostname
'node0'
You can also retrieve the SLURM job ID:
>>> mapdl.jobid
10001
If you want to check whether the instance has been launched using a scheduler,
you can use the mapdl_on_hpc
attribute:
>>> mapdl.mapdl_on_hpc
True
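For example, here is a sketch that is not part of the PyMAPDL API but uses these attributes to query SLURM about the job backing the instance:
import subprocess

# Ask SLURM about the job that hosts this MAPDL instance.
if mapdl.mapdl_on_hpc:
    result = subprocess.run(
        ["squeue", "-j", str(mapdl.jobid)], capture_output=True, text=True
    )
    print(result.stdout)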
Exiting MAPDL#
Exiting MAPDL, either intentionally or unintentionally, stops the job. This behavior occurs because MAPDL is the main process of the job. Thus, when it finishes, the scheduler considers the job done.
To exit MAPDL, you can use the exit()
method.
This method exits MAPDL and sends a signal to the scheduler to cancel the job.
mapdl.exit()
When the Python process running PyMAPDL finishes without errors and you have not
called the exit() method, the garbage collector
kills the MAPDL instance and its job. This is intended to save resources.
If you prefer that the job is not killed, set the following attribute in the
Mapdl
class:
mapdl.finish_job_on_exit = False
In this case, you should set a timeout in your job to avoid having it run longer than needed.
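For example, here is a sketch that combines this attribute with a wall-time limit requested at launch, assuming the "time" key maps to the sbatch --time option:
# Sketch: request a wall-time limit so a job that outlives the Python process
# cannot run indefinitely. The two-hour value is only an example.
from ansys.mapdl.core import launch_mapdl

scheduler_options = {"time": "02:00:00"}
mapdl = launch_mapdl(launch_on_hpc=True, nproc=4, scheduler_options=scheduler_options)
mapdl.finish_job_on_exit = False  # keep the SLURM job when this object is deleted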
Handling crashes on an HPC#
If MAPDL crashes while running on an HPC cluster, the job finishes right away. In this case, PyMAPDL loses its connection to MAPDL. PyMAPDL tries to reconnect to the MAPDL instance up to five times, waiting up to five seconds. If unsuccessful, you might get an error like this:
MAPDL server connection terminated unexpectedly while running:
/INQUIRE,,DIRECTORY,,
called by:
_send_command
Suggestions:
MAPDL *might* have died because it executed a not-allowed command or ran out of memory.
Check the MAPDL command output for more details.
Open an issue on GitHub if you need assistance: https://github.com/ansys/pymapdl/issues
Error:
failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)
Full error:
<_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-10-24T08:25:04.054559811+00:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)"}"
>
The data of that job remains available in the job's run directory.
You should set this location explicitly using the run_location argument.
While handling this exception, PyMAPDL also cancels the job to avoid leaking resources.
Therefore, the only option is to start a new instance by launching a new job using
the launch_mapdl() function.
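A hedged sketch of this recovery pattern follows; the exception class and the run location are illustrative assumptions:
from ansys.mapdl.core import launch_mapdl
from ansys.mapdl.core.errors import MapdlExitedError  # assumed exception type

try:
    output = mapdl.solve()
except MapdlExitedError:
    # The crashed instance and its job are gone; launch a fresh job instead.
    mapdl = launch_mapdl(
        launch_on_hpc=True,
        nproc=10,
        run_location="/path/to/shared/scratch",  # hypothetical path
    )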
User case on a SLURM cluster#
Assume that a user wants to start a remote MAPDL instance on an HPC cluster
and interact with it.
The user would like to request 10 nodes with one task per node (to avoid clashes
between MAPDL instances), as well as 64 GB of RAM.
Because of administrative constraints, the user must use the machines in
the supercluster01 partition.
To make PyMAPDL launch an instance like that on SLURM, run the following code:
from ansys.mapdl.core import launch_mapdl
from ansys.mapdl.core.examples import vmfiles
scheduler_options = {
"nodes": 10,
"ntasks-per-node": 1,
"partition": "supercluster01",
"memory": 64,
}
mapdl = launch_mapdl(launch_on_hpc=True, nproc=10, scheduler_options=scheduler_options)
num_cpu = mapdl.get_value("ACTIVE", 0, "NUMCPU") # It should be equal to 10
mapdl.clear() # Not strictly needed.
mapdl.prep7()
# Run an MAPDL script
mapdl.input(vmfiles["vm1"])
# Let's solve again to get the solve printout
mapdl.solution()
output = mapdl.solve()
print(output)
mapdl.exit() # Kill the MAPDL instance
PyMAPDL automatically sets MAPDL to read the job configuration (including machines, number of CPUs, and memory), which allows MAPDL to use all the resources allocated to that job.
Tight integration between MAPDL and the HPC scheduler#
Since v0.68.5, PyMAPDL can take advantage of the tight integration between the scheduler and MAPDL to read the job configuration and launch an MAPDL instance that can use all the resources allocated to that job. For instance, if a SLURM job has allocated 8 nodes with 4 cores each, then PyMAPDL launches an MAPDL instance that uses 32 cores spanning those 8 nodes.
This behavior can be turned off by setting the PYMAPDL_RUNNING_ON_HPC
environment variable to 'false' or by passing the detect_hpc=False argument
to the launch_mapdl() function.
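A minimal sketch of both ways of disabling the detection follows:
import os

from ansys.mapdl.core import launch_mapdl

# Either set the environment variable before launching ...
os.environ["PYMAPDL_RUNNING_ON_HPC"] = "false"
# ... or pass the argument directly.
mapdl = launch_mapdl(detect_hpc=False)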
Alternatively, you can override these settings by either specifying
custom settings in the launch_mapdl()
function’s arguments or using specific environment variables.
For more information, see Environment variables.
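For instance, here is a sketch of overriding the detected core count, assuming an explicit nproc takes precedence over the job allocation:
# Inside a SLURM job: use only four cores even if the allocation is larger.
from ansys.mapdl.core import launch_mapdl

mapdl = launch_mapdl(nproc=4)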