Introduction to SLURM#

High performance computing (HPC) clusters are powerful systems designed to handle computationally intensive tasks efficiently. SLURM (Simple Linux Utility for Resource Management) is one of the most widely used job schedulers in HPC environments. This page provides an overview of job submission using PyMAPDL and SLURM on HPC clusters.

What is SLURM?#

SLURM is an open source workload manager and job scheduler designed for Linux clusters of all sizes. It efficiently allocates resources (compute nodes, CPU cores, memory, and GPUs) to jobs submitted by users.

For more information on SLURM, see the SLURM documentation.

Basic terms#

The following basic terms are used throughout this page:

  • Nodes: Individual computing servers within the cluster.

  • Compute node: A type of node used only for running processes. It is not accessible from outside the cluster.

  • Login node: A type of node used only for login and job submission. No computation should be performed on it. It is sometimes referred to as virtual desktop infrastructure (VDI).

  • Partition: A logical grouping of nodes with similar characteristics (for example, CPU architecture and memory size).

  • Job: A task submitted to SLURM for execution.

  • Queue: A waiting area where jobs are held until resources become available.

  • Scheduler: The component that decides which job runs, and when and where it runs.

Regular job submission workflow#

Log into the cluster#

You need access credentials and permissions to log in and submit jobs on the HPC cluster. Depending on the login node configuration, you can log in using Virtual Network Computing (VNC) applications or a terminal.

For example, you can log in to a login node using the terminal:

user@machine:~$ ssh username@login-node-hostname

Writing a SLURM batch script#

A SLURM batch script is a shell script that specifies job parameters and commands to execute. Here’s a basic example:

my_script.sh

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=01:00:00

# Commands to run
echo "Hello, SLURM!"
srun my_executable

Submit this script with the sbatch command, which reads the comment lines prefixed with #SBATCH as job configuration directives. (The srun command can also execute the script, but it does not interpret #SBATCH directives.) For more information on available srun and sbatch arguments, see Slurm Workload Manager - srun and Slurm Workload Manager - sbatch.

Submitting a job#

To run a job interactively (srun blocks until the job completes), use the srun command followed by the name of the script:

user@machine:~$ srun my_script.sh

To submit the script as a batch job and return immediately, use the sbatch command:

user@machine:~$ sbatch my_script.sh

You can also specify job settings directly on the command line. For example:

user@machine:~$ srun --nodes=2 my_script.sh

Warning

Command line arguments versus in-file arguments: When you submit a script with sbatch, options given on the command line take precedence over the equivalent #SBATCH directives inside the script. When you run a script with srun, the #SBATCH directives are not read at all. To avoid confusion, do not specify the same option in both places.
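
For example, because the script above already contains #SBATCH --time=01:00:00, the following submission runs the same job with a 10-minute limit instead, since the command-line value takes precedence:

user@machine:~$ sbatch --time=00:10:00 my_script.sh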

Monitoring jobs#

View the job queue#

The squeue command displays information about jobs that are currently queued or running on the system.

Basic usage:

squeue

To see jobs from a specific user:

squeue -u username

To filter jobs by partition:

squeue -p partition_name

Common options:

  • -l or --long: Displays detailed information about each job.

  • --start: Predicts and shows the start times for pending jobs.
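
These options can be combined. For example, the following sketch lists the jobs of one user with a custom set of columns (job ID, name, state, elapsed time, and node list or pending reason); the format string is only one possible layout:

squeue -u username -o "%.18i %.30j %.8T %.10M %R"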

Control jobs and configuration#

The scontrol command provides a way to view and modify the SLURM configuration and state. It’s a versatile tool for managing jobs, nodes, partitions, and more.

Show information about a job:

scontrol show job <jobID>

Show information about a node:

scontrol show node nodename

Hold and release jobs:

  • To hold a job (stop it from starting): scontrol hold <jobID>

  • To release a job on hold: scontrol release <jobID>
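
Beyond holding and releasing jobs, scontrol update can modify attributes of a job that is still pending. The following sketch lowers a job's time limit; the value is only illustrative, and which attributes a regular user may change depends on site policy (for example, users can typically lower, but not raise, their own job's time limit):

scontrol update JobId=<jobID> TimeLimit=00:30:00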

Cancel jobs#

The scancel command cancels a running or pending job.

Cancel a specific job:

scancel <jobID>

Cancel all jobs of a specific user:

scancel -u username

Cancel jobs by partition:

scancel -p partition_name

Common options:

  • --name=jobname: Cancels all jobs with a specific name.

  • --state=pending: Cancels all jobs in a given state (here, all pending jobs).
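
These filters can be combined. For example, the following sketch cancels only the pending jobs named myjob that belong to username (both names are placeholders):

scancel -u username --state=pending --name=myjob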

Report accounting information#

The sacct command reports job or job step accounting information about active or completed jobs.

Basic usage:

sacct

To see information about jobs from a specific user:

sacct -u username

To show information about a specific job or job range:

sacct -j <jobID>
sacct -j <jobID_1>,<jobID_2>

Common options:

  • --format: Specifies which fields to display. For example, --format=JobID,JobName,State.

  • -S and -E: Sets the start and end times for the report. For example, -S 2023-01-01 -E 2023-01-31.
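
For example, the following sketch combines these options to summarize one user's jobs for January 2023; the selected fields are only one possible choice:

sacct -u username -S 2023-01-01 -E 2023-01-31 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS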

For more information, see the SLURM documentation or use the man command (for example, man squeue) to explore all available options and their usage.

Best practices#

  • Optimize resource usage to minimize job wait times and maximize cluster efficiency.

  • Regularly monitor job queues and system resources to identify potential bottlenecks.

  • Follow naming conventions for batch scripts and job names to maintain organization.

  • Keep batch scripts and job submissions concise and well-documented for reproducibility and troubleshooting.

Advanced configuration#

The following topics provide some advanced ideas for you to explore when using PyMAPDL on HPC clusters. They are only described briefly here; for more details, see online resources such as the SLURM documentation.

Advanced job management#

Job dependencies#

Specify dependencies between jobs using the --dependency flag. Jobs can depend on completion, failure, or other criteria of previously submitted jobs.
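
For example, here is a minimal sketch of a two-step chain, assuming hypothetical scripts preprocess.sh and solve.sh:

# Submit the first job and capture its job ID (--parsable prints only the ID)
jobid=$(sbatch --parsable preprocess.sh)
# The second job starts only if the first one completes successfully
sbatch --dependency=afterok:${jobid} solve.sh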

Array jobs#

Submit multiple jobs as an array using the --array flag. Each array element corresponds to a separate job, allowing for parallel execution of similar tasks.
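
For example, the following sketch submits ten tasks of a hypothetical script my_array_script.sh:

# Submit array tasks 1 through 10 of a hypothetical script
sbatch --array=1-10 my_array_script.sh

Inside my_array_script.sh, each task can read its own index from the SLURM_ARRAY_TASK_ID environment variable, for example to select an input file such as input_${SLURM_ARRAY_TASK_ID}.dat.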

Job arrays with dependencies#

Combine array jobs with dependencies for complex job scheduling requirements. This allows for parallel execution while maintaining dependencies between individual tasks.
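
For example, the following sketch chains two hypothetical array jobs so that each task of the second array starts only after the corresponding task of the first array finishes successfully (the aftercorr dependency type):

# Capture the job ID of the first array job
array_id=$(sbatch --parsable --array=1-10 preprocess_array.sh)
# Task i of the second array waits for task i of the first array
sbatch --array=1-10 --dependency=aftercorr:${array_id} solve_array.sh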

Resource allocation and request#

Specify resources#

Use SLURM directives in batch scripts to specify required resources such as number of nodes, CPU cores, memory, and time limit.
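
For example, the directive block of a script requesting one node, four tasks per node, 8 GB of memory per node, and a one-hour limit could look like this (the values are only illustrative):

#!/bin/bash
#SBATCH --nodes=1              # Number of nodes
#SBATCH --ntasks-per-node=4    # Tasks (processes) per node
#SBATCH --mem=8G               # Memory per node
#SBATCH --time=01:00:00        # Wall-clock time limit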

Request resources#

Use the --constraint flag to request specific hardware configurations (for example, CPU architecture) or the --gres flag for requesting generic resources like GPUs.
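
For example, the following sketch requests nodes with a hypothetical feature name and one GPU per node; the available feature names and GPU types are specific to each cluster:

sbatch --constraint=broadwell --gres=gpu:1 my_script.sh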

Resource limits#

Set resource limits for individual jobs using directives such as --cpus-per-task, --mem, and --time.
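
The same limits can also be given on the command line at submission time; the values below are only illustrative:

sbatch --cpus-per-task=2 --mem=4G --time=00:30:00 my_script.sh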