Humboldt-Universität zu Berlin - Faculty of Mathematics and Natural Sciences - IT Service Group

Slurm

Introduction

A queue system is installed on the computers of the PC pool and of some workgroups to manage the resources used by computing tasks. The software used is Slurm. Its use is described in the following.

A queue system allows computationally intensive jobs to be queued so that they are executed as soon as enough resources are available.

Each PC is a node on which so-called jobs, i.e. one or more programs, are executed. A job can also run in parallel on several nodes. Each node is essentially a resource consisting of a number of CPU cores and a certain amount of RAM.

To run a job on one or more nodes, one only needs to log in to one of the devices involved. This works both locally and via ssh.

 

 

Commands (selection)

Slurm provides a variety of commands, of which the following should be the most useful for most users:

 
Information about nodes:

  • sinfo -N -l lists the nodes and their status. Here you can also directly see the different types of computing nodes and their availability.
$ sinfo -l
PARTITION   AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
interactive    up    2:00:00        1-2   no       NO        all     57        idle adlershof,alex,bernau,britz,buch,buckow,chekov,dahlem,dax,dukat,erkner,forst,frankfurt,garak,gatow,gruenau[1-2,5-10],guben,karow,kes,kira,kudamm,lankwitz,marzahn,mitte,nauen,nog,odo,pankow,picard,pille,potsdam,prenzlau,quark,rudow,scotty,seelow,sisko,spandau,staaken,steglitz,sulu,tegel,templin,treptow,troi,uhura,wandlitz,wannsee,wedding,wildau
std*          up 4-00:00:00       1-16   no       NO        all     57        idle adlershof,alex,bernau,britz,buch,buckow,chekov,dahlem,dax,dukat,erkner,forst,frankfurt,garak,gatow,gruenau[1-2,5-10],guben,karow,kes,kira,kudamm,lankwitz,marzahn,mitte,nauen,nog,odo,pankow,picard,pille,potsdam,prenzlau,quark,rudow,scotty,seelow,sisko,spandau,staaken,steglitz,sulu,tegel,templin,treptow,troi,uhura,wandlitz,wannsee,wedding,wildau
gpu            up 4-00:00:00       1-16   no       NO        all     37        idle adlershof,alex,bernau,britz,buch,buckow,dahlem,erkner,forst,frankfurt,gatow,gruenau[1-2,9-10],guben,karow,kudamm,lankwitz,marzahn,mitte,nauen,pankow,potsdam,prenzlau,rudow,seelow,spandau,staaken,steglitz,tegel,templin,treptow,wandlitz,wannsee,wedding,wildau
gruenau        up 5-00:00:00        1-2   no       NO        all      8        idle gruenau[1-2,5-10]
pool           up 4-00:00:00       1-16   no       NO        all     49        idle adlershof,alex,bernau,britz,buch,buckow,chekov,dahlem,dax,dukat,erkner,forst,frankfurt,garak,gatow,guben,karow,kes,kira,kudamm,lankwitz,marzahn,mitte,nauen,nog,odo,pankow,picard,pille,potsdam,prenzlau,quark,rudow,scotty,seelow,sisko,spandau,staaken,steglitz,sulu,tegel,templin,treptow,troi,uhura,wandlitz,wannsee,wedding,wildau
  • scontrol show node [NODENAME] shows a very detailed overview of all nodes or a single node. Here you can see all features a node offers. You can also see the current load.
$ scontrol show node adlershof
NodeName=adlershof Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=8 CPULoad=0.00
   AvailableFeatures=intel,avx2,skylake
   ActiveFeatures=intel,avx2,skylake
...

 

Information about submitting jobs:

  • sbatch JOBSCRIPT queues a job script.
  • srun PARAMETER runs a job with parameters ad hoc. This should be seen only as a supplement to sbatch or as a way to test commands. Examples of srun / sbatch commands can be found further down the page.
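For example, a quick ad-hoc test could look like this (a sketch; it requires a node with the Slurm tools installed, and the printed name depends on which node Slurm assigns):

```shell
# Run the hostname command as a single task through the queue system;
# the printed node name depends on which node Slurm assigns.
srun -n 1 hostname
```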

 
Information about running jobs:

  • squeue shows the contents of the queues.
  • scontrol show job JOBNUMBER displays information about a specific job.
  • scancel JOBNUMBER cancels a specific job.
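A typical monitoring workflow, sketched here with the placeholder job number 1234:

```shell
# List only your own jobs ($USER expands to the current user name):
squeue -u "$USER"

# Show the details of one job, then cancel it if it is no longer needed:
scontrol show job 1234
scancel 1234
```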


More useful commands and parameters can be found in the Slurm Cheat Sheet.

 

 

Partitions

Depending on the requirements of the program, different queues (called partitions by Slurm) are available. Here is an overview:

PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
interactive    up    2:00:00     57   idle adlershof,alex,bernau,britz,buch,buckow,chekov,dahlem,dax,dukat,erkner,forst,frankfurt,garak,gatow,gruenau[1-2,5-10],guben,karow,kes,kira,kudamm,lankwitz,marzahn,mitte,nauen,nog,odo,pankow,picard,pille,potsdam,prenzlau,quark,rudow,scotty,seelow,sisko,spandau,staaken,steglitz,sulu,tegel,templin,treptow,troi,uhura,wandlitz,wannsee,wedding,wildau
std*           up 4-00:00:00     57   idle adlershof,alex,bernau,britz,buch,buckow,chekov,dahlem,dax,dukat,erkner,forst,frankfurt,garak,gatow,gruenau[1-2,5-10],guben,karow,kes,kira,kudamm,lankwitz,marzahn,mitte,nauen,nog,odo,pankow,picard,pille,potsdam,prenzlau,quark,rudow,scotty,seelow,sisko,spandau,staaken,steglitz,sulu,tegel,templin,treptow,troi,uhura,wandlitz,wannsee,wedding,wildau
gpu            up 4-00:00:00     37   idle adlershof,alex,bernau,britz,buch,buckow,dahlem,erkner,forst,frankfurt,gatow,gruenau[1-2,9-10],guben,karow,kudamm,lankwitz,marzahn,mitte,nauen,pankow,potsdam,prenzlau,rudow,seelow,spandau,staaken,steglitz,tegel,templin,treptow,wandlitz,wannsee,wedding,wildau
gruenau        up 5-00:00:00      8   idle gruenau[1-2,5-10]
pool           up 4-00:00:00     49   idle adlershof,alex,bernau,britz,buch,buckow,chekov,dahlem,dax,dukat,erkner,forst,frankfurt,garak,gatow,guben,karow,kes,kira,kudamm,lankwitz,marzahn,mitte,nauen,nog,odo,pankow,picard,pille,potsdam,prenzlau,quark,rudow,scotty,seelow,sisko,spandau,staaken,steglitz,sulu,tegel,templin,treptow,troi,uhura,wandlitz,wannsee,wedding,wildau

The default queue spanning all available machines (pool + gruenau servers) is called std. Whenever no partition is explicitly named in a job description, std is selected automatically.

Furthermore, there is a queue interactive, which is limited both in its runtime and in the number of nodes that can be used simultaneously. This queue has a higher priority when jobs are scheduled and is therefore suitable for test runs or configuration tasks.

Note: All nodes that are part of the pool are only partially utilized by Slurm during the day (9am-5pm). This restriction does not apply to compute nodes.

$ srun --partition=interactive -n 1 --pty bash -i

The queues can also be filtered by specifying certain resources (such as AVX512, GPU, ...) as a condition. In the following example, a node with a GPU is requested:

$ srun -n 1 --gres=gpu:1 ...

Description of the partitions:

  1. std: Default partition. Used if no partition is specified in the script. All nodes are contained here.
  2. interactive: Partition for testing jobs. Only interactive jobs (using srun + matching parameters) are allowed here. Allowed time is max. 2h.
  3. gpu: GPU partition. All nodes have at least 1 GPU. To actually request a GPU, it must be specified using gres; this can be a restriction to a model, driver, memory size or a number of GPUs. Jobs here have a higher priority for GPU programs.
  4. pool: Pool partition. All pool machines are in here; otherwise analogous to std.
  5. gruenau: Gruenau partition. All gruenau servers are included here; otherwise analogous to std. A maximum of two gruenau servers can be allocated at the same time.

 

Use of sbatch

Using sbatch, predefined job scripts can be submitted. The entire configuration of the job and the commands to be executed are described exclusively in the script. Typically, job scripts are written as shell scripts, so the first line should look like this:

#!/bin/bash

The configuration parameters follow in the subsequent lines; conditions and dependencies can also be defined here. Each configuration line starts with the magic word #SBATCH. Example:

### Let Slurm allocate 4 nodes 
#SBATCH --nodes=4 

squeue --jobs=<ID> returns more information about a job inside the queue. If no output parameter is specified, Slurm creates an output file slurm-<JOB-ID>.out in the submission folder.
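Putting this together, a minimal job script might look as follows (a sketch; the payload is only an echo):

```shell
#!/bin/bash
# Minimal job configuration: a single node, default output file.
#SBATCH --job-name=minimal-demo
#SBATCH --nodes=1

echo "running on $(hostname)"
```

After submitting it with sbatch, the echoed line ends up in the default output file in the submission folder.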

 

If a combination of resource requests is not supported by the selected partition, Slurm returns an appropriate error message when running sbatch.

 

Important parameters:
Parameter Function
--job-name=<name> Job name. If none is given, Slurm generates one.
--output=<path> Output path for both results and errors. If none is given, both outputs are written to the submission folder.
--time=<runlimit> Runtime limit in hours:min:sec. When the time runs out, the job is automatically killed.
--mem=<memlimit> Main memory allocated per node.
--nodes=<# of nodes> Number of nodes to be allocated.
--partition=<partition> Sets the partition on which the job should run. If none is given, the default partition is used.
--gres=<gres> Used for allocating hardware resources like GPUs.

Parallel Programming (Open MP)
 
--cpus-per-task=<num_threads> Number of threads per task. If, for example, a node has 4 cores (without hyperthreads) and all should be used, the parameter should be set to --cpus-per-task=4.
--ntasks-per-core=<num_hyperthreads> Number of hyperthreads per CPU core. Values >1 enable hyperthreading where available (not every CPU in the pool supports HT).
--ntasks-per-node=1 Recommended setting for OpenMP (without MPI).

Parallel Programming (MPI)
 
--ntasks-per-node=<num_procs> Number of tasks per node. For MPI-only parallel programs this should equal the number of CPU cores (without hyperthreading).
--ntasks-per-core=1 Recommended setting for MPI (without OpenMP).
--cpus-per-task=1 Recommended setting for MPI (without OpenMP).
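A job script header combining several of the parameters above might look like this (all values are illustrative, and ./my_program is a placeholder for the actual workload):

```shell
#!/bin/bash
#SBATCH --job-name=param-demo    # name shown in squeue
#SBATCH --output=result_%j.out   # %j is replaced by the job ID
#SBATCH --time=02:00:00          # kill the job after 2 hours at the latest
#SBATCH --mem=4G                 # main memory per node
#SBATCH --nodes=1
#SBATCH --partition=std

./my_program                     # placeholder for the actual workload
```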

 

Example scripts (sbatch)

Hello World (Print hostname)

hello_v1.sh: Four nodes return their hostnames once each.

#!/bin/bash

# Job name
#SBATCH --job-name=hello-slurm
# Number of Nodes
#SBATCH --nodes=4
# Number of processes per Node
#SBATCH --ntasks-per-node=1
# Number of CPU-cores per task
#SBATCH --cpus-per-task=1

srun hostname

 

Output:

adlershof
alex
britz
buch

 

hello_v2.sh: Two nodes return their hostnames twice each.

#!/bin/bash

# Job name
#SBATCH --job-name=hello-slurm
# Number of Nodes
#SBATCH --nodes=2
# Number of processes per Node
#SBATCH --ntasks-per-node=2
# Number of CPU-cores per task
#SBATCH --cpus-per-task=1

srun hostname

 

Output:

adlershof
adlershof
alex
alex

 

Parallel Programming (OpenMP)

openmp.sh: Here a program with four threads is executed on one node. To do this, the number of requested CPU cores (without hyperthreads) is first set to four. This number is then passed on to OpenMP.

#!/bin/bash

# Job name
#SBATCH --job-name=openmp-slurm
# Number of Nodes
#SBATCH --nodes=1
# Number of processes per Node
#SBATCH --ntasks-per-node=1
# Number of CPU-cores per task
#SBATCH --cpus-per-task=4
# Disable Hyperthreads
#SBATCH --ntasks-per-core=1

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./my_openmp_program

 

Parallel Programming (MPI)

hello_mpi.sh: Similar to the Hello World example, all nodes involved return an output here. The code for this example can be found here. Communication and synchronization are done via MPI. srun offers several protocols for the transfer of MPI data. You can get a list of the supported protocols with the following command:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmix
srun: pmi2

In addition to the protocol an MPI implementation (see section mpi-selector) must be selected. Not all MPI implementations support every transmission protocol. A good overview of available combinations and best practices can be found here.


The following script starts four processes on each of 2 nodes, which communicate with each other via pmix_v3. The code was previously compiled using OpenMPI 4: mpic++ mpi_hello.cpp -o mpi_hello

#!/bin/bash

# Job Name
#SBATCH --job-name=mpi-hello
# Number of Nodes
#SBATCH --nodes=2
# Number of processes per Node
#SBATCH --ntasks-per-node=4
# Number of CPU-cores per task
#SBATCH --cpus-per-task=1

# Compiled with OpenMPI 4
srun --mpi=pmix_v3  mpi_hello

 

Output:

Hello world from processor gatow, rank 2 out of 8 processors
Hello world from processor gatow, rank 3 out of 8 processors
Hello world from processor gatow, rank 0 out of 8 processors
Hello world from processor gatow, rank 1 out of 8 processors
Hello world from processor karow, rank 4 out of 8 processors
Hello world from processor karow, rank 5 out of 8 processors
Hello world from processor karow, rank 6 out of 8 processors
Hello world from processor karow, rank 7 out of 8 processors

 

Mixed Parallel Programming (OpenMP + MPI)

hello_hybrid.sh: There is also the possibility to combine OpenMP and MPI. Each started MPI process can then start multiple threads on multiple CPU cores. The code for this example can be found here. The following Slurm script starts four processes on 2 nodes, which start 2 threads each. The code was previously compiled using OpenMPI 4: mpic++ -fopenmp hybrid_hello.cpp -o hybrid_hello

 

#!/bin/bash

# Job Name
#SBATCH --job-name=hybrid-hello
# Number of Nodes
#SBATCH --nodes=2
# Number of processes per Node
#SBATCH --ntasks-per-node=2
# Number of tasks in total
#SBATCH --ntasks=4
# Number of CPU-cores per task
#SBATCH --cpus-per-task=2

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# Compiled with OpenMPI 4
srun --mpi=pmix_v3 hybrid_hello

 

Output:

Hello from thread 0 of 2 in rank 0 of 4 on gatow
Hello from thread 1 of 2 in rank 0 of 4 on gatow
Hello from thread 1 of 2 in rank 1 of 4 on gatow
Hello from thread 0 of 2 in rank 1 of 4 on gatow
Hello from thread 1 of 2 in rank 2 of 4 on karow
Hello from thread 0 of 2 in rank 2 of 4 on karow
Hello from thread 1 of 2 in rank 3 of 4 on karow
Hello from thread 0 of 2 in rank 3 of 4 on karow

 

 

GPU Programming (Tensorflow)

tensorflow_gpu.sh: To be able to use at least one GPU, a suitable resource must be requested in Slurm using gres. Requests can be generic, such as a minimum number of GPU cards (--gres=gpu:2) or a certain CUDA compute capability (--constraint=cu80). Alternatively, a specific GPU model can be requested. The code for this example can be found here. An overview of all available models and their gres designations and multiplicities can be found on the GPU server overview page.

 

#!/bin/bash

# Job Name
#SBATCH --job-name=tensorflow-gpu
# Number of Nodes
#SBATCH --nodes=1
# Set the GPU-Partition (opt. but recommended)
#SBATCH --partition=gpu
# Allocate node with certain GPU
#SBATCH --gres=gpu:gtx745

# Export necessary libraries
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cuda/lib64/

python mnist_classify.py

 

Output (trunc.):

2021-03-29 12:20:10.976419: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3591 MB memory) -> physical GPU (device: 0, name: GeForce GTX 745, pci bus id: 0000:01:00.0, compute capability: 5.0)
...
Train on 60000 samples
Epoch 1/10
...
10000/10000 - 0s - loss: 1.4880 - acc: 0.9741
('\nTest accuracy:', 0.9741)

 

Folder structure

The path containing both the job script and the executed program must be available on all nodes under the same path. The program must therefore either be installed system-wide or reside on a global file system (see file server overview). For Slurm jobs with specially compiled programs and larger input/output files, /glusterfs/dfs-gfs-dist is recommended. This folder is available on all nodes (both pool and compute nodes). For a better overview, create a subfolder named after your own user name. By default, such a folder is only accessible to its owner.

cd /glusterfs/dfs-gfs-dist
mkdir brandtfa
ls -la
> drwx------    2 brandtfa maks           4096 18. Mär 11:29 brandtfa

For saving data (especially the results of calculations), your own HOME directory can be used. It can also hold smaller programs and data sets for the calculation itself. Please note the size limitation here.

 

 

MPI

In the case of multi-process programs that are to run either on one or on several nodes, MPI is used as the communication standard. Several implementations are installed on all nodes:

  • openmpi
  • openmpi2
  • openmpi4
  • mpich

 

Each of the implementations provides its own headers, libraries and binaries. Programs that are to use MPI must be compiled with the compilers of the respective MPI environment. One peculiarity: after logging in, none of the implementations is active at first, so one must be activated. This can be done, among other ways, by means of mpi-selector.

# Get list of all installed MPI-versions
$ mpi-selector --list
mpich
openmpi
openmpi2
openmpi4
$ mpi-selector --set openmpi4
$ mpi-selector --query 
default:openmpi4
level:user

 

Note: The MPI environment is only available after a re-login.

$ mpirun --version
mpirun (Open MPI) 4.0.5.0.88d8972a4085

Report bugs to http://www.open-mpi.org/community/help/

 

 

Best Practices

Use Slurm where possible!

If a program consumes more time and resources and the result of the calculation is not time-critical, it should be executed using the queue system. This ensures a proper utilization of the PC pool, since the resources are managed automatically.

In particular (with the exception of short computations), no computationally intensive programs should be started via a remote console: this disturbs not only the person sitting in front of the computer, who can hardly work, but also other people using the queue. Slurm has no way to manage directly executed programs, so over-provisioning of resources can occur when both run in parallel.

 

Partitions are here to help

While in principle all jobs can be processed on the standard std partition, it is recommended to select the appropriate partition for special requirements. Jobs on the specialized partitions such as gpu or gruenau have higher priority, which means they are processed sooner on a node when there are multiple resource requests.

 

Only allocate what you need

It is not checked whether a program really uses only the specified number of cores. However, it is in your interest and that of others to specify correct values and to limit your program to them. Slurm tries to utilize the resources as well as possible: if two jobs each specify that they need 16 cores, they can run simultaneously on a node with 32 cores. If the specification is wrong and a program uses more cores, the resources can no longer be distributed optimally and both jobs on the node take longer.

 

Set limits

By means of the --time parameter you can specify when your program will be terminated at the latest. Use this to prevent a program that fails to terminate due to an error from blocking nodes. Note: The interactive partition sets an automatic time limit of 2h.
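A limit can be set either in the job script or ad hoc (a sketch; ./my_program is a placeholder, and the format matches the partition table above, e.g. 4-00:00:00 for four days):

```shell
# In a job script: terminate after at most 90 minutes.
#SBATCH --time=01:30:00

# The same limit for an ad-hoc run (format [days-]hours:minutes:seconds):
srun --time=01:30:00 -n 1 ./my_program
```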

 

Backup your data

It can always happen that a job terminates unintentionally: because the maximum runtime (--time) expires, because there is a bug in the program, or because there is an error on the machine. Therefore, if possible, save intermediate results regularly so the calculation can be restarted from that point.
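A simple pattern for this, sketched here with a placeholder computation, is to write numbered checkpoint files and resume from the newest one:

```shell
#!/bin/bash
# Sketch: each step of a long computation writes a checkpoint file.
# The echo stands in for the real calculation step.
for step in 1 2 3; do
    echo "state after step $step" > "checkpoint_$step.txt"
done

# On restart, pick up the newest checkpoint instead of starting over.
last=$(ls checkpoint_*.txt | sort | tail -n 1)
echo "resuming from $last"
```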

 

Keep everything clean

Data that is no longer needed after the calculation should be deleted at the end of the script.
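In a bash job script this can be automated with an exit trap, so temporary data is removed even when the script fails partway (a sketch; the scratch directory is created with mktemp):

```shell
#!/bin/bash
# Create a private scratch directory and register a cleanup handler
# that removes it whenever the script exits, normally or with an error.
SCRATCH=$(mktemp -d)
trap 'rm -rf "$SCRATCH"' EXIT

# ... the actual computation writes its temporary data here ...
echo "intermediate data" > "$SCRATCH/tmp.dat"
```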