Slurm

SLURM batch software

The Science cn-cluster is switching to SLURM for batch management. At the moment the TCM and HEF nodes are all switched to SLURM. The other nodes will follow.

Partitions

Jobs always run within a **partition**.

The partitions for groups with their own nodes can only be used by members of these groups. These partitions usually have high priority and their jobs can run indefinitely (MaxTime=INFINITE). Jobs in these partitions will suspend jobs submitted in the partition "all" and the low priority partition "heflowprio".

 PartitionName=tcm    Nodes=cn../.. AllowGroups=tcm       Priority=10
 PartitionName=hef    Nodes=cn../.. AllowGroups=ehef,thef Priority=10
 PartitionName=heflowprio Nodes=../.. AllowGroups=ehef,thef Priority=1
 PartitionName=milkun    Nodes=cn../.. AllowGroups=milkun Priority=10
 PartitionName=thchem Nodes=cn../..  AllowGroups=thchem   Priority=10

Jobs in the "cnczshort" partition also get high priority, but they will be killed if they run more than 12 hours.

 PartitionName=cnczshort Nodes=cn13 MaxTime=12:00:00 Priority=10 Preemptmode=REQUEUE

There is also a "cncz" partition that may be used by all cluster users for jobs that run less than a week. It has low priority:

 PartitionName=cncz Nodes=cn13 Priority=2 MaxTime=7-00:00:00

There is also an "all" partition containing all nodes. It has the lowest priority, a maximum run time of 12 hours and a memory limit of 2 GB per CPU:

 PartitionName=all MaxTime=12:00:00 MaxMemPerCPU=2048 Priority=1


It is wise to provide the partition as an option either on the command line as -p partitionname or in the shell script by including a line:

 #SBATCH --partition=partitionname

Submitting your first job

To execute your job on the cluster you need to write it in the form of a shell script (don't worry this is easier than it sounds). A shell script in its most basic form is just a list of commands that you would otherwise write on the command line. The first line of the script should contain an instruction telling the system which type of shell is supposed to execute it. Unless you really need something else you should always use bash. So without further ado here is your first shell script.

 #! /bin/bash
 #SBATCH --partition=all
 sleep 60
 echo "Hello world!"

Type this into an editor (such as nano or vi) and save it as hello.sh. To execute it on the headnode, first give it executable permissions (this is not needed when submitting it as a job)

 $ chmod u+x hello.sh

and run it.

 $ ./hello.sh

It will print (//echo//) Hello world to the screen (also called //standard out// or stdout). If anything goes wrong an error message will be sent to //standard error// or stderr which in this case is also the screen.

To execute this script as a //job// on one of the compute nodes we submit it to the cluster //scheduler//. This is done with the command sbatch.

 $ sbatch hello.sh

The scheduler will put your job in the named job //partition// and respond by giving you the //job number// (10 in this example).

 Submitted batch job 10

Your job will now wait until a slot on one of the compute nodes becomes available. Then it will be executed and the output is written to a file slurm-10.out in the directory from which you submitted the job (unless you specify otherwise as explained later).

Inspecting jobs

At any stage after submitting a job, while it is running, you may inspect its status with squeue.

 $ squeue --job 10

It will print some information including its status which can be:

 * PD pending
 * R running
 * S suspended

It will also show the name of the job script, the user who submitted it and the time used so far. To see full information about your job use the scontrol command.

 $ scontrol show job 10

You can get a quick overview of how busy the cluster is with

 $ squeue

which lists all jobs sorted by the nodes they are running on.

To get detailed information on a specific Jobid:

 $ scontrol show job -dd Jobid

You may use the scontrol command to get information about the compute nodes, e.g. the number of CPUs, the available memory and requestable resources such as GPUs, if available.

 $ scontrol show nodes

You may use the sinfo command to get information on nodes and partitions.

 $ sinfo -l

A useful overview of the state of all nodes can be achieved with:

 $ sinfo -Nle -o "%.20n %.15C %.8O %.7t" | uniq

Finally there is the sall command, which gives a quick overview of all jobs on the cluster. It supports the flags -r, -s and -p for running, suspended and pending jobs respectively.

 $ sall -r

More information

Understanding job priority

The scheduler determines which job to start first when resources become available based on the job's relative priority. This is a single floating point number calculated from various criteria.

Job priority is currently calculated based on the following criteria:

 * queue time,
   jobs that are submitted earlier get a higher priority;
 * the job's relative size,
   larger jobs requesting more resources such as nodes and CPUs get a higher priority because they are harder to schedule;
 * fair share,
   the priority of jobs submitted by a user increases or decreases depending on the resources (CPU time) consumed in the last week.

With the sprio -w command you can view the current weights used in the priority computation.

Job Priority Formula:

   Job_priority =
   (PriorityWeightAge) * (age_factor) +
   (PriorityWeightFairshare) * (fair-share_factor) +
   (PriorityWeightJobSize) * (job_size_factor) 
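
As a worked illustration of the formula above, the following sketch fills in some numbers. The weights and the function name job_priority are made up for this example and are not the cluster's actual settings (use sprio -w to see those). Each factor is assumed to be a number between 0.0 and 1.0.

```shell
#! /bin/bash
# Hypothetical weights, chosen only for illustration.
# The real values are shown by `sprio -w`.
weight_age=1000
weight_fairshare=10000
weight_jobsize=1000

# Usage: job_priority AGE_FACTOR FAIRSHARE_FACTOR JOBSIZE_FACTOR
# Each factor is a number between 0.0 and 1.0.
job_priority() {
    awk -v wa="$weight_age" -v wf="$weight_fairshare" -v wj="$weight_jobsize" \
        -v a="$1" -v f="$2" -v j="$3" \
        'BEGIN { printf "%.0f\n", wa * a + wf * f + wj * j }'
}

# 1000 * 0.5 + 10000 * 0.25 + 1000 * 0.1 = 3100
job_priority 0.5 0.25 0.1
```

With these example weights the fair share term dominates: a job that has waited half the maximum age still ranks below a fresh job from a user with a much better fair share factor.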

The sall command, when used in combination with the -p flag, lists each job's priority as well as an estimated start time.

Deleting jobs

If, after submitting a job, you discover that you made a mistake and want to delete it, you may do so with scancel.

 $ scancel 10


Submitting multiple jobs

Sometimes you may be able to split your job into multiple independent tasks, for instance when you need to perform the same analysis on multiple data files. Splitting this work up into multiple tasks helps because they can then run simultaneously on different compute nodes. This could be accomplished by writing a loop that generates and submits job scripts, but an easier approach is to use //array jobs//. When calling sbatch with the --array flag, an array job consisting of multiple //tasks// is submitted to the scheduler. The number of the current task is available in the script's execution environment as $SLURM_ARRAY_TASK_ID.

So for example the following test_array_job.sh script

 #! /bin/bash
 #SBATCH --partition=tcm
 echo "This is task $SLURM_ARRAY_TASK_ID"

can be submitted with.

 $ sbatch --array 1-4 ./test_array_job.sh

The four tasks in this job will now be executed independently with the same script, the only difference being the value of $SLURM_ARRAY_TASK_ID. This value can be used in the script, for instance to select a different input file for each task of your data reduction, as in.

 #! /bin/bash
 my_command ~/input-$SLURM_ARRAY_TASK_ID

Do note however that the tasks will not necessarily be executed in order so they really need to be independent!
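
When input files carry zero-padded numbers, the task number can be formatted before it is used. The following is a minimal sketch, assuming a hypothetical naming scheme input-001.dat, input-002.dat and so on. The function name input_for_task is made up for this example.

```shell
#! /bin/bash
# Map a task number to a file name such as input-007.dat.
# The naming scheme is an assumption; adapt it to your own files.
input_for_task() {
    printf 'input-%03d.dat\n' "$1"
}

# Outside the scheduler SLURM_ARRAY_TASK_ID is unset, so fall back to
# task 1 to allow a quick test by hand on the headnode.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
echo "Task $task_id would read $(input_for_task "$task_id")"
```

The fallback to task 1 makes it possible to try the script with ./test_array_job.sh before submitting the whole array.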

Limiting the number of simultaneous tasks

If you submit a large array job (containing more than, say, a hundred tasks) you may want to limit the number of simultaneous task executions. You may do this by appending % and the limit to the array range.

 $ sbatch --array 1-1000%20 ./test_array_job.sh

This will ensure that at most 20 tasks run at the same time.

Setting scheduler parameters in the job script

Instead of entering all flags to sbatch on the command line one can also choose to write them into the job script itself. For this simply prefix them with #SBATCH. This is especially useful for the following flags.

Changing the default logging location

By default standard output and error are both written to a file named after the job number (slurm-<jobid>.out) in the directory from which you submitted the job. This can be changed with the --output and (optionally) --error flags.

 #! /bin/bash
 # Choose partition to run job in
 #SBATCH --partition=hef
 # Send output to test.log in the directory you submit from
 # (sbatch does not expand ~ in #SBATCH lines)
 #SBATCH --output=test.log

Getting email updates about your job

Constantly watching squeue output to see when your job starts running and when it's done is not very efficient. Fortunately you can also request the scheduler to send you email in these situations.

 #! /bin/bash
 #SBATCH --mail-type=ALL
 #SBATCH --mail-user=<username>@science.ru.nl

This will get you email when your job begins, ends or fails. If you only want email when it is done you can use.

 #! /bin/bash
 #SBATCH --mail-type=END
 #SBATCH --mail-user=<username>@science.ru.nl

Time limits

To allow the scheduler to efficiently distribute the available compute time **each job has a one hour time limit**. If your job exceeds its limit (counted from the moment execution started on the compute node) it will be killed automatically. Now what if you know your job will need more than one hour to complete? In this case you can simply request more time with the --time flag.

So for instance if you expect to need 12 hours of compute time for your job add the following to your script.

 #! /bin/bash
 #SBATCH --time=12:00:00

For multi-day walltime you can simply add an extra days field as in --time=DD-HH:MM:SS.
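
If you know the expected runtime in seconds, for instance from timing a short test run, it can be converted into this format. The following is a small sketch; the function name format_walltime is made up for this example.

```shell
#! /bin/bash
# Convert a number of seconds into the D-HH:MM:SS form used by --time.
format_walltime() {
    local total="$1"
    local days=$(( total / 86400 ))
    local hours=$(( (total % 86400) / 3600 ))
    local minutes=$(( (total % 3600) / 60 ))
    local seconds=$(( total % 60 ))
    printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$minutes" "$seconds"
}

format_walltime 5400      # 1.5 hours, prints 0-01:30:00
format_walltime 604800    # one week, prints 7-00:00:00
```

The result can be passed on the command line, for example as sbatch --time=0-01:30:00 my_job.sh.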

Memory

If your job needs an excessive amount of memory to run you should request it using

 #! /bin/bash
 #SBATCH --mem=64G

for 64GB as an example. Please only use this if you know for certain that your job needs much more than 8GB of memory!

Nodes and cores

If your code uses parallelization using threads, OpenMP or MPI for instance, you may want to request more CPU resources. This can be in the form of more than one core on a single machine / node, more than one machine or a combination of both. These resources are requested with the -N flag combined with the -n flag as follows.

 #! /bin/bash
 #SBATCH -N 1 -n 8

requests 8 cores on one node.

 #! /bin/bash
 #SBATCH -N 10 -n 16

requests 16 cores divided over 10 compute nodes. Note that this helps **only if your code is parallelized**! Even if you think it is, parallelization may still be disabled by compile-time selections, so always check, for instance by running the code briefly on the head node and inspecting its CPU usage with top. If the usage rises above 100%, the code uses some kind of parallelization.

For MPI jobs you may only care about the total number of CPU cores used, not about if they are located on one machine or distributed over the system. Then you may use the -n flag by itself instead.

Requesting large numbers of cores and/or nodes reduces the probability that your job can be scheduled at each scheduling interval, thereby pushing its start time further into the future, so **only request what you need**!
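
To verify what a job actually received, the job script can print the allocation from the environment variables that Slurm sets inside a job. The following is a minimal sketch; the function name describe_allocation is made up for this example, and the fallback values only apply when the script is run by hand outside the scheduler.

```shell
#! /bin/bash
#SBATCH -N 1 -n 4

# Print the number of nodes, the number of tasks and the node list.
# SLURM_JOB_NUM_NODES, SLURM_NTASKS and SLURM_JOB_NODELIST are set by
# Slurm inside a job; the defaults are used outside the scheduler.
describe_allocation() {
    local nodes="${SLURM_JOB_NUM_NODES:-1}"
    local tasks="${SLURM_NTASKS:-1}"
    local nodelist="${SLURM_JOB_NODELIST:-$(uname -n)}"
    echo "$nodes node(s), $tasks task(s): $nodelist"
}

describe_allocation
```

The line ends up at the top of the job's output file, which makes it easy to compare the request with what the program used.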

Advanced node selection

Sometimes you may need more control over which nodes are used (for instance when your data is located on the local /scratch of a particular node). You can specify explicit node names with the -w flag (optionally multiple, separated by commas) as follows.

 #! /bin/bash
 #SBATCH -N 2 -n 4 -w cn90,cn91

This will request 4 cores in total, spread over cn90 and cn91.

Examples

Here are a couple of example scripts (feel free to add your own). Unless otherwise stated just save them as my_job.sh and submit them to the scheduler with.

 $ sbatch my_job.sh

Run a Python script

This job script, when submitted, executes a Python script ~/my_script.py on a random available node.

 #! /bin/bash
 
 /usr/bin/python ~/my_script.py

Process a list of data files using an array job

Let's say filenames.txt is a text file containing the 100 filenames you wish to process, for instance created with

 ls *.dat > filenames.txt

You may now combine array jobs with clever use of the awk command to select the current filename from this file.

Submit the following script with sbatch --array 1-100 my_job.sh.

 #! /bin/bash
 #
 # Each task needs 1.5 hours of runtime
 #SBATCH --time=01:30:00
 #
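 # Pick the line of filenames.txt whose number equals the task number
 INPUT_FILE=$(awk "NR==$SLURM_ARRAY_TASK_ID" filenames.txt)
 #
 # Quote the name so that filenames containing spaces survive
 my_command "$INPUT_FILE"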
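
If filenames.txt has fewer lines than the array has tasks, the selected name is empty and my_command runs without an input file. The following sketch adds a guard against that. The function name nth_line is made up for this example, and the fallback to task 1 only serves to try the script by hand outside the scheduler.

```shell
#! /bin/bash
# Print line number $2 of file $1; fail if the file has no such line.
nth_line() {
    local file="$1" n="$2" line
    line=$(awk -v n="$n" 'NR == n { print; exit }' "$file")
    if [ -z "$line" ]; then
        echo "No line $n in $file" >&2
        return 1
    fi
    printf '%s\n' "$line"
}

# Outside the scheduler SLURM_ARRAY_TASK_ID is unset, so default to task 1.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

if [ -r filenames.txt ] && input_file=$(nth_line filenames.txt "$task_id"); then
    echo "Task $task_id would process: $input_file"
fi
```

In a real job script the echo line would be replaced by the actual command, for example my_command "$input_file".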

Run a large data reduction job

This is an example of a job that needs a significant amount of resources on a single specific node (//cn90// in this example).

 #! /bin/bash
 #
 # Request 10 CPU cores and 64GB of memory for 7 days on cn90
 #SBATCH --partition=tcm
 #SBATCH -N 1 -n 10
 #SBATCH --mem=64G
 #SBATCH --time=7-00:00:00
 #SBATCH -w cn90
 #
 # Get email updates
 #SBATCH --mail-type=ALL
 #SBATCH --mail-user=<username>@science.ru.nl
 #
 run_my_large_job

This assumes you have your data stored in /scratch (local storage) on cn90.science.ru.nl. If your application is not IO limited simply store your data in /vol/tcm# or your home directory ~/ instead and remove the -w line to allow the scheduler to pick a random node.

Request a GPU

To request a GPU, add the following to your job script:

  # set the number of GPU cards to use per node
  #SBATCH --gres=gpu:1

Run an OpenMP job

OpenMP is frequently used for multithreaded parallel programming. To use OpenMP on the cluster, make sure your code is compiled with the -fopenmp flag. Any CPU cores requested will then be picked up automatically (i.e. there is no need to set the OMP_NUM_THREADS environment variable). In fact, setting it will probably slow your code down if it forces the number of threads above the number of CPU cores available, since that leads to overhead.

So the following example code for a parallel "for" loop:

 #include <omp.h>
 int main(int argc, char *argv[]) {
   const int N = 100000;
   int i, a[N];

   #pragma omp parallel for
   for (i = 0; i < N; i++)
     a[i] = 2 * i;

   return 0;
 }

can be compiled and run on the cluster with the following job script.

 #! /bin/bash
 #
 # Request 4 CPU cores for this OpenMP code
 #SBATCH -N 1 -n 4
 #
 # First compile it
 gcc -O2 -mtune=native -fopenmp my_code.c -o my_code 
 #
 # and run
 ./my_code

Run an MPI job

The cluster scheduler has full built-in support for OpenMPI, so it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Instead you can simply call mpirun from your job script; it will be aware of all nodes and cores requested and use them accordingly.

 #! /bin/bash
 #
 # Request 32 processor cores randomly distributed over nodes
 #SBATCH -n 32
 #
 # And 12 hours of runtime
 #SBATCH --time=12:00:00
 #
 # Get email when it's done
 #SBATCH --mail-type=END
 #SBATCH --mail-user=<username>@science.ru.nl
 #
 mpirun my_mpi_application

Copy data to local scratch from external server

This copies data from an external location to local scratch storage using scp. If this takes a significant amount of time, or has to be done regularly, you may want to submit a job to the scheduler to do it //overnight//.

Generate keys (**You only need to do this once!**)

The job script, when executed, cannot ask you for your password, so we need to set up access via a public-private keypair first. Create a new keypair on the headnode with

 ssh-keygen -t rsa

and hit enter a few times to accept the defaults (do **not** enter a passphrase, or the job script will be prompted for it and fail). Now copy your public key to the remote host containing the data.

 ssh-copy-id -i ~/.ssh/id_rsa.pub user@server.example.com

Copy the data

Now you may submit the following script to the scheduler.

 #! /bin/bash
 #
 #SBATCH --partition=tcm
 # Select the node
 #SBATCH -w cn90
 #
 # Get email when it's done
 #SBATCH --mail-type=END
 #SBATCH --mail-user=<username>@science.ru.nl
 #
 scp -r user@server.example.com:/path/over/there /scratch/path/over/here

Interactive jobs

If you need interactive access (i.e. a shell) you can ask the scheduler for it with the following steps.

1. Request an allocation of resources (for instance 1 core for 2 hours on a random node):

 salloc -c 1 --partition=tcm --time 2:00:00

2. Attach a shell to this allocation:

 srun --pty bash

You then get a prompt on an available node and can start working. This shell will be automatically closed after two hours.