

Revision as of 15:34, 6 October 2015

Slurm batch software

The science cluster is switching to slurm for batch management. At the moment the TCM and HEF nodes are all switched to slurm. The other nodes will follow.

Partitions

Jobs always run within a **partition**. There are currently two partitions: one for the TCM group, called tcm, and one for the HEF group, called hef. There is no default partition, so you always have to specify one, either on the command line as -p partitionname or in the shell script by including the line

 #SBATCH --partition=partitionname

Submitting your first job

To execute your job on the cluster you need to write it in the form of a shell script (don't worry, this is easier than it sounds). A shell script in its most basic form is just a list of commands that you would otherwise type on the command line. The first line of the script should contain an instruction telling the system which type of shell is supposed to execute it. Unless you really need something else you should always use bash. So without further ado, here is your first shell script.

 #! /bin/bash
 #SBATCH --partition=tcm 
 echo "Hello world!"

Type this into an editor (such as nano or vi) and save it as hello.sh. To execute this on the headnode give it executable permissions (not needed when submitting)

 $ chmod u+x hello.sh

and run it.

 $ ./hello.sh

It will print (//echo//) Hello world! to the screen (also called //standard out// or stdout). If anything goes wrong, an error message will be sent to //standard error// or stderr, which in this case is also the screen.

To execute this script as a //job// on one of the compute nodes we submit it to the cluster //scheduler//. This is done with the command sbatch.

 $ sbatch hello.sh

The scheduler will put your job in the named job //partition// and respond by giving you the //job number// (10 in this example).

 Submitted batch job 10

Your job will now wait until a slot on one of the compute nodes becomes available. Then it will be executed and the output is written to a file slurm-10.out in the directory from which you submitted the job (unless you specify otherwise as explained later).
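If you want to use the job number in a script of your own, for example to find the output file afterwards, you can extract it from the response line. The following is a minimal sketch that assumes the response has exactly the form shown above; the function name job_id_from_response is made up for illustration.

```shell
#! /bin/bash
# Hypothetical helper: extract the job number from sbatch's response line.
# Assumes the response looks like "Submitted batch job 10".
job_id_from_response() {
    echo "$1" | awk '/^Submitted batch job/ {print $4}'
}

# On the cluster you would capture the real response with
#   response=$(sbatch hello.sh)
# A literal example line is used here so the sketch runs anywhere.
response="Submitted batch job 10"
job_id=$(job_id_from_response "$response")
echo "Output will be written to slurm-${job_id}.out"
```

For the example line this prints "Output will be written to slurm-10.out".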

Inspecting jobs

At any stage after submitting a job you may inspect its status with squeue.

 $ squeue --job 10

It will print some information including its status which can be:

 * PD pending
 * R running
 * S suspended

It will also show the name of the job script, who submitted it and the time used so far. To see full information about your job use the scontrol command.

 $ scontrol show job 10
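If you check on jobs from a script, the short status codes listed above can be translated into words. This is a small sketch covering only those three codes; the function name describe_state is an assumption for illustration, and any other code is simply echoed back.

```shell
#! /bin/bash
# Hypothetical helper: translate a squeue status code into words.
# Only the three codes listed above are covered; anything else is reported as-is.
describe_state() {
    case "$1" in
        PD) echo "pending" ;;
        R)  echo "running" ;;
        S)  echo "suspended" ;;
        *)  echo "other ($1)" ;;
    esac
}

# On the cluster the code for job 10 could be obtained with
#   squeue -j 10 -h -o %t
# Example codes are used here so the sketch runs anywhere.
for code in PD R S; do
    echo "$code means $(describe_state "$code")"
done
```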

You can get a quick overview of how busy the cluster is with

 $ squeue

which lists all jobs sorted by the nodes they are running on.

More information

You may use the scontrol command to get information about the compute nodes, e.g. the number of CPUs, the available memory and requestable resources such as GPUs if available.

 $ scontrol show nodes

Finally there is the sall command, which gives a quick overview of all jobs on the cluster. It supports the -r, -s and -p flags for running, suspended and pending jobs respectively.

 $ sall -r

Understanding job priority

The scheduler determines which job to start first when resources become available based on the job's relative priority. This is a single floating point number calculated from various criteria.

Job priority is currently calculated based on the following criteria:

 * queue time: jobs that were submitted earlier get a higher priority;
 * the job's relative size: larger jobs requesting more resources such as nodes and CPUs get a higher priority because they are harder to schedule;
 * fair share: the priority of jobs submitted by a user increases or decreases depending on the resources (CPU time) that user consumed in the last week.

The sall command, when used in combination with the -p flag, lists each job's priority as well as an estimated start time.

Deleting jobs

If, after submitting a job, you discover that you made a mistake and want to delete your job you may do so with.

 $ scancel 10

Submitting multiple jobs

Sometimes you may be able to split your job into multiple independent tasks, for instance when you need to perform the same analysis on multiple data files. Splitting this work up into multiple tasks helps because they can then run simultaneously on different compute nodes. This could be accomplished by writing a loop that generates and submits job scripts, but an easier approach is to use //array jobs//. When calling sbatch with the --array flag, an array job consisting of multiple //tasks// is submitted to the scheduler. The number of the current task is available in the script's execution environment as $SLURM_ARRAY_TASK_ID.

So for example the following test_array_job.sh script

 #! /bin/bash
 #SBATCH --partition=tcm
 echo "This is task $SLURM_ARRAY_TASK_ID"

can be submitted with.

 $ sbatch --array 1-4 ./test_array_job.sh

The four tasks in this job will now be executed independently with the same script, the only difference being the value of $SLURM_ARRAY_TASK_ID. This value can be used in the script, for instance to select a different input file for each task of your data reduction, as in.

 #! /bin/bash
 my_command ~/input-$SLURM_ARRAY_TASK_ID

Do note however that the tasks will not necessarily be executed in order so they really need to be independent!
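Before submitting a large array job it can help to try the script by hand. The sketch below simulates the four tasks locally by setting SLURM_ARRAY_TASK_ID yourself; run_task is a hypothetical stand-in for the body of your job script, and the default value 1 is an assumption that only applies outside the scheduler.

```shell
#! /bin/bash
# Stand-in for the body of an array job script. The default value 1 is an
# assumption that only applies when the script runs outside the scheduler.
run_task() {
    task_id="${SLURM_ARRAY_TASK_ID:-1}"
    echo "task $task_id would process input-$task_id"
}

# Simulate "sbatch --array 1-4" locally by setting the variable by hand.
for id in 1 2 3 4; do
    SLURM_ARRAY_TASK_ID=$id run_task
done
```

Because each call only depends on its own task number, the order of the loop does not matter, which is exactly the property the scheduler relies on.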

Limiting the number of simultaneous tasks

If you submit a large array job (containing more than say a hundred tasks) you may want to limit the number of simultaneous task executions. You may do this using the % flag.

 $ sbatch --array 1-1000%20 ./test_array_job.sh

This will ensure that at most 20 tasks run at the same time.

Setting scheduler parameters in the job script

Instead of entering all flags to sbatch on the command line one can also choose to write them into the job script itself. For this simply prefix them with #SBATCH. This is especially useful for the following flags.

Changing the default logging location

By default standard output and error are written to a file named after the job number (slurm-<jobnumber>.out) in the directory from which the job was submitted. This can be changed with the --output and (optionally) --error flags.

 #! /bin/bash
 # Choose partition to run job in
 #SBATCH --partition=hef  
 # Send output to test.log in the directory the job was submitted from
 # (~ is not expanded on #SBATCH lines, so use a relative or full path)
 #SBATCH --output=test.log

Getting email updates about your job

Constantly watching squeue output to see when your job starts running and when it's done is not very efficient. Fortunately you can also request the scheduler to send you email in these situations.

 #! /bin/bash
 #SBATCH --mail-type=ALL
 #SBATCH --mail-user=<username>@science.ru.nl

This will get you email when your job begins, ends or fails. If you only want email when it is done you can use.

 #! /bin/bash
 #SBATCH --mail-type=END
 #SBATCH --mail-user=<username>@science.ru.nl

Time limits

To allow the scheduler to efficiently distribute the available compute time **each job has a one hour time limit**. If your job exceeds its limit (counted from the moment execution started on the compute node) it will be killed automatically. Now what if you know your job will need more than one hour to complete? In this case you can simply request more time with the --time flag.

So for instance if you expect to need 12 hours of compute time for your job add the following to your script.

 #! /bin/bash
 #SBATCH --time=12:00:00

For multi-day walltime you can simply add an extra days field as in --time=DD-HH:MM:SS.
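If you compute the time limit in a script, the value for --time can be built from a number of hours. This is a minimal sketch; the function name slurm_time is made up for illustration and it only handles whole hours.

```shell
#! /bin/bash
# Hypothetical helper: turn a whole number of hours into a --time value.
# The days field is only used when the request is 24 hours or longer.
slurm_time() {
    hours=$1
    days=$((hours / 24))
    rest=$((hours % 24))
    if [ "$days" -gt 0 ]; then
        printf '%d-%02d:00:00\n' "$days" "$rest"
    else
        printf '%02d:00:00\n' "$rest"
    fi
}

slurm_time 12    # half a day
slurm_time 168   # one week
```

The two calls print 12:00:00 and 7-00:00:00, which can be passed to sbatch as --time=12:00:00 and --time=7-00:00:00.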


If your job needs an excessive amount of memory to run you should request it using

 #! /bin/bash
 #SBATCH --mem=64G

for 64GB as an example. Please only use this if you know for certain that your job needs much more than 8GB of memory!

Nodes and cores

If your code uses parallelization using threads, OpenMP or MPI for instance, you may want to request more CPU resources. This can be in the form of more than one core on a single machine / node, more than one machine or a combination of both. These resources are requested with the -N flag combined with the -n flag as follows.

 #! /bin/bash
 #SBATCH -N 1 -n 8

requests 8 cores on one node.

 #! /bin/bash
 #SBATCH -N 10 -n 16

requests 16 cores in total, divided over 10 compute nodes. Note that this helps **only if your code is parallelized**! Even if you think it is, parallelization may still be disabled due to compile time selections, so always check! You can do this, for instance, by running your code briefly on the head node and inspecting its CPU usage with top: if it rises above 100%, the code uses some kind of parallelization.

For MPI jobs you may only care about the total number of CPU cores used, not whether they are located on one machine or distributed over the system. In that case you may use the -n flag by itself.

Requesting large numbers of cores and/or nodes reduces the probability that your job can be scheduled at each scheduling interval, thereby pushing its start time further into the future, so **only request what you need**!
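To check from inside a job what was actually granted, the job script can print its allocation. The sketch below reads SLURM_JOB_NUM_NODES and SLURM_NTASKS, environment variables that Slurm sets for a job; the function name describe_allocation and the fallback values are assumptions for illustration.

```shell
#! /bin/bash
#SBATCH -N 1 -n 8
# Report what the scheduler granted. The fallback values after ":-" are
# assumptions that only apply when the script runs outside the scheduler.
describe_allocation() {
    nodes="${SLURM_JOB_NUM_NODES:-1}"
    tasks="${SLURM_NTASKS:-1}"
    echo "running with $tasks task(s) on $nodes node(s)"
}

describe_allocation
```

Putting a line like this at the top of a job script makes it easy to see in the output file whether the request matched what your code needs.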

Advanced node selection

Sometimes you may need more control over which nodes are used (for instance when your data is located in the local /scratch of a particular node). You can specify explicit node names with the -w flag (optionally multiple, separated by commas) as follows.

 #! /bin/bash
 #SBATCH -N 2 -n 4 -w cn90,cn91

This will request 4 cores in total, spread over cn90 and cn91.


Here are a couple of example scripts (feel free to add your own). Unless otherwise stated just save them as my_job.sh and submit them to the scheduler with.

 $ sbatch my_job.sh

Run a Python script

This job script, when submitted, executes a Python script ~/my_script.py on a random available node.

 #! /bin/bash
 /usr/bin/python ~/my_script.py

Process a list of data files using an array job

Let's say filenames.txt is a text file containing 100 filenames you wish to process, for instance created with

 ls *.dat > filenames.txt

You may now combine array jobs with clever use of the awk command to select the current filename from this file.

Submit the following script with sbatch --array 1-100 my_job_script.sh.

 #! /bin/bash
 # Each task needs 1.5 hours of runtime
 #SBATCH --time=01:30:00
 # Pick the line of filenames.txt that matches this task's number
 INPUT_FILE=$(awk "NR==$SLURM_ARRAY_TASK_ID" filenames.txt)
 my_command "$INPUT_FILE"
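The line selection can be tried out without the scheduler. The following sketch builds a small stand-in for filenames.txt in a temporary directory and picks one line from it; the helper name nth_line, the three example filenames and the fallback task number 2 are all assumptions for illustration.

```shell
#! /bin/bash
# Build a small stand-in for filenames.txt in a temporary directory
# and remove it again when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' a.dat b.dat c.dat > "$workdir/filenames.txt"

# Hypothetical helper: print line number $1 of file $2.
nth_line() {
    awk -v n="$1" 'NR == n' "$2"
}

# Outside the scheduler SLURM_ARRAY_TASK_ID is unset, so fall back to task 2.
task_id="${SLURM_ARRAY_TASK_ID:-2}"
input_file=$(nth_line "$task_id" "$workdir/filenames.txt")

# A task number beyond the end of the list yields an empty filename.
if [ -z "$input_file" ]; then
    echo "no filename for task $task_id" >&2
else
    echo "task $task_id would process $input_file"
fi
```

Checking for an empty result guards against submitting more tasks than there are lines in the list.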

Run a large data reduction job

This is an example of a job that needs a significant amount of resources on a single specific node (//cn90// in this example).

 #! /bin/bash
 # Request 10 CPU cores and 64GB of memory for 7 days on cn90
 #SBATCH --partition=tcm
 #SBATCH -N 1 -n 10
 #SBATCH -w cn90
 #SBATCH --mem=64G
 #SBATCH --time=7-00:00:00
 # Get email updates
 #SBATCH --mail-type=ALL
 #SBATCH --mail-user=<username>@science.ru.nl

This assumes you have your data stored in /scratch (local storage) on cn90.science.ru.nl. If your application is not IO limited, simply store your data in /vol/tcm# or your home directory ~/ instead and remove the -w line to allow the scheduler to pick a random node.

Run an OpenMP job

OpenMP is frequently used for multi-threaded parallel programming. To use OpenMP on the cluster make sure your code is compiled with the -fopenmp flag. Any CPU cores requested will then be picked up automatically (i.e. there is no need to set the OMP_NUM_THREADS environment variable). In fact, setting it will probably lead to slower code if it forces the number of threads above the number of CPU cores available, since that causes overhead.

So the following example code for a parallel for loop:

 #include <omp.h>
 int main(int argc, char *argv[]) {
   const int N = 100000;
   int i, a[N];

   #pragma omp parallel for
   for (i = 0; i < N; i++)
     a[i] = 2 * i;

   return 0;
 }

can be compiled and run on the cluster with the following job script.

 #! /bin/bash
 # Request 4 CPU cores for this OpenMP code
 #SBATCH -N 1 -n 4
 # First compile it
 gcc -O2 -mtune=native -fopenmp my_code.c -o my_code 
 # and run
 ./my_code

Run an MPI job

The cluster scheduler has full built-in support for OpenMPI, so it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Instead you can simply call mpirun from your job script; it will be aware of all nodes and cores requested and use them accordingly.

 #! /bin/bash
 # Request 32 processor cores randomly distributed over nodes
 #SBATCH -n 32
 # And 12 hours of runtime
 #SBATCH --time=12:00:00
 # Get email when it's done
 #SBATCH --mail-type=END
 #SBATCH --mail-user=<username>@astro.ru.nl
 mpirun my_mpi_application

Copy data to local scratch from external server

This copies data from an external location to local scratch storage using SCP. If this takes a significant amount of time, or has to be done regularly you may want to submit a job to the scheduler to do it //overnight//.

Generate keys

**You only need to do this once!**

The job script, when executed, cannot ask you for your password, so we need to set up access via a public-private keypair first. Create a new keypair on the headnode with

 ssh-keygen -t rsa

and hit enter a bunch of times (do **not** enter a passphrase, otherwise the job script will be asked for it and cannot answer). Now copy your public key to the remote host containing the data.

 ssh-copy-id -i ~/.ssh/id_rsa.pub 

Copy the data

Now you may submit the following script to the scheduler.

 #! /bin/bash
 #SBATCH --partition=tcm
 # Select the node
 #SBATCH -w cn90
 # Get email when it's done
 #SBATCH --mail-type=END
 #SBATCH --mail-user=<username>@science.ru.nl
 scp -r :/path/over/there /scratch/path/over/here

Interactive jobs

If you need interactive access (i.e. a shell) you can request it from the scheduler with the following steps.

1. Request an allocation of resources (for instance 1 core for 2 hours on a random node):

 salloc -c 1 --partition=tcm --time 2:00:00

2. Attach a shell to this allocation:

 srun --pty bash

You then get a prompt on an available node and can start working. This shell will be automatically closed after two hours.
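When switching between the headnode and an interactive shell it is easy to lose track of where a command will run. The sketch below checks SLURM_JOB_ID, an environment variable that Slurm sets inside jobs and allocations; the function name where_am_i is made up for illustration.

```shell
#! /bin/bash
# Hypothetical helper: report whether this shell runs inside a Slurm job.
# Relies on SLURM_JOB_ID, which the scheduler sets for jobs and allocations.
where_am_i() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "inside job $SLURM_JOB_ID on $(uname -n)"
    else
        echo "not inside a job"
    fi
}

where_am_i
```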