Difference between revisions of "Slurm"
Revision as of 15:18, 6 October 2015
Slurm batch software
The science cluster is switching to Slurm for batch management. At the moment the TCM and HEF nodes have all been switched to Slurm. The other nodes will follow.
Jobs always run within a **partition**. There are currently two partitions: one for the TCM group, called tcm, and one for the HEF group, called hef. There is no default partition, so you always have to specify one, either on the command line as -p partitionname or in the shell script by including the line
#SBATCH -p partitionname
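Since there is no default partition, it can help to check a job script for a partition line before submitting it. The following is a minimal sketch, not part of Slurm itself: the function name has_partition and the file name check_partition.sh are made up for illustration, and the check only recognises the spellings -p name, --partition name and --partition=name.

```shell
#! /bin/bash
# Hypothetical helper: succeed if the job script contains an #SBATCH line
# that selects a partition (-p NAME, --partition NAME or --partition=NAME).
has_partition() {
    # $1 = path to a job script
    grep -Eq '^#SBATCH[[:space:]]+(-p|--partition)([[:space:]=]|$)' "$1"
}

# Usage: ./check_partition.sh hello.sh
if [ -n "${1:-}" ]; then
    if has_partition "$1"; then
        echo "$1 names a partition"
    else
        echo "$1 has no #SBATCH -p line; pass -p on the command line instead" >&2
    fi
fi
```

If the check fails you can still submit the script by giving the partition on the command line, as described above.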
Submitting your first job
To execute your job on the cluster you need to write it in the form of a shell script (don't worry, this is easier than it sounds). A shell script in its most basic form is just a list of commands that you would otherwise type on the command line. The first line of the script should contain an instruction telling the system which type of shell is supposed to execute it. Unless you really need something else you should always use bash. So without further ado, here is your first shell script.
#! /bin/bash
#SBATCH -p tcm
echo "Hello world!"
Type this into an editor (such as nano or vi) and save it as hello.sh. To execute this on the headnode give it executable permissions (not needed when submitting)
$ chmod u+x hello.sh
and run it.
It will print (//echo//) Hello world! to the screen (also called //standard out// or stdout). If anything goes wrong an error message will be sent to //standard error// or stderr, which in this case is also the screen.
To execute this script as a //job// on one of the compute nodes we submit it to the cluster //scheduler//. This is done with the command sbatch.
$ sbatch hello.sh
The scheduler will put your job in the named job //partition// and respond by giving you the //job number// (10 in this example).
Submitted batch job 10
Your job will now wait until a slot on one of the compute nodes becomes available. Then it will be executed and the output is written to a file slurm-10.out in your home directory (unless you specify otherwise as explained later).
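If you submit jobs from another script, you will usually want to keep the job number for later use. The sketch below simply takes the last word of the reply shown above; the function name parse_job_id is made up for illustration, and it assumes the reply has exactly the form "Submitted batch job N".

```shell
#! /bin/bash
# Extract the job number from sbatch's reply, e.g. "Submitted batch job 10".
parse_job_id() {
    # The job number is the last space-separated word of the reply,
    # so strip everything up to and including the last space.
    local reply="$1"
    echo "${reply##* }"
}

# On the cluster you would use it like this (commented out so that the
# sketch also runs on a machine without sbatch):
#   reply=$(sbatch hello.sh)
#   jobid=$(parse_job_id "$reply")
#   echo "Output will appear in slurm-$jobid.out"

parse_job_id "Submitted batch job 10"   # prints 10
```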
At any stage after submitting a job you may inspect its status with squeue.
$ squeue --job 10
It will print some information including its status which can be:
* PD pending
* R running
* S suspended
It will also show the name of the job script, who submitted it and the time used so far. To see full information about your job use the scontrol command.
$ scontrol show job 10
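The status codes listed above are easy to translate in a script of your own. The following sketch covers only the three codes mentioned on this page; the function name describe_state is made up for illustration, and any other code is reported as unknown.

```shell
#! /bin/bash
# Translate a compact squeue status code into a readable word.
describe_state() {
    case "$1" in
        PD) echo "pending" ;;
        R)  echo "running" ;;
        S)  echo "suspended" ;;
        *)  echo "unknown ($1)" ;;   # codes not covered on this page
    esac
}

# On the cluster the code for job 10 can be obtained with squeue's
# --noheader and --format options (%t is the compact state):
#   state=$(squeue --job 10 --noheader --format=%t)
#   describe_state "$state"

describe_state PD   # prints pending
```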
You can get a quick overview of how busy the cluster is with
which lists all jobs sorted by the nodes they are running on.
You may use the scontrol command to get information about the compute nodes, e.g. the number of CPUs, the available memory and requestable resources such as GPUs if available.
$ scontrol show nodes
Finally there is the sall command that gives a quick overview of all jobs on the cluster. It supports the flags -r, -s and -p for running, suspended and pending jobs respectively.
$ sall -r
Understanding job priority
The scheduler determines which job to start first when resources become available based on the job's relative priority. This is a single floating point number calculated from various criteria.
Job priority is currently calculated based on the following criteria:
* queue time: jobs that were submitted earlier get a higher priority;
* relative job size: larger jobs requesting more resources such as nodes and CPUs get a higher priority because they are harder to schedule;
* fair share: the priority of jobs submitted by a user increases or decreases depending on the resources (CPU time) that user consumed in the last week.
The sall command, when used in combination with the -p flag, lists each job's priority as well as an estimated start time.
If, after submitting a job, you discover that you made a mistake and want to delete your job, you may do so with:
$ scancel 10
Submitting multiple jobs
Sometimes you may be able to split your job into multiple independent tasks, for instance when you need to perform the same analysis on multiple data files. Splitting this work up into multiple tasks helps because they can then run simultaneously on different compute nodes. This could be accomplished by writing a loop that generates and submits job scripts, but an easier approach is to use //array jobs//. When sbatch is called with the --array flag, an array job consisting of multiple //tasks// is submitted to the scheduler. The number of the current task is available in the script's execution environment as $SLURM_ARRAY_TASK_ID.
So for example the following test_array_job.sh script
#! /bin/bash
#SBATCH -p tcm
echo "This is task $SLURM_ARRAY_TASK_ID"
can be submitted with:
$ sbatch --array 1-4 ./test_array_job.sh
The four tasks in this job will now be executed independently with the same script, the only difference being the value of $SLURM_ARRAY_TASK_ID. This value can be used in the script, for instance to select a different input file for each task of your data reduction, as in:
#! /bin/bash
my_command ~/input-$SLURM_ARRAY_TASK_ID
Do note however that the tasks will not necessarily be executed in order so they really need to be independent!
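When your input files are not conveniently numbered, one option is to keep a list of them in a text file and let each task pick the line matching its task number. The following is a minimal sketch under that assumption: the list file inputs.txt and the function name nth_line are made up for illustration, and my_command is the placeholder used above.

```shell
#! /bin/bash
#SBATCH -p tcm

# Print line N of a file, where N is a 1-based line number.
nth_line() {
    # $1 = file with one input path per line, $2 = line number
    sed -n "${2}p" "$1"
}

# Outside an array job SLURM_ARRAY_TASK_ID is not set; fall back to 1
# so that the script can be tried out on the headnode.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# inputs.txt is a hypothetical list of input files, one per line.
if [ -f inputs.txt ]; then
    input=$(nth_line inputs.txt "$task_id")
    echo "Task $task_id would run: my_command $input"
fi
```

Submitted with --array 1-4, task 3 would pick up the third line of inputs.txt. Because each task only reads its own line, the tasks stay independent of the order in which they run.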
Limiting the number of simultaneous tasks
If you submit a large array job (containing more than, say, a hundred tasks) you may want to limit the number of tasks that run simultaneously. You can do this by appending % and the limit to the array range.
$ sbatch --array 1-1000%20 ./test_array_job.sh
This ensures that at most 20 tasks run at the same time.
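If the number of tasks is only known at submission time, the array range can be assembled in a small wrapper script. This is a sketch only; the function name array_spec is made up for illustration and it always starts the range at 1.

```shell
#! /bin/bash
# Build the argument for --array from a task count and a concurrency limit.
array_spec() {
    # $1 = number of tasks, $2 = maximum number of tasks running at once
    printf '1-%s%%%s\n' "$1" "$2"
}

# On the cluster you would pass the result to sbatch (commented out so
# that the sketch also runs on a machine without sbatch):
#   sbatch --array "$(array_spec 1000 20)" ./test_array_job.sh

array_spec 1000 20   # prints 1-1000%20
```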
Setting scheduler parameters in the job script
Instead of entering all flags to sbatch on the command line one can also choose to write them into the job script itself. For this simply prefix them with #SBATCH. This is especially useful for the following flags.
Changing the default logging location
By default, standard output and standard error are both written to a file in your home directory that is named after the job number (slurm-10.out in the earlier example). This can be changed with the --output and (optionally) --error flags.
#! /bin/bash
# Choose partition to run job in
#SBATCH -p hef
# Send output to test.log in your home directory
#SBATCH --output ~/test.log
Getting email updates about your job
Constantly watching squeue output to see when your job starts running and when it's done is not very efficient. Fortunately you can also request the scheduler to send you email in these situations.
#! /bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<username>@science.ru.nl
This will get you email when your job begins, ends or fails. If you only want email when it is done you can use:
#! /bin/bash
#SBATCH --mail-type=END
#SBATCH --mail-user=<username>@science.ru.nl
To allow the scheduler to efficiently distribute the available compute time **each job has a one hour time limit**. If your job exceeds its limit (counted from the moment execution started on the compute node) it will be killed automatically. Now what if you know your job will need more than one hour to complete? In this case you can simply request more time with the --time flag.
So for instance if you expect to need 12 hours of compute time for your job add the following to your script.
#! /bin/bash
#SBATCH --time=12:00:00
For multi-day walltime you can simply add an extra days field as in --time=DD-HH:MM:SS.
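If you prefer to think in hours, the value for --time can be computed for you. The sketch below turns a whole number of hours into the HH:MM:SS form, or into the form with a days field once the total reaches 24 hours. The function name walltime is made up for illustration; it expects a plain non-negative integer without leading zeros.

```shell
#! /bin/bash
# Format a whole number of hours as a value for --time.
walltime() {
    # $1 = total number of hours
    local hours="$1"
    local days=$(( hours / 24 ))
    local rest=$(( hours % 24 ))
    if [ "$days" -gt 0 ]; then
        # One day or more: use the form with a days field
        printf '%d-%02d:00:00\n' "$days" "$rest"
    else
        printf '%02d:00:00\n' "$rest"
    fi
}

walltime 12   # prints 12:00:00
walltime 60   # prints 2-12:00:00
```

The result can be given on the command line, for example as sbatch --time="$(walltime 60)" followed by the name of your job script.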