Slurm batch software

The science cluster is switching to Slurm for batch management. At the moment the TCM and HEF nodes have all been switched to Slurm; the other nodes will follow.

Partitions

Jobs always run within a partition. There are currently two partitions: one for the TCM group, called tcm, and one for the HEF group, called hef. There is no default partition, so you always have to specify one, either on the command line as -p partitionname or in the shell script by including a line

 #SBATCH -p partitionname
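
The same option can also be given on the command line when you submit. A minimal sketch, where myjob.sh stands in for your own job script:

 $ sbatch -p tcm myjob.sh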

Submitting your first job

To execute your job on the cluster you need to write it in the form of a shell script (don't worry, this is easier than it sounds). A shell script in its most basic form is just a list of commands that you would otherwise type on the command line. The first line of the script should contain an instruction telling the system which shell is supposed to execute it. Unless you really need something else, you should always use bash. So, without further ado, here is your first shell script.

 #! /bin/bash
 #SBATCH -p tcm 
 echo "Hello world!"

Type this into an editor (such as nano or vi) and save it as hello.sh. To execute it on the headnode, give it executable permissions (not needed when submitting)

 $ chmod u+x hello.sh

and run it.

 $ ./hello.sh

It will print (echo) "Hello world!" to the screen (also called standard out or stdout). If anything goes wrong, an error message will be sent to standard error (stderr), which in this case is also the screen.
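
As an aside, here is a minimal bash sketch (separate from hello.sh) showing the two streams; the >&2 redirection sends a line to stderr instead of stdout:

 #! /bin/bash
 echo "this goes to stdout"        # normal output
 echo "this goes to stderr" >&2    # redirected to standard error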

To execute this script as a job on one of the compute nodes, we submit it to the cluster scheduler. This is done with the command sbatch.

 $ sbatch hello.sh

The scheduler will place your job in the partition you named and respond by giving you the job number (10 in this example).

 Submitted batch job 10
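
If you submit from a script and only want the number itself, sbatch has a --parsable option (a standard Slurm option, not used elsewhere on this page) that prints just the job ID:

 $ jobid=$(sbatch --parsable hello.sh)
 $ echo "$jobid"
 10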

Your job will now wait until a slot on one of the compute nodes becomes available. It will then be executed and its output will be written to a file slurm-10.out in your home directory (unless you specify otherwise, as explained later).
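
As a preview of that later explanation, a sketch assuming the standard sbatch --output option is what is meant; the pattern %j expands to the job number:

 #! /bin/bash
 #SBATCH -p tcm
 #SBATCH --output=hello-%j.out
 echo "Hello world!"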

Inspecting jobs

At any stage after submitting a job you may inspect its status with squeue.

 $ squeue --job 10

It will print some information, including its status, which can be:

 * PD pending
 * R running
 * S suspended

It will also show the name of the job script, who submitted it, and the time used so far. To see full information about your job, use the scontrol command

 $ scontrol show job 10
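
scontrol prints many fields at once; to pick out a single one you can pipe it through grep. A minimal sketch that extracts the JobState field:

 $ scontrol show job 10 | grep -o "JobState=[A-Z]*"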

You can get a quick overview of how busy the cluster is with

 $ squeue

which lists all jobs sorted by the nodes they are running on.
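
If that list gets long, squeue can be narrowed down with its standard filter options, for example by user (-u) or by job state (-t):

 $ squeue -u $USER    # only your own jobs
 $ squeue -t PD       # only pending jobs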

More information

You may use the scontrol command to get information about the compute nodes, e.g. the number of CPUs, the available memory, and requestable resources such as GPUs if available.

 $ scontrol show nodes
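
scontrol also accepts a single node name if you only want to inspect one machine (cn01 is a hypothetical name here):

 $ scontrol show node cn01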

Finally, there is the sall command, which gives a quick overview of all jobs on the cluster. It supports the options -r, -s and -p for running, suspended and pending jobs respectively.

 $ sall -r