Software cluster

From Cncz
Revision as of 11:52, 13 June 2008 by Petervc (talk | contribs)

Cluster software

On the cnXX cluster, the Sun Grid Engine cluster software has been installed.


  • Make sure you can 'ssh' to all cluster machines without a password. This involves:
    • Creating ".ssh/authorized_keys" by:
      • ssh-keygen (with empty passphrase)
      • cp ~/.ssh/ ~/.ssh/authorized_keys
      • Adding host keys for all cluster hosts to ~/.ssh/known_hosts, e.g. by ssh-ing to each host once.
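
The host-key step above can be scripted. This is a minimal sketch, assuming the hosts are named cn00 through cn38 (the range that appears in the hostgroup list on this page). The function names cluster_hosts and visit_hosts and the DRY_RUN variable are made up for illustration. By default the ssh commands are only printed.

```shell
#!/bin/sh
# Print the cluster host names cn00 .. cn38 (assumed range; adjust as needed).
cluster_hosts() {
    i=0
    while [ "$i" -le 38 ]; do
        printf 'cn%02d\n' "$i"
        i=$((i + 1))
    done
}

# Visit every host once so that its host key ends up in ~/.ssh/known_hosts.
# With DRY_RUN=1 (the default in this sketch) the commands are only printed.
visit_hosts() {
    for host in $(cluster_hosts); do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "ssh $host true"
        else
            ssh "$host" true
        fi
    done
}

visit_hosts
```

Run it with DRY_RUN=0 to actually connect. On the first connection, ssh will still ask you to confirm each new host key.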

  • Import GridEngine settings into your shell:
    • (For csh/tcsh)
    source /vol/GridEngine/default/common/settings.csh
    • (For sh/bash)
    . /vol/GridEngine/default/common/

This will set or expand the following environment variables:

   - $SGE_ROOT         (always necessary)
   - $SGE_CELL         (if you are using a cell other than >default<)
   - $SGE_QMASTER_PORT (if you haven't added the service >sge_qmaster<)
   - $SGE_EXECD_PORT   (if you haven't added the service >sge_execd<)
   - $PATH/$path       (to find the Grid Engine binaries)
   - $MANPATH          (to access the manual pages)
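
To check that the settings were picked up by the current shell, something like the following can be used. It is only a sketch: the function name check_sge_env is hypothetical, and it tests just two of the items listed above ($SGE_ROOT and whether qsub can be found via $PATH).

```shell
#!/bin/sh
# Report whether the Grid Engine environment looks usable in this shell.
# Returns 0 if $SGE_ROOT is set and qsub is found on $PATH, 1 otherwise.
check_sge_env() {
    status=0
    if [ -z "${SGE_ROOT:-}" ]; then
        echo "SGE_ROOT is not set; source the settings file first" >&2
        status=1
    fi
    if ! command -v qsub >/dev/null 2>&1; then
        echo "qsub not found in PATH" >&2
        status=1
    fi
    return "$status"
}

if check_sge_env; then
    echo "Grid Engine environment OK (SGE_ROOT=$SGE_ROOT)"
else
    echo "Grid Engine environment incomplete"
fi
```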
  • You can only submit shell scripts. As an example, take the shell script '~/date':
    #! /bin/sh

To submit this script, just enter: qsub -cwd ~/date. The output and error streams are written to the files date.o$jobnumber and date.e$jobnumber. Because of -cwd, these files are created in the current directory, not in the home directory.
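
The file naming can be sketched as follows. The helper names are hypothetical, and the confirmation message format ('Your job 1234 ("date") has been submitted') is an assumption about what qsub prints on this installation. The sample message below is made up, not real output.

```shell
#!/bin/sh
# Extract the job number from a qsub confirmation message.
# Assumed message format: Your job 1234 ("date") has been submitted
job_id_from_message() {
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Print the default output and error file names for a job submitted with -cwd.
# $1: submitted script, $2: job number
job_output_files() {
    name=$(basename "$1")
    echo "$PWD/${name}.o$2"
    echo "$PWD/${name}.e$2"
}

# On the cluster one would use: msg=$(qsub -cwd ~/date)
msg='Your job 1234 ("date") has been submitted'   # sample text for illustration
job_output_files "$HOME/date" "$(job_id_from_message "$msg")"
```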

If you want this job to run on a specific host (for example 'cn00'), you can use: qsub -q '*@cn00' ~/date.

We have configured hostgroups:

@allhosts      (cn00 + all hostgroups below)
@mlfhosts      cn16 cn17 cn18 cn19 cn26 cn27 cn28 cn29
@snnhosts      cn01 cn02 cn03 cn04 cn05
@thchemhosts   cn10 cn11 cn12 cn13 cn14 cn15
@kristalhosts  cn20 cn21 cn22 cn23 cn24 cn25
@tcmhosts      cn06 cn07 cn08 cn09 cn30 cn31 cn32 cn33 cn34 cn35 cn36 cn37 cn38

so you can use:

   qsub -q '*@@mlfhosts,*@@tcmhosts' ~/date

If it is not a 'hard' requirement, but only a 'soft' preference to run on a certain hostgroup:

qsub -soft -q '*@@mlfhosts' ~/date
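
The '*@@hostgroup' syntax is easy to mistype, so a small helper can build the -q argument from hostgroup names. This is a sketch only: queue_spec is a hypothetical function name, and the qsub commands are printed rather than executed.

```shell
#!/bin/sh
# Build a queue specification such as *@@mlfhosts,*@@tcmhosts
# from one or more hostgroup names (with or without the leading @).
queue_spec() {
    spec=""
    for group in "$@"; do
        group=${group#@}                  # accept "mlfhosts" and "@mlfhosts"
        spec="${spec:+$spec,}*@@$group"
    done
    printf '%s\n' "$spec"
}

# Hard requirement on two hostgroups, then a soft preference for one.
echo "qsub -q '$(queue_spec mlfhosts tcmhosts)' ~/date"
echo "qsub -soft -q '$(queue_spec @mlfhosts)' ~/date"
```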

A useful option of 'qsub' is -p priority, which is also available for qsh, qrsh, qlogin and qalter. It defines or redefines the priority of the job relative to other jobs. Priority is an integer in the range -1023 to 1024; the default priority value for jobs is 0. Users may only decrease the priority of their jobs. Grid Engine managers and administrators may also increase it. The priority normally only matters when Grid Engine decides which job to start: a pending job with a higher priority becomes eligible for dispatch earlier. Grid Engine normally does not touch running jobs.
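
As a sketch of how -p might be used from a script, the following checks a priority value against the range given above before printing the corresponding qsub command. The function name valid_priority is hypothetical, and the command is only echoed, not run.

```shell
#!/bin/sh
# Succeeds if $1 is an integer in the range -1023..1024 stated above.
valid_priority() {
    case $1 in
        ''|-|*[!0-9-]*|?*-*) return 1 ;;   # empty, lone "-", or not an integer
    esac
    [ "$1" -ge -1023 ] && [ "$1" -le 1024 ]
}

# Ordinary users can only lower the priority, so a negative value is used here.
prio=-100
if valid_priority "$prio"; then
    echo "qsub -p $prio ~/date"
else
    echo "invalid priority: $prio" >&2
fi
```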

The 'nice' command can also be useful when running on other people's hosts:

	nice - run a program with modified scheduling priority
	       -n, --adjustment=N
	       add integer N to the niceness (default 10)
  • For an MPI job:

First create an 'smpd passphrase' file:

	touch ~/.smpd
	chmod 600 ~/.smpd
	echo "phrase=MyOwnPassword" > ~/.smpd

Please replace MyOwnPassword with a passphrase of your own! This password is only used for MPI; do not use your login password.

Then create a job script, for example:


	#!/bin/sh -x
	#$ -S /bin/sh
	# sample mpich2 job
	# you will need to adjust the $PATH to your mpich2 installation
	# be sure to get the correct mpiexec for mpich2_smpd!!!
	export PATH=/usr/local/mpich2_smpd/bin:$PATH
	port=$((JOB_ID % 5000 + 20000))
	echo "Got $NSLOTS slots."
	mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port ~/mpihello
	exit 0

This script runs the program compiled from ~/mpihello.c:

	#include <stdio.h>
	#include "mpi.h"
	int main(int argc, char** argv)
	{
	 int noprocs, nid;
	 MPI_Init(&argc, &argv);
	 MPI_Comm_size(MPI_COMM_WORLD, &noprocs);  /* total number of processes */
	 MPI_Comm_rank(MPI_COMM_WORLD, &nid);      /* rank of this process */
	 if (nid == 0)
	  printf("Hello world! I'm node %i of %i \n", nid, noprocs);
	 MPI_Finalize();
	 return 0;
	}

It has been compiled with: /usr/local/mpich2_smpd/bin/mpicc mpihello.c -lmpich -o mpihello

One needs to choose the parallel environment 'mpich2_smpd' and the number of slots: qsub -pe mpich2_smpd 2 ~/
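
The port arithmetic in the job script above maps each job number to a port between 20000 and 24999. The sketch below repeats that calculation for a few made-up job numbers. The function name smpd_port is hypothetical. Note that two jobs whose numbers differ by a multiple of 5000 get the same port.

```shell
#!/bin/sh
# Same arithmetic as in the job script: JOB_ID % 5000 + 20000
smpd_port() {
    echo $(( $1 % 5000 + 20000 ))
}

# Example job numbers, chosen only for illustration.
for id in 1 4999 5000 123456; do
    echo "JOB_ID=$id -> port $(smpd_port "$id")"
done
```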

  • Other interesting commands:
    qstat - show the status of Grid Engine jobs and queues
    qmod  - modify a Grid Engine queue
        -cj    Clears the error state of the specified job(s).
        -cq    Clears the error state of the specified queue(s).

If a job fails and puts the queue on a certain machine into the error (E) state, contact a system administrator, who can clear this error state by entering something like: qmod -c all.q@cn16