Software cluster


This post is over 5 years old, it may be out of date.

Cluster software

Warning

All cluster nodes will be moved to SLURM. The text below deals with the older Grid Engine cluster software.

Previous Cluster software

On the cnXX-cluster the Oracle Grid Engine cluster software has been installed.

Usage:

  • You can only submit shell scripts. As an example, here is the shell script ‘~/date’:

    #! /bin/sh
    /bin/date
    

To submit this script, just enter: qsub -cwd ~/date. The output and error streams are written to the files date.o$jobnumber and date.e$jobnumber. Because of the -cwd option, these files are created in the current directory (the one you submitted the job from), not in your home directory.
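As a small illustration of this naming scheme, the sketch below derives the two file names from a script path and a job number. Grid Engine names the files after the job name, which defaults to the file name of the script. The function name job_output_files and the job number 1234 are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper, not part of Grid Engine: print the names of the
# output and error files for a job submitted from a given script.
# Usage: job_output_files SCRIPT JOBNUMBER
job_output_files() {
    # The default job name is the script's file name without its directory.
    name=$(basename "$1")
    printf '%s.o%s\n' "$name" "$2"   # standard output
    printf '%s.e%s\n' "$name" "$2"   # standard error
}

job_output_files "$HOME/date" 1234
```

For the job number 1234 this prints date.o1234 and date.e1234, one per line.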

If you want this job to run on a specific host (here, as an example, ‘cn00’), you can use: qsub -q '*@cn00' ~/date.

We have configured hostgroups:

qconf -shgrpl

shows which hostgroups exist,

qconf -shgrp <hostgroup>

shows which subhostgroups or hosts are in that hostgroup. So you can use:

   qsub -q '*@@mlfhosts,*@@tcmhosts' ~/date

If it is not a ‘hard’ requirement, but only a ‘soft’ preference to run on a certain hostgroup:

qsub -soft -q '*@@mlfhosts' ~/date
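When several hostgroups are involved, the queue list can be built rather than typed by hand. The following is only a sketch: hostgroup_queues is a hypothetical helper, and it merely reproduces the '*@@name' notation used in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: turn hostgroup names into a queue list for 'qsub -q'.
# Each hostgroup NAME becomes '*@@NAME' (any queue on any host in that group),
# and the entries are joined with commas.
hostgroup_queues() {
    list=""
    for group in "$@"; do
        list="${list:+$list,}*@@$group"
    done
    printf '%s\n' "$list"
}

hostgroup_queues mlfhosts tcmhosts   # prints *@@mlfhosts,*@@tcmhosts
```

The result can then be passed on, for example as qsub -q "$(hostgroup_queues mlfhosts tcmhosts)" ~/date.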

A useful option of ‘qsub’ is -p priority (available for qsub, qsh, qrsh, qlogin and qalter only). It defines or redefines the priority of the job relative to other jobs. The priority normally only matters when Grid Engine decides which pending job to start; Grid Engine normally leaves running jobs alone. Priority is an integer in the range -1023 to 1024, and the default value for jobs is 0. Users may only decrease the priority of their jobs; Grid Engine managers and administrators may also increase it. A pending job with a higher priority becomes eligible for dispatch by the Grid Engine scheduler earlier.
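The rules above mean that an ordinary user can pass any integer from -1023 up to the default of 0. A minimal sketch of that check follows; user_priority_ok is a hypothetical helper, not a Grid Engine command.

```shell
#!/bin/sh
# Hypothetical helper: succeed if PRIORITY is a value an ordinary user may
# pass to 'qsub -p', i.e. an integer from -1023 up to the default of 0.
user_priority_ok() {
    case "$1" in
        ''|-|*[!0-9-]*|?*-*) return 1 ;;   # not a plain, optionally negative integer
    esac
    [ "$1" -ge -1023 ] && [ "$1" -le 0 ]
}

for p in -100 0 5 -2000; do
    if user_priority_ok "$p"; then
        echo "$p: allowed"
    else
        echo "$p: not allowed for ordinary users"
    fi
done
```

A job that should yield to other jobs could then be submitted with, for example, qsub -p -100 -cwd ~/date.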

The ‘nice’ command can of course also be useful when you run programs on other people’s hosts:

    nice - run a program with modified scheduling priority
           -n, --adjustment=N
           add integer N to the niceness (default 10)
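To see what the adjustment does, the sketch below compares the niceness of the current shell with the niceness of a command started through nice -n 5. It assumes a ‘nice’ that prints the current niceness when called without arguments, as the GNU coreutils version does.

```shell
#!/bin/sh
# Run without arguments, 'nice' prints the niceness of the current shell.
base=$(nice)

# Start a command with its niceness raised by 5. The command here is 'nice'
# itself, so it simply reports the niceness it was started with.
raised=$(nice -n 5 nice)

echo "current niceness: $base, inside 'nice -n 5': $raised"
```

A long computation on someone else's host would be started the same way, with the program of your choice in place of the inner ‘nice’.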
  • For an MPI job:

First create an ‘smpd passphrase’ file:

    touch ~/.smpd
    chmod 600 ~/.smpd
    echo "phrase=MyOwnPassword" > ~/.smpd

Please replace MyOwnPassword with a passphrase of your own! This passphrase is only used for MPI; do not use your login password!
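The same file can also be created with a restrictive umask, so that it is never readable by others, not even briefly. This is a sketch only: write_smpd_passphrase is a hypothetical helper, and the demonstration writes to a temporary directory instead of the real ~/.smpd.

```shell
#!/bin/sh
# Hypothetical helper: write an smpd passphrase file that is private from
# the moment it is created.
# Usage: write_smpd_passphrase FILE PHRASE
# The body runs in a subshell, so the umask change does not leak out.
write_smpd_passphrase() (
    umask 077                          # new files get mode 600
    printf 'phrase=%s\n' "$2" > "$1"
    chmod 600 "$1"                     # also tighten a file that already existed
)

# Demonstration in a scratch directory; for real use the file is ~/.smpd.
demo_dir=$(mktemp -d)
write_smpd_passphrase "$demo_dir/.smpd" "MyOwnPassword"
ls -l "$demo_dir/.smpd"
cat "$demo_dir/.smpd"
```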

~/mpich2.sh:

    #!/bin/sh -x
    #
    #$ -S /bin/sh
    #
    # sample mpich2 job
    # you will need to adjust the $PATH to your mpich2 installation
    # be sure to get the correct mpiexec for mpich2_smpd!!!
    export PATH=/usr/local/mpich2_smpd/bin:$PATH
    port=$((JOB_ID % 5000 + 20000))
    echo "Got $NSLOTS slots."
    mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port ~/mpihello
    exit 0

which runs the compiled source of ~/mpihello.c:

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char **argv)
    {
        int noprocs, nid;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &noprocs);  /* total number of processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &nid);      /* rank of this process */
        if (nid == 0)                             /* only rank 0 prints */
            printf("Hello world! I'm node %i of %i\n", nid, noprocs);
        MPI_Finalize();
        return 0;
    }

that has been compiled with:

/usr/local/mpich2_smpd/bin/mpicc mpihello.c -lmpich -o mpihello

One needs to choose the parallel environment ‘mpich2_smpd’ and the number of slots:

qsub -pe mpich2_smpd 2 ~/mpich2.sh
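The line port=$((JOB_ID % 5000 + 20000)) in mpich2.sh derives a port number from the job number, presumably so that jobs running at the same time use different smpd ports (the motivation is an assumption; the arithmetic is taken from the script). The sketch below repeats that arithmetic in a hypothetical function smpd_port to show the resulting range of 20000 to 24999.

```shell
#!/bin/sh
# Same arithmetic as in mpich2.sh: fold the job number into 20000..24999.
smpd_port() {
    echo $(( $1 % 5000 + 20000 ))
}

smpd_port 1          # prints 20001
smpd_port 12345      # prints 22345
smpd_port 17345      # prints 22345 as well: job numbers 5000 apart collide
```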

  • Other interesting commands:

    qstat - show the status of Grid Engine jobs and queues
    qmod  - modify a Grid Engine queue
            -cj  Clears the error state of the specified job(s).
            -cq  Clears the error state of the specified queue(s).

If a job fails and puts the queue on a certain machine in the error (E) state, contact a system administrator at postmaster@science.ru.nl, who can clear this error state by entering something like:

qmod -c all.q@cn16

qmon - X-Windows OSF/Motif graphical user’s interface for Grid Engine

This graphical user’s interface for Grid Engine can be started on cn99 with:

ssh -X cn99 qmon