Contents

 

Batch systems

Running jobs with SLURM

Cancelling Jobs

Getting job information

Modules and Provided software 

Examples of submission scripts

Running MPI jobs

Running OpenMP jobs

The Debug Partition

Interactive Jobs

Getting Help

 

 

If you haven't yet used the clusters, please see the guide on how to connect.

 

In the following examples we use "username" to mean your username on the clusters.

 

Batch systems

 

The key to using the clusters is to keep in mind that all jobs or work must be handed to a program called a batch system, which schedules and runs your tasks as resources become available. Except in rare cases there is no real-time interaction, and even in those cases everything still goes through the batch system.

 

The clusters all use SLURM, which is widely used and open source: http://slurm.schedmd.com

 

Running jobs with SLURM

 

The normal way of working is to create a short script that describes what you need to do and submit this to the batch system using the "sbatch" command.

 

For example, here is a script to run a code called moovit:

 

#!/bin/bash
#SBATCH --workdir /scratch/gruyere/username/moovit-results
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --mem 4096
#SBATCH --time 12:30:00 
echo STARTING AT `date`
/home/username/code/moovit < /home/username/params/moo1
echo FINISHED AT `date`

 

Any line beginning with #SBATCH is a directive to the batch system (see "man sbatch" for the full list).

 

The six options shown are more or less mandatory and do the following:

 

--workdir /path/to/working/directory

This is the directory in which the job will be run and the standard output files written. This should ideally point to your scratch space.

 

--ntasks 1

This is the number of tasks (in the MPI sense) to run for the job

 

--cpus-per-task 1

This is the number of cores allocated to each of the aforementioned tasks

 

--nodes 1

 This is the number of nodes to use - on Castor this is limited to 1 but it's good practice to request it anyway! 

 

--mem 4096

The memory required per node, in MB

 

--time 12:00:00 # 12 hours

--time 2-6 # two days and six hours

The time required. There are a number of accepted formats; see "man sbatch" for the details.
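
For illustration, a few more of the accepted formats (the values are arbitrary):

--time 30 # 30 minutes
--time 45:30 # 45 minutes and 30 seconds
--time 1-12:30:00 # one day, twelve hours and thirty minutes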

 

If the time and memory are not specified then default values will be imposed, and these may well be lower than you require!
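
If in doubt, the default and maximum values for a partition can be queried with scontrol. A minimal sketch (the exact field names may vary slightly between SLURM versions):

scontrol show partition serial | grep -E "DefaultTime|MaxTime|DefMemPer"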

 

This script is saved as moojob1.run and in order to submit it we run the following command from one of the login nodes:

 

sbatch moojob1.run

 

The output will look something like:

 

[user@frontend]$ sbatch moojob1.run
Submitted batch job 123456 

The number returned is the Job ID and is the key to finding out further information or modifying the task. 
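
For example, with the Job ID one can quickly check on the job or, within the usual limits (a user may reduce but not extend a time limit), modify a pending job with scontrol:

squeue -j 123456
scontrol update JobId=123456 TimeLimit=06:00:00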

 

Cancelling Jobs

 

To cancel a specific job:

scancel <job id>

To cancel all your jobs (use with care!):

scancel -u username

To cancel all your jobs that are not yet running:

scancel -u username -t PENDING 

 

 

Getting job information

 

There are a number of different tools that can be used to query jobs, depending on exactly what information is needed. If the name of a tool begins with a capital S it is a SCITAS-specific tool; any tool whose name starts with a lowercase s is part of the base SLURM distribution.

 

Squeue

Squeue shows information about all your jobs, whether running or pending.

[bob@machine]$ Squeue
     JOBID         NAME  ACCOUNT       USER NODE  CPUS  MIN_MEMORY     ST       REASON           START_TIME             NODELIST
    123456         run1   scitas        bob    6    96       32000      R         None  2015-10-30T04:18:37      r04-node[32-37]
    123457         run2   scitas        bob    6    16       32000     PD   Dependency                  N/A

 

 

squeue

 

By default squeue will show you all the jobs from all users. This information can be modified by passing options to squeue.

 

To see all the running jobs from the scitas group we run: 

 

[user@machine ~]$ squeue -t R -A scitas
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            123456  parallel  gromacs      bob  R      48:43      6 r04-node[32-37]
            123457  parallel     pw.x      sue  R   18:06:44      8 r01-node[03,11,21],r04-node[50,61-64]

See "man squeue" for all the options.

 

For example, the Squeue command described above is actually a script that calls:

 

squeue -u $USER -o "%.10A %.12j %.8a %.10u %.4D %.5C %.11m %.6t %.12r %.20S %.20N" -S S

 

scontrol

 

scontrol will show you everything that the system knows about a running or pending job.

 

scontrol -d show job <job id>

 

[user@castor jobs]$ scontrol -d show job 400
JobId=400 Name=s1.job
UserId=user(100000) GroupId=scitas(11902)
Priority=111 Account=scitas QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:03:39 TimeLimit=00:15:00 TimeMin=N/A
SubmitTime=2014-03-06T09:45:27 EligibleTime=2014-03-06T09:45:27
StartTime=2014-03-06T09:45:27 EndTime=2014-03-06T10:00:27
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=serial AllocNode:Sid=castor:106310
ReqNodeList=(null) ExcNodeList=(null)
NodeList=c03
BatchHost=c03
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
Nodes=c03 CPU_IDs=0 Mem=1024
MinCPUsNode=1 MinMemoryCPU=1024M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/user/jobs/s1.job
WorkDir=/scratch/scitas/user

 

Sjob

 

Sjob is particularly useful to find out information about jobs that have finished.

 

[user@deneb2 jobs]$ Sjob  296176
       JobID    JobName    Cluster    Account  Partition  Timelimit      User     Group 
------------ ---------- ---------- ---------- ---------- ---------- --------- --------- 
296176           s5.job      deneb   scitas      debug   00:10:00      user    scitas 
296176.batch      batch      deneb   scitas                                           
296176.0          sleep      deneb  scitas                                           
296176.1          sleep      deneb  scitas                                           
             Submit            Eligible               Start                 End 
------------------- ------------------- ------------------- ------------------- 
2015-10-26T15:10:37 2015-10-26T15:10:37 2015-10-26T15:10:38 2015-10-26T15:10:59 
2015-10-26T15:10:38 2015-10-26T15:10:38 2015-10-26T15:10:38 2015-10-26T15:10:59 
2015-10-26T15:10:38 2015-10-26T15:10:38 2015-10-26T15:10:38 2015-10-26T15:10:49 
2015-10-26T15:10:49 2015-10-26T15:10:49 2015-10-26T15:10:49 2015-10-26T15:10:59 
   Elapsed ExitCode      State 
---------- -------- ---------- 
  00:00:21      0:0  COMPLETED 
  00:00:21      0:0  COMPLETED 
  00:00:11      0:0  COMPLETED 
  00:00:10      0:0  COMPLETED 
     NCPUS   NTasks        NodeList    UserCPU  SystemCPU     AveCPU  MaxVMSize 
---------- -------- --------------- ---------- ---------- ---------- ---------- 
        16               r02-node01  00:00.053  00:00.066                       
        16        1      r02-node01  00:00.038  00:00.034   00:00:00    395556K 
         1        1      r02-node01  00:00.008  00:00.014   00:00:00    210292K 
         1        1      r02-node01  00:00.005  00:00.016   00:00:00    210292K 

 

Modules and Provided software 

 

Modules (Lmod) is a utility that allows multiple, often incompatible, tools and libraries to coexist on a cluster. Scientific tools and libraries are provided as modules and you can see what is available by running "module avail":

 

$ module avail
-------------- /path/to/base/modules ---------------  
cmake  gcc  intel  matlab  

 

Initially you will only see the base modules - these are either compilers or stand-alone packages such as MATLAB. In order to see more modules, including libraries and MPI distributions, you need to load a compiler:

 

$ module load gcc

$ module avail
--------------- /path/to/gcc/modules ----------------  
gdb  fftw  hdf5  mvapich2  openmpi  python R

-------------- /path/to/base/modules ---------------  
cmake  gcc  intel  matlab  

 

The full guide on how to use modules can be found here.

 

In your submission script we strongly recommend that you begin with a "module purge" and then load the modules you need, so as to ensure that you always have the correct environment.
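
As a minimal sketch (the requested resources and module names are only examples taken from the listings above), the top of a submission script would then look like:

#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time 00:30:00
module purge
module load gcc
module load fftw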

 

Examples of submission scripts

 

There are a number of examples available in our Git repository. To download them, run the following command from the clusters:

git clone https://<gaspar-username>@git.epfl.ch/repo/scitas-examples.git

Enter the directory scitas-examples and choose the example to run by navigating the folders. We have three categories of examples: Basic (examples to get you started), Advanced (including hybrid jobs and job arrays) and Modules (specific examples of installed software).
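
For example, to browse the basic examples (category folder names as described above; the individual examples inside may change over time):

cd scitas-examples
ls        # shows the Basic, Advanced and Modules folders described above
cd Basic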

 

To run an example (here: hybrid HPL), do

sbatch --partition=debug hpl-hybrid.run

or, if you do not wish to run on the debug partition,

sbatch hpl-hybrid.run

 

 

Running MPI jobs

 

MPI is the acronym for Message Passing Interface and is now the de facto standard for distributed-memory parallelisation.

It is an open standard, currently at version 3, with multiple implementations.

These MPI flavours all comply with the specification and each claims some advantage over the others; some are vendor-specific and others are open source.

 

On the SCITAS clusters we only support the following compiler/MPI combinations (July 2016 until July 2017):

 

Intel Composer 2016 with Intel MPI 2016

GCC 5.3 with MVAPICH2 version 2.2

GCC 5.3 with OpenMPI version 1.10

 

This is a SCITAS restriction to prevent chaos - nothing technically stops one from mixing! All of the supported combinations work well and have good performance.

 

If we have an MPI code we need some way of launching it correctly across multiple nodes. To do this we use srun, which is SLURM's built-in job launcher:

 

srun mycode.x

 

To specify the number of ranks and nodes, we add the relevant #SBATCH directives to the job script. For example, to launch our code on 4 nodes with 16 ranks per node we specify:

 

#!/bin/bash
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 16
#SBATCH --cpus-per-task 1
#SBATCH --mem 32000
#SBATCH --time 1-0
module purge
module load mycompiler
module load mympi
srun /home/bob/code/mycode.x

There is no need to specify the number of ranks when you call srun!

 

 

Running OpenMP jobs

 

When running an OpenMP or hybrid OpenMP/MPI job, the important thing is to set the number of OpenMP threads per process via the variable OMP_NUM_THREADS. If this is not specified, it often defaults to the number of processors in the system.

We can integrate this with SLURM as shown in the following hybrid (4 ranks, 4 threads per rank) example:

 

#!/bin/bash
#SBATCH --ntasks 4
#SBATCH --cpus-per-task 4
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun mycode.x

This takes the environment variable set by SLURM and assigns its value to OMP_NUM_THREADS.

If you run such hybrid jobs we advise you to read the page on CPU affinity.

 

The Debug Partition 

 

All the clusters have a few nodes that only allow short jobs; these are intended to give you quick access for debugging jobs or quickly testing input files.

To use these nodes you can either add the #SBATCH -p debug directive to your job script or specify it on the command line:

 

sbatch -p debug myjob.run
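
The equivalent directive inside the job script is:

#SBATCH -p debug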

 

Please note that the debug nodes must not be used for production runs of short jobs. Any such use will result in access to the clusters being revoked. 

 

Interactive Jobs

 

There are two main methods of getting interactive (rather than batch) access to the machines. They have different use cases and advantages.

 

 

Sinteract

 

The Sinteract command allows one to log onto a compute node and run applications directly on it. This can be especially useful for graphical applications such as MATLAB and COMSOL.

 

[user@frontend ]$ Sinteract
Cores: 1
Time: 00:30:00
Memory: 4G
Partition: serial
Jobname: interact
salloc: Granted job allocation 579438
salloc: Waiting for resource configuration
salloc: Nodes z15 are ready for job
[user@z15 ]$ 

Please note that to use a graphical application you must have connected to the login node with "ssh -Y".  Sinteract can also be used with the debug partition if appropriate. 
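
For example (replace the placeholder with the hostname of your cluster's login node):

ssh -Y username@<cluster frontend>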

 

salloc

 

salloc creates an allocation on the system that you can then access via srun. It allows one to run multi-node MPI jobs in an interactive manner and is very useful for debugging problems with such tasks.

 

[user@frontend ]$ salloc -N 2 -n 2 --mem 2048 
salloc: Granted job allocation 579440
salloc: Waiting for resource configuration
salloc: Nodes z[17,18] are ready for job

[user@frontend ]$ hostname
frontend

[user@frontend ]$ srun hostname
z17
z18

[user@frontend ]$ exit
salloc: Relinquishing job allocation 579440

 

Getting Help

 

If you have problems, please see our page on how to ask for help.

 

SCITAS also offers a wide range of training courses covering all aspects of scientific and high-performance computing. The list is available here.