General FAQ

Why can't I connect to the clusters from home?

What's the maximum run time of a job? 

Where is my scratch space?

Can you recover an important file that was on my scratch area?

I've deleted a file on /home or /work - How can I recover it? 

How do I submit a job that requires a run time of more than one/three days? 

Can I submit array jobs and, if so, how?

Is it safe to share nodes with other users?

Is there a debug queue?

I have premium and I have run on the debug partition. Do I have to pay for debug time?

What is a <job id>?

How to display /scratch quota and usage information?

How to display quota and usage information for the /home and /work file systems?

Why do I get the error "module: command not found"?

Which options should I use to link with the Intel MKL?

Why am I asked for a password while sshing from the frontend to a node?

Which MPI flavours are supported on the clusters?

What compilers/MPI combination do you support?

Why is "Premium" not an option for Bellatrix? 

 

 

 

 

Bellatrix FAQ

How many nodes are there in Bellatrix?

What are the characteristics of a node in Bellatrix?

 

Castor FAQ

How many nodes are there in Castor?

What are the characteristics of a node in Castor?

Can I run my MPI job on Castor?

Why doesn't Castor have Infiniband?

What's the maximum run time of a job on Castor?

How do I submit a job that requires a run time of more than three days? 

How do I use one of the nodes with 256GB of memory?

 

Deneb FAQ

Why do I need to ask for access to the GPU nodes?

How do I submit jobs to the GPU nodes?

How do I use one of the nodes with more than 64GB of memory?

How do I ask to use a specific processor type (Ivy Bridge or Haswell)?

 


 

General FAQ

 

Q. Why can't I connect to the clusters from home?

A. You can, but doing so requires connecting via the EPFL VPN service. See http://network.epfl.ch/vpn for how to use this service.

Users who prefer a command line tool might also wish to consider the tremplin SSH proxy tunnel service: http://tremplin.epfl.ch/ssh.html. You can find the Linux and Windows procedure here.

Q. What's the maximum run time of a job?

A. The maximum wall time allowed depends on the cluster - please see the cluster specific FAQs below. You should also be aware that if a maintenance period is scheduled, any job whose requested walltime would make it finish after the start of that period will not run until the maintenance has ended.

Q. Where is my scratch space?

A. /scratch/<user name> - e.g. /scratch/jmenu

Q. Can you recover an important file that was on my scratch area?

A. NO. /scratch is not backed up so the file is gone forever. Please note that we automatically delete files on scratch to prevent it from filling up!

Q. I've deleted a file on /home or /work - How can I recover it?

A. If it was deleted in the last seven days then you can use the daily snapshots to get it back. These can be found at:

  • /home/.snapshots/<date>/<username>/

  • /work/.snapshots/<date>/<laboratory or group>/

e.g. /home/.snapshots/2015-11-11/bob/

The home filesystem is backed up to tape so if the file was deleted more than a week ago we may be able to help. The work filesystem is not backed up by default.
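
For example, to restore a hypothetical file called results.dat that was deleted from your home directory, you would simply copy it back from the corresponding snapshot:

cp /home/.snapshots/2015-11-11/bob/results.dat /home/bob/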


Q. How do I submit a job that requires a run time of more than three days? 

A. If you have access to a pure share then you can ask for the "week" or "month" QoS:

sbatch --qos week myjobscript (or --qos month), or add #SBATCH --qos week to your job script.
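
As an illustration, a minimal job script asking for the week QoS might look like the following (the run time, resources and program name are purely illustrative):

#!/bin/bash
#SBATCH --qos week
#SBATCH --time 5-00:00:00
#SBATCH --nodes 1

srun ./my_long_simulation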

Users who have not paid are limited to 24 hours. 

If your group has purchased shares or premium then please contact your local computing co-ordinator who will in turn contact SCITAS if necessary.

Q. Can I submit array jobs and, if so, how?

A. Yes, with the --array option to sbatch. See http://slurm.schedmd.com/job_array.html for the official documentation and our scitas-examples git repository for several examples.
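
As a minimal sketch (the program and input file names are hypothetical), the following script runs the same executable on ten input files, one per array task:

#!/bin/bash
#SBATCH --array 1-10
#SBATCH --time 01:00:00

srun ./my_program input_${SLURM_ARRAY_TASK_ID}.dat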

Q. Is it safe to share nodes with other users?

A. Yes!  We use cgroups to limit the amount of CPU and memory assigned to users so there is no way for users to adversely affect each other.

Q. Is there a debug queue?

A. Not as such. In SLURM the concept of queues doesn't exist (just like at lunchtime at the Ornithorynque), so instead there is a debug partition which gives priority access for debugging:

sbatch --partition debug myjobscript

The limits on the debug partition vary by cluster but in general the maximum run time is 30 minutes to one hour and users are only allowed one job at a time. Interactive jobs are allowed.
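
For example, a 30 minute interactive session with a single task on the debug partition (the values here are only illustrative) can be requested with:

salloc --partition debug --time 00:30:00 --ntasks 1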

Q. I have premium and I have run on the debug partition. Do I have to pay for debug time?

A. No. Debug time is free of charge.

Q. What is a <job id>?

A. It's the unique numerical identifier of a job and is given when you submit the job:

[eroche@castor jobs]$ sbatch s1.job
Submitted batch job 400

It can also be seen using squeue:

[eroche@castor jobs]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
400 serial s1.job eroche R INVALID 1 c03
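
The <job id> is what you pass to other SLURM commands when you want to act on a specific job, for example to check its status or to cancel it:

squeue -j 400
scancel 400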

Q. How to display scratch quota and usage information?

A. There are no quotas on scratch; instead, files older than 2 weeks may be deleted without notice as the filesystems fill up. You can, however, see the scratch usage for Aries, Bellatrix and Deneb using the fsu command:

fsu /scratch

The scratch usage information for Castor can be found here, or by executing the following command on Castor:

df -h /scratch

Q. How to display quota and usage information for the /home and /work file systems?

A. /home: to get the user quotas and file system usage for your group members, use the following command:

fsu -q /home

You can also see an overview of /home usage and quota here.

/work: to get the group quota and file system usage for your group, use the following command:

fsu -q /work

You can also see an overview of /work file system usage and quota here.

Q. Why do I get the error "module: command not found"?

A. This is because you have tcsh as your login shell and the environment isn't propagated to the compute nodes.

In order to fix the issue please change the first line of your job script as follows:

#!/bin/tcsh -l

The -l option tells tcsh to launch a login shell, which correctly sources the files in /etc/profile.d/.

Q. Which options should I use to link with the Intel MKL?

A. Ask the Intel Math Kernel Library Link Line Advisor.

If you use the Intel compilers then you can pass the -mkl flag which will do the hard work for you.
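
As a rough sketch of what the advisor produces, a sequential MKL link line for GCC on a 64-bit system might look like the following (the exact list of libraries depends on your compiler, threading model and MKL version, so always check the advisor):

gcc myprog.c -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm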

Q. Why am I asked for a password while sshing from the frontend to a node?

A. Once logged in to a frontend of a cluster, you can ssh directly to the node(s) running your job(s). You can avoid being asked for your Gaspar password again by creating a passwordless ssh key.

You only need to run the following commands once, on any one of the clusters:

ssh-keygen -t rsa
ssh-copy-id -i .ssh/id_rsa.pub localhost

Q. Which MPI flavours are supported on the clusters?

A. SCITAS supports IntelMPI and MVAPICH2. Two recent versions of each will be installed at any one time. 

Q. What compilers/MPI combination do you support?

A. SCITAS supports the Intel compilers with Intel MPI (fully proprietary) or the GCC compilers with MVAPICH2 (fully free). Other combinations are not supported.

Q. Why is "Premium" not an option for  Bellatrix? 

A. Bellatrix has been completely allocated to users as shares so there is no spare capacity to allow people to purchase compute cycles under the premium model. 

 

 

 

Cluster / Partition specific FAQ

 

Bellatrix

Q. How many nodes are there in Bellatrix?

A. There are 424 compute nodes and one login node.

Q. What are the characteristics of a node in Bellatrix?

A. Bellatrix nodes have two 8 core Intel(R) Xeon(R) E5-2660 processors running at 2.2GHz. The nodes have 32 GB of memory and are interconnected with QDR Infiniband.

Q. How do I submit a job that requires a run time of more than three days? 

A. This depends on what type of account your group/laboratory has. 

If you have access to a pure share then you can ask for the "week" QoS: sbatch --qos week myjobscript, or add #SBATCH --qos week to your job script.

If your group has "virtually private" nodes then please contact your local computing co-ordinator who will in turn contact SCITAS if necessary.

 

Castor

Q. How many nodes are there in Castor?

A. There are 52 compute nodes and one login node.

Q. What are the characteristics of a node in Castor?

A. Castor nodes have two 8 core Intel E5-2650 processors running at 2.6GHz. 50 of the nodes have 64GB of memory and two have 256GB.

Q. Can I run my MPI job on Castor?

A. As long as it stays within a node then you are free to use MPI. Inter-node MPI is not the goal of this cluster. Jobs that request more than one node will be refused by the scheduler.
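
For example, a 16-rank MPI job confined to a single node (the task count and script name are illustrative) could be submitted as:

sbatch --nodes 1 --ntasks 16 myjobscript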

Q. Why doesn't Castor have Infiniband?

A. Castor was, from the very beginning, intended to run serial codes. As such, a low latency interconnect serves no purpose and would add cost and maintenance problems. The storage runs over 10 gigabit ethernet which is more than sufficient. 

Q. What's the maximum run time of a job on Castor?

A. If you have a free account it's 24 hours. For premium and share accounts it's 3 days but you can ask to run for longer by contacting us and explaining why you need to run for more than 3 days.

Q. How do I submit a job that requires a run time of more than three days? 

A. Premium and share accounts that have been granted permission to do so can add the "--qos=week" flag to ask for up to 7 days.

Q. How do I use one of the nodes with 256GB of memory?

A. Specify the amount of memory required with "--mem <quantity in MB>" either on the command line or in your job script.
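
For example, to ask for 200GB (an illustrative figure that only the 256GB nodes can satisfy) on the command line:

sbatch --mem 200000 myjobscript

or in the job script itself:

#SBATCH --mem 200000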

Deneb

Q. Why do I need to ask for access to the GPU nodes?

A. In order to use the GPU nodes we request that you submit a description of the code you wish to use and the performance benefits expected. You will then be invited to meet our application and GPU experts to discuss your proposal. This is in order to ensure that your code will make the best possible use of the resources and that you understand the features and limitations of the nodes. Non-paying access is limited to a maximum run time of 12 hours and one task at a time.

Q. How do I submit jobs to the GPU nodes?

A. If you have been granted "free" access then you need to pass the options "--partition=gpu --qos=gpu_free --gres=gpu:X" to sbatch, where X is the number of GPUs per node required. Users who have paid for a share on the GPU partition should use the gpu QoS: "--qos=gpu".
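
For example, a job needing one GPU per node under free access could be submitted as follows (the script name is hypothetical):

sbatch --partition=gpu --qos=gpu_free --gres=gpu:1 myjobscript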

Q. How do I use one of the nodes with more than 64GB of memory?

A. Specify the amount of memory required with "--mem <quantity in MB>" either on the command line or in your job script.

Q. How do I ask to use a specific processor type (Ivy Bridge or Haswell)?

A. For Ivy Bridge please give the option "--constraint=E5v2" and for Haswell "--constraint=E5v3". If you do not specify a constraint the job may run on either, but a multi-node job will never span both architectures.
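
For example, to restrict a job to the Haswell nodes, add the following to your job script (or pass it on the sbatch command line):

#SBATCH --constraint=E5v3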