HPC cluster

Since 2013, the LSS has maintained a 36-CPU high-performance cluster. This page gives general information and hints on using the machine. For problem reports, questions, or suggestions, please send mail to cs10-support@fau.de.

Hardware

The cluster consists of 8 compute nodes, 1 file server (11 TB), and 1 visualization node. All compute nodes share the following specifications:

  • 4 x Intel(R) Xeon(R) CPU E7-4830, 2.13 GHz – 2.4 GHz (max. turbo) (8 cores + SMT), SSE 4.1/4.2, 24 MB shared cache
  • 256 GB RAM
  • 2 x 300 GB SAS internal disks
  • NVIDIA GeForce GTX 680
  • QDR Infiniband network

The visualization node contains an additional NVIDIA Quadro K5000 graphics card.

Access

To log in to the front end i10hpc.informatik.uni-erlangen.de, you need a valid LSS account. Access to the cluster is granted via SSH key authentication.
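One possible way to set up key-based login on the client side is sketched below. The helper name hpc_ssh_config, the host alias i10hpc, and the key file name id_lss are illustrative assumptions. How the public key is registered with your LSS account is not covered here; ask cs10-support@fau.de if in doubt.

```shell
# Print an ssh client configuration stanza for the cluster front end.
# Assumptions: the alias "i10hpc" and the default key path are illustrative.
# Usage: hpc_ssh_config <login> [path to private key]
hpc_ssh_config() {
    login=$1
    keyfile=${2:-$HOME/.ssh/id_lss}
    printf 'Host i10hpc\n'
    printf '    HostName i10hpc.informatik.uni-erlangen.de\n'
    printf '    User %s\n' "$login"
    printf '    IdentityFile %s\n' "$keyfile"
}

# Typical use (not executed here):
#   ssh-keygen -f ~/.ssh/id_lss                 # create a key pair
#   hpc_ssh_config "$USER" >> ~/.ssh/config     # add the stanza
#   ssh i10hpc                                  # log in via the alias
```

With such a stanza in place, tools that build on ssh (scp, rsync) can use the same alias.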

Environment

Software

Because multiple combinations of compilers and libraries are available, the Environment Modules package is provided for easy switching between them. A full description of the Modules project can be found on its SourceForge homepage.

To get a list of available modules type:

module avail

To load a module, simply issue the command:

module load <name of module>

To unload the module use:

module unload <name of module>

To see what a module does when it is loaded, use the command:

module show <name of module>

A list of all currently loaded modules can be obtained with:

module list
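In scripts it can be handy to test whether a module is already loaded before loading it. The sketch below assumes that the Environment Modules package maintains the colon-separated variable LOADEDMODULES, as it commonly does. The helper name module_is_loaded is hypothetical.

```shell
# Return success if a module named $1 (with or without a version suffix)
# is currently loaded. Relies on LOADEDMODULES, the colon-separated list
# that the Environment Modules package keeps in the environment.
module_is_loaded() {
    case ":${LOADEDMODULES:-}:" in
        *":$1:"* | *":$1/"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use in a job script (not executed here):
#   module_is_loaded openmpi || module load openmpi/4.0.0-gnu
```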

Filesystems

When logging in, you will find yourself in your LSS home directory. In addition, all cluster nodes mount a shared filesystem on /scratch, which contains a directory /scratch/<login> for each user. Because it runs over Infiniband, this filesystem provides higher bandwidth and more space than other shares at LSS, so simulations should store their data there. Each node also has a limited amount of local disk space mounted on /local, which is not shared between the nodes.
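A simple way to keep simulation output on the shared scratch filesystem is to create one directory per run. The following is a minimal sketch. The helper name make_run_dir and the variable SCRATCH_ROOT are hypothetical, and the login name is assumed to match the output of "id -un".

```shell
# Create a fresh, time-stamped run directory below the shared scratch area
# and print its path. SCRATCH_ROOT defaults to /scratch/<login>; it can be
# overridden, for example to try the helper on another machine.
# Usage: make_run_dir <name>
make_run_dir() {
    name=$1
    root=${SCRATCH_ROOT:-/scratch/$(id -un)}
    dir="$root/$name-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Typical use (not executed here):
#   run=$(make_run_dir example) && cd "$run" && ./executable > output.log
```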

Usage

Interactive work

The front end (login node) is i10hpc.informatik.uni-erlangen.de. Here you can compile your code, make short test runs, and submit jobs to the queueing system. For long-running jobs, please use the queueing system. In case of abuse, jobs will be terminated without warning.

Batch-System

All jobs with moderate or high computation times have to be submitted to the batch system. The batch system is able to handle serial and parallel jobs. Jobs can be submitted only from the login node i10hpc. For that purpose a job script is necessary, which specifies all requirements of the job.

This script is submitted to the queueing system by:

qsub <name of script>

The queueing system returns the identifier of the job. To see all queued jobs, the command

qstat -a

can be used. Queued jobs may also be deleted by the user who submitted them. In this case, the leading number of the job identifier has to be used:

qdel <job id>
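Since qdel expects only the leading number, a small helper can extract it from the identifier printed by qsub. This is a sketch that assumes identifiers of the form <number>.<server>, as is common for PBS/Torque installations. The helper name job_number and the script name job.sh are hypothetical.

```shell
# Strip the server suffix from a job identifier, e.g. "1234.i10hpc" -> "1234".
# The identifier format "<number>.<server>" is an assumption.
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Typical use (not executed here):
#   id=$(qsub job.sh)              # qsub prints the identifier of the new job
#   qdel "$(job_number "$id")"     # cancel it again using the leading number
```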

By using the command

qsub -W depend=afterany:<wait for job_id> <name_of_script>

you can make the submitted job script wait until the job with id <wait for job_id> has finished.

The following queues are configured:

Queue   default walltime  max. walltime  max. number of nodes  type of nodes  remark
------  ----------------  -------------  --------------------  -------------  ------------------------------------------
devel   10 min            1 h            4                     compute-nodes  only 1 job per user can run at a time;
                                                                              2 nodes reserved for devel
                                                                              from Mon to Fri, 8:00 to 20:00
normal  1 h               24 h           5                     compute-nodes  no interactive jobs allowed
big     1 h               48 h           9                     compute-nodes  only runs at night and on the weekends;
                                                                              no interactive jobs allowed
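The limits in the table above can be checked before submission with a few lines of shell. The sketch below only encodes the maximum walltime and node count per queue. It does not reproduce the routing logic of the batch system or the time windows mentioned in the remarks, and the helper name fits_queue is hypothetical.

```shell
# Check a resource request against the queue limits listed in the table.
# Usage: fits_queue <queue> <walltime in minutes> <number of nodes>
# Returns 0 if the request is within the limits, 1 otherwise.
fits_queue() {
    case $1 in
        devel)  max_min=60;   max_nodes=4 ;;   # 1 h, 4 nodes
        normal) max_min=1440; max_nodes=5 ;;   # 24 h, 5 nodes
        big)    max_min=2880; max_nodes=9 ;;   # 48 h, 9 nodes
        *)      return 1 ;;                    # unknown queue
    esac
    [ "$2" -le "$max_min" ] && [ "$3" -le "$max_nodes" ]
}

# Typical use (not executed here):
#   fits_queue normal 480 2 || echo "request exceeds the limits of 'normal'"
```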

If you do not specify a queue, your job is automatically routed to the most appropriate queue. The following sections contain example job scripts for different applications.
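The dependency option described above can be used to submit several scripts as a chain, where each job waits for the previous one. A minimal sketch follows. The helper name submit_chain and the script names are hypothetical, and it assumes that qsub prints the identifier of the new job on standard output.

```shell
# Submit the given job scripts as a chain: each job starts only after the
# previous one has ended (depend=afterany). Prints one job identifier per line.
# Usage: submit_chain <script> [<script> ...]
submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # first job: no dependency
            prev=$(qsub "$script") || return 1
        else
            # later jobs: wait for the job submitted just before
            prev=$(qsub -W "depend=afterany:$prev" "$script") || return 1
        fi
        printf '%s\n' "$prev"
    done
}

# Typical use (not executed here):
#   submit_chain preprocess.sh simulate.sh postprocess.sh
```

Because afterany is used, a job in the chain starts regardless of whether its predecessor succeeded or failed.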

Serial job

This is a job script for starting a serial job. After 2 hours of wall clock time, the job is terminated by the queueing system (the default is 1 hour). Mail notifications are sent to <login>@fau.de when the job starts, finishes, or aborts. All essential environment variables have to be set before the executable is started.

#!/bin/bash -l
#PBS -l nodes=1:ppn=32
#PBS -l walltime=02:00:00
#PBS -q normal
#PBS -M <login>@fau.de -m abe
#PBS -N jobname
#PBS -o /scratch/<login>/jobname.o
#PBS -e /scratch/<login>/jobname.e

export OMP_NUM_THREADS=1
cd /scratch/<login>/example
./executable
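The walltime request uses the format HH:MM:SS. If run times are estimated in minutes, a helper such as the following can produce the matching string. This is only a sketch, and the helper name walltime_from_minutes is hypothetical. It relies on the common PBS behaviour that resource requests given on the qsub command line take precedence over the #PBS lines in the script.

```shell
# Convert a run time in minutes to the HH:MM:SS format used by
# "#PBS -l walltime=...". Expects a plain decimal number without leading zeros.
walltime_from_minutes() {
    printf '%02d:%02d:00\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}

# Typical use (not executed here):
#   qsub -l walltime="$(walltime_from_minutes 120)" job.sh
```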

Parallel job

The number of nodes is set to 2 (with 32 cores each). After 8 hours of wall clock time, the job will be terminated by the queueing system. Do not forget to start the script as a login shell (#!/bin/bash -l) to be able to use the module command in batch scripts.

#!/bin/bash -l
#PBS -l nodes=2:ppn=32
#PBS -l walltime=08:00:00
#PBS -q normal
#PBS -M <login>@fau.de -m abe
#PBS -N jobname
#PBS -o /scratch/<login>/jobname.o
#PBS -e /scratch/<login>/jobname.e

module load openmpi/4.0.0-gnu
cd /scratch/<login>/example
mpirun -np 64 ./executable
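Instead of hard-coding the process count, it can be derived from the node file of the job. The sketch below assumes a PBS/Torque setup in which the file named by $PBS_NODEFILE contains one line per requested processor slot (nodes x ppn); this may differ on other installations. The helper names count_slots and count_nodes are hypothetical.

```shell
# Count the processor slots assigned to the job (non-empty lines of the
# node file). Assumption: one line per slot, i.e. nodes x ppn lines.
count_slots() {
    grep -c . "$1"
}

# Count the distinct nodes assigned to the job.
count_nodes() {
    sort -u "$1" | grep -c .
}

# Typical use inside a job script (not executed here):
#   mpirun -np "$(count_slots "$PBS_NODEFILE")" ./executable
```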

Honouring NUMA / Pinning / Hybrid jobs

Each of the four processors in a node has its own memory bus, so placing data close to the cores that process them is a prerequisite for good performance. Consequently, Linux tries to allocate RAM close to the core that caused the respective page fault. But as threads and processes can migrate, this beneficial placement is not guaranteed to last.

There are various ways to prevent threads from roaming. Code compiled with Intel compilers can be instructed to do so through the KMP_AFFINITY environment variable. Alternatives that do not need compiler support are likwid, numactl and taskset (all of which are available on the cluster). See the respective man pages for details.

OpenMPI provides some pinning support through the --bind-to-node and --bind-to-socket command line options of mpirun.

Hybrid jobs (OpenMP+MPI) usually run best with one MPI process per NUMA node, which on this cluster means one per socket. OpenMPI can take care of starting the right number of processes and pinning them to sockets if it knows the cluster's topology, which must be given as command line options to mpirun (see example below). Setting the number of OpenMP worker threads manually may be a good idea, even though most OpenMP runtimes are likely to use as many worker threads as there are logical processors in their cpuset, in this case the socket.
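For programs that are not started through mpirun, numactl can bind both the threads and the memory allocations to one socket. The following is a sketch under the assumption that NUMA node numbers 0 to 3 correspond to the four sockets of a compute node, which can be checked with "numactl --hardware". The helper name run_on_socket and the variable DRY_RUN are hypothetical.

```shell
# Run a command with its CPUs and memory bound to one NUMA node (socket).
# With DRY_RUN set to a non-empty value, the command line is only printed.
# Usage: run_on_socket <numa node> <command> [arguments ...]
run_on_socket() {
    socket=$1
    shift
    # Prepend the numactl invocation to the command line.
    set -- numactl "--cpunodebind=$socket" "--membind=$socket" "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Typical use (not executed here):
#   OMP_NUM_THREADS=8 run_on_socket 0 ./executable
```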

Pinning processes to a socket is usually sufficient:

mpirun -np 64 --bind-to-socket ./pure-mpi-program

The following command runs a hybrid job with one process per socket and one worker thread per core. --npersocket implies --bind-to-socket. Some prefer to run "export OMP_NUM_THREADS=8" separately.

OMP_NUM_THREADS=8 mpirun --npersocket 1 -mca orte_num_sockets 4 -mca orte_num_cores 8 ./hybrid-program
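The resulting layout follows directly from the hardware: with one MPI process per socket and one thread per physical core, a job on N nodes runs N x 4 processes with 8 threads each. The small sketch below does this arithmetic. The helper name hybrid_layout is hypothetical, and the defaults of 4 sockets and 8 cores are taken from the hardware description.

```shell
# Print the process/thread layout for a hybrid run with one MPI process per
# socket and one OpenMP thread per physical core.
# Usage: hybrid_layout <nodes> [sockets per node] [cores per socket]
hybrid_layout() {
    nodes=$1
    sockets=${2:-4}
    cores=${3:-8}
    echo "ranks=$(( nodes * sockets )) threads_per_rank=$cores total_threads=$(( nodes * sockets * cores ))"
}

# Example: two nodes of this cluster
#   hybrid_layout 2
```

For two nodes this gives 8 MPI processes with 8 threads each, which matches the 64 cores requested with nodes=2:ppn=32.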