Running calculations

Introduction

Jobs on HORDA (both batch jobs and interactive sessions) should be run through the Slurm resource manager. For a quick overview of Slurm, you can refer to the video: link

Information

Slurm details:

Three partitions are available: ogr (CPU-only nodes), troll (GPU nodes), and all (all nodes).
Execution time is limited to 28 days on each partition.

Example

To get information on the jobs that are currently running or waiting in the queue, run squeue:

~$ squeue 
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
87719       troll interact username  R 11-18:07:21      1 troll-1
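On a busy cluster the full queue can be long, so it often helps to narrow the listing; squeue -u "$USER" shows only your own jobs. The sketch below goes one step further and filters squeue-style text for running jobs. It is only an illustration: running_jobs is a hypothetical helper name, the sample text is copied from the output above so that the snippet also runs without Slurm, and the column positions assume the default squeue layout with no spaces in job names.

```shell
# Sample text copied from the squeue output above; on the cluster you would
# pipe the real command instead, for example: squeue -u "$USER" | running_jobs
sample='JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
87719       troll interact username  R 11-18:07:21      1 troll-1'

# running_jobs (hypothetical helper): read squeue-style text on stdin and
# print "JOBID NODELIST" for every job whose state column (ST, field 5) is R.
running_jobs() {
    awk 'NR > 1 && $5 == "R" { print $1, $8 }'
}

running=$(printf '%s\n' "$sample" | running_jobs)
echo "$running"
```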

To get information on the Slurm partitions and their state, run sinfo:

~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up 28-00:00:0      3   idle ogr-[0-1],troll-1
troll        up 28-00:00:0      1   idle troll-1
ogr          up 28-00:00:0      2   idle ogr-[0-1]
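Before asking for resources it can be useful to see at a glance how many nodes are idle in each partition. The following is a minimal sketch, not a site-provided tool: idle_nodes is a hypothetical helper name, the sample text is copied from the sinfo output above so that the snippet runs without Slurm, and the field positions assume the default sinfo layout. The asterisk that sinfo appends to the default partition is stripped.

```shell
# Sample text copied from the sinfo output above; on the cluster you would
# pipe the real command instead: sinfo | idle_nodes
sample='PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up 28-00:00:0      3   idle ogr-[0-1],troll-1
troll        up 28-00:00:0      1   idle troll-1
ogr          up 28-00:00:0      2   idle ogr-[0-1]'

# idle_nodes (hypothetical helper): print "PARTITION: N" for every line
# whose STATE column (field 5) is idle, where N is the NODES column.
idle_nodes() {
    awk 'NR > 1 && $5 == "idle" { sub(/\*$/, "", $1); print $1 ": " $4 }'
}

idle=$(printf '%s\n' "$sample" | idle_nodes)
echo "$idle"
```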

Interactive sessions

HORDA can be used for interactive work with data, e.g. ad-hoc analyses and visualizations with Python and Jupyter notebooks. You can either start an interactive session directly with srun, or allocate resources with the salloc utility and then ssh into the allocated host and work there. You can tailor the allocation to your needs; the table below lists the main options, and examples follow. Both commands accept a similar set of options:

Argument	Description
-n	Number of cores allocated for the job; strictly, the number of tasks, each of which gets one core by default (default = 1, max = 36)
--gres	Number of GPUs allocated for the job (default = none; e.g. --gres=gpu for one GPU, --gres=gpu:2 for two)
--mem	Amount of memory allocated for the job on each node; add a unit suffix, e.g. --mem=12G, because Slurm reads a bare number as megabytes (default = 1 GB, max = 60 GB)
-w	Host or hosts to take the resources from (e.g. troll-1)

Example

To log in interactively to troll-1 with 2 GPUs, 8 cores, and a total of 12 GB of memory:

srun -n 8 -w troll-1 --gres=gpu:2 --mem=12G --pty bash

To obtain an allocation on troll-1 with 2 GPUs, 8 cores, and a total of 12 GB of memory:

salloc -n 8 -w troll-1 --gres=gpu:2 --mem=12G
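If you start interactive sessions often, a small shell function can assemble the command line from the options in the table. This is only a sketch: interactive_cmd is a hypothetical helper, not something installed on HORDA, and it prints the command instead of running it so you can inspect it first. It always writes the memory with a G suffix, because Slurm interprets a bare --mem number as megabytes.

```shell
# interactive_cmd CORES GPUS MEM_GB [HOST]   (hypothetical helper)
# Print, but do not run, an srun command line for an interactive shell.
interactive_cmd() {
    local cores=$1 gpus=$2 mem_gb=$3 host=${4:-}
    local cmd="srun -n $cores"
    if [ -n "$host" ]; then
        cmd="$cmd -w $host"              # pin the job to a specific host
    fi
    if [ "$gpus" -gt 0 ]; then
        cmd="$cmd --gres=gpu:$gpus"      # request GPUs only when needed
    fi
    cmd="$cmd --mem=${mem_gb}G --pty bash"
    printf '%s\n' "$cmd"
}

gpu_session=$(interactive_cmd 8 2 12 troll-1)
cpu_session=$(interactive_cmd 4 0 8)
echo "$gpu_session"
echo "$cpu_session"
```

To actually start the session you could paste the printed line into your terminal.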

Important! Please remember to quit your interactive allocation when you're done with your work. You can do so by typing exit (or pressing CTRL+D).
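It is easy to lose track of whether a terminal is still inside an allocation. Slurm sets the SLURM_JOB_ID environment variable in the shell that srun or salloc starts, so a quick check is possible. The sketch below assumes that behaviour; in_allocation is a hypothetical helper name, and a shell reached by a separate ssh login may not carry the variable.

```shell
# in_allocation (hypothetical helper): succeed when this shell was started
# by Slurm, which is detected through the SLURM_JOB_ID variable.
in_allocation() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_allocation; then
    status="inside Slurm job $SLURM_JOB_ID; type exit when finished"
else
    status="not inside a Slurm job"
fi
echo "$status"
```

If an allocation is left behind, squeue -u "$USER" lists it, and scancel followed by its job ID releases it.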

Batch jobs

Longer, resource-demanding jobs should typically be scheduled in Slurm batch mode. Below is an example of a Slurm batch script that you can use to schedule a job:

Example

Suppose you have the following job.sh batch file:

#!/bin/bash
#SBATCH -p troll        # troll partition
#SBATCH -n 8            # 8 cores
#SBATCH --gres=gpu:1    # 1 GPU 
#SBATCH --mem=30GB      # 30 GB of RAM
#SBATCH -J job_name     # name of your job

your_program -i input_file -o output_path
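Two settings that the script above leaves at their defaults are the run time limit and the log file. The sketch below shows how they might be added with the standard sbatch options --time and -o. The two-day limit and the log file name are assumptions chosen for illustration, not HORDA requirements. The body prints a short job summary; the fallback values let the script also run outside Slurm for a quick check.

```shell
#!/bin/bash
#SBATCH --time=2-00:00:00     # assumed limit of 2 days (format D-HH:MM:SS)
#SBATCH -o job_name_%j.out    # assumed log file name; %j becomes the job ID

# Slurm exports these variables inside a job. The fallbacks after ":-" are
# used only when the script is run outside Slurm.
job_info="job ${SLURM_JOB_ID:-none} on ${SLURM_JOB_NODELIST:-local shell}"
echo "$job_info"

# The actual work would follow here, for example:
# your_program -i input_file -o output_path
```

A job that asks for less time than the 28-day partition limit may be easier for the scheduler to place, although that depends on how the cluster is configured.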

You can submit this job with the sbatch command:

~$ sbatch job.sh
Submitted batch job 1234
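The job ID printed by sbatch is what you need to follow or cancel the job later. A minimal sketch of capturing it in a script is shown below; job_id_from is a hypothetical helper, and the sample line is copied from the output above so that the snippet runs without Slurm.

```shell
# job_id_from (hypothetical helper): print the numeric ID found in an
# sbatch confirmation line such as "Submitted batch job 1234".
job_id_from() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# Sample line copied from the output above; on the cluster this would be
# output=$(sbatch job.sh)
output='Submitted batch job 1234'
job_id=$(job_id_from "$output")
echo "Job ID: $job_id"
```

With the ID in hand, squeue -j "$job_id" shows the state of that job and scancel "$job_id" cancels it. Alternatively, sbatch --parsable job.sh prints the ID without the surrounding text.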