
Submitting Jobs

The scheduler

HPC systems are usually composed of a large number of nodes and have many users working on the facility simultaneously. How do we ensure that the available resources are managed properly? This is the job of the scheduler, which is the heart and soul of the system and is responsible for managing the jobs that run on the cluster.

As an analogy, think of the scheduler as a waiter in a busy restaurant. This will hopefully give you some idea of why you sometimes have to wait for your job to run.

(Figure: the scheduler as a waiter in a busy restaurant)

CREATE uses the SLURM scheduler, which stands for Simple Linux Utility for Resource Management.

Partitions

You can think of partitions as queues - each sits over a specific set of resources and allows access to particular groups of users. Following the restaurant analogy, think of them as different sections of the restaurant, each with its own queue.

The public partitions are:

  • cpu: Partition for CPU jobs
  • gpu: Partition for GPU jobs
  • long_: Partitions for long-running jobs. These require justification and explicit permission to use
  • interruptible_: Partitions that use otherwise unused capacity on private servers

In addition, specific groups/faculties have their own partitions on CREATE HPC that can only be used by members of those groups. The list of CREATE partitions and who can use them can be found in our documentation. Additional information about the resource constraints can be found here.

You can get the list of partitions that are available to you via the sinfo --summarize command:

k1234567@erc-hpc-login1:~$ sinfo --summarize
PARTITION         AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
cpu*                 up 2-00:00:00        35/0/0/35 erc-hpc-comp[001-028,183-189]
gpu                  up 2-00:00:00        13/6/0/19 erc-hpc-comp[030-040],erc-hpc-vm[011-018]
interruptible_cpu    up 1-00:00:00       20/65/0/85 erc-hpc-comp[041-047,058-109,128-133,135,137,139-151,153-154,157,179-180]
interruptible_gpu    up 1-00:00:00       24/17/2/43 erc-hpc-comp[048-057,110-127,134,170-178,190-194]

Any additional rows you see in the output of sinfo will be private partitions you have access to.

Hint

The NODES(A/I/O/T) column shows node states in the form allocated/idle/other/total.

Submitting jobs

In most cases you will be submitting non-interactive jobs, commonly referred to as batch jobs. For this you will be using the sbatch utility.

To submit a job to the queue, we need to write a shell script containing the commands we want to run. When the scheduler picks our job from the queue, it will run this script. There are several ways we could create this script on the cluster, but for short scripts it's often easiest to use a command-line text editor to create it directly on the cluster. For more complex scripts you might prefer to write them on your own computer and transfer them across, but it's relatively rare that job submission scripts get that complex.

One common text editor that you should always have access to on systems like CREATE is nano:

nano test_job.sh

Nano is relatively similar to a basic graphical text editor like Notepad on Windows - you have a cursor (controlled by the arrow keys) and text is entered as you type. Once we're done, we can use Ctrl + O to save the file - at this point if we haven't already told Nano the filename it will ask for one. Then finally, Ctrl + X to exit back to the command line. If we ever forget these shortcuts, Nano has a helpful reminder bar at the bottom.

If you find yourself doing a lot of text editing on the cluster, it may be worth learning to use a more advanced text editor like Vim or Emacs, but Nano is enough for most people.

We are going to start with a simple shell script test_job.sh that will contain the commands to be run during the job execution:

#!/bin/bash -l

echo "Hello World! "`hostname`
sleep 60

From the login node, submit the job to the scheduler using:

sbatch --partition cpu --reservation=cpu_introduction test_job.sh

With this command we tell the scheduler that we want to use the partition named "cpu" and the reserved nodes we've put aside for this workshop. By reserving some space on the cluster, we've hopefully made sure that submitted jobs will run quickly. Outside of this workshop you won't usually have access to a reservation, so you should submit your jobs without the --reservation=cpu_introduction argument.

If necessary, we can also often get test jobs like these to run more quickly by using the interruptible_cpu queue. The interruptible queues make use of otherwise unused capacity on private nodes, but if the owners of those nodes want to use them, your running jobs may be cancelled. This is useful for quick testing, but if you're going to use the interruptible queues for real jobs you need to make sure they can be cancelled safely without losing progress - this is often done via checkpointing.
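
For example, the same script could be submitted without the reservation, or to the interruptible partition for a quick test:

sbatch --partition cpu test_job.sh
sbatch --partition interruptible_cpu test_job.sh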

Once the command is executed you should see something similar to:

k1234567@erc-hpc-login1:~$ sbatch --partition cpu --reservation=cpu_introduction test_job.sh
Submitted batch job 56543

Info

If you do not define a partition during job submission, the default cpu partition will be used.

The job id (56543) is a unique identifier assigned to your job and can be used to query the status of the job. We will go through it in the job monitoring section.

Important

When submitting a support request, please provide the job ids of the relevant failed or problematic jobs.

Interactive jobs

Sometimes you might need or want to run things interactively rather than submitting them as batch jobs. This could be because you want to debug or test something, or because the application/pipeline does not support non-interactive execution. To request an interactive job via the scheduler, use the srun utility:

srun --partition cpu --reservation cpu_introduction --pty /bin/bash -l

The request will go through the scheduler and, if resources are available, you will be placed on a compute node, i.e.

k1234567@erc-hpc-login1:~$ srun --partition cpu --reservation cpu_introduction --pty /bin/bash -l
srun: job 56544 queued and waiting for resources
srun: job 56544 has been allocated resources
k1234567@erc-hpc-comp001:~$

To exit an interactive job, we use the Bash command exit - this exits the current shell, so if you're inside an interactive job it will exit that; if you're just logged in to one of the login nodes, it will disconnect your SSH session.
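
For example, exiting the interactive job from the earlier session returns you to the login node prompt:

k1234567@erc-hpc-comp001:~$ exit
k1234567@erc-hpc-login1:~$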

Warning

At the moment there are no dedicated partitions or nodes for interactive sessions, and those sessions share resources with all of the other jobs. If there are no free resources available, your request will fail.

Running applications with Graphical User Interfaces (GUIs)

To run an interactive job for an application with a Graphical User Interface (GUI), for example RStudio, you must enable 'X11 forwarding' and 'authentication agent forwarding' when you connect to CREATE:

ssh -XA hpc.create.kcl.ac.uk

Then request compute resources using salloc - once your resources have been allocated you can then connect to the node with a further ssh connection:

salloc <parameters>
ssh -X $SLURM_NODELIST
xeyes
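
As an illustration, a salloc request for a one-hour, single-task interactive session on the cpu partition might look like this (the exact resource options are up to you and your workload):

salloc --partition cpu --ntasks 1 --mem 2G --time 0-1:00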

Job monitoring

It is important to be able to see the status of your running jobs, or to find out information about completed or failed jobs.

To monitor the status of running jobs, use the squeue utility. Without any arguments, the command will print queue information for all users; however, you can use the --me parameter to filter the list:

k1234567@erc-hpc-login1:~$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             56544       cpu     bash k1234567  R       6:41      1 erc-hpc-comp001

Info

Job state is described in the ST column. For the full list of states please see squeue docs (JOB STATE CODES section).

The most common codes that you might see are:

  • PD: Pending - Job is awaiting resource allocation.
  • R: Running - Job currently has an allocation.
  • CG: Completing - Job is in the process of completing. Some processes on some nodes may still be active.
  • CD: Completed - Job has terminated all processes on all nodes with an exit code of zero.

For jobs that have finished, you can use the sacct utility to extract the relevant information.

sacct -j 56543
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
56543        test_job.+        cpu        kcl          1  COMPLETED      0:0
56543.batch       batch                   kcl          1  COMPLETED      0:0

The above shows the default information.

You can use the --long option to display all of the stored information. Alternatively, you can customise your queries to display the information that you are specifically looking for by using the --format parameter:

sacct -j 13378473 --format=ReqMem,AllocNodes,AllocCPUS,NodeList,JobID,Elapsed,State
    ReqMem AllocNodes  AllocCPUS        NodeList        JobID    Elapsed      State
---------- ---------- ---------- --------------- ------------ ---------- ----------
    1000Mc          1          1         noded19 13378473       00:00:01  COMPLETED
    1000Mc          1          1         noded19 13378473.ba+   00:00:01  COMPLETED

For the list of available options please see the job accounting fields in the sacct documentation.

Cancelling jobs

You can cancel running or queued jobs using the scancel utility. You can cancel a specific job using its job id:

k1234567@erc-hpc-login1:~$ scancel 56544

If you want to cancel all of your jobs, you can add the --user option:

k1234567@erc-hpc-login1:~$ scancel --user k1234567

Choosing the resources

Jobs require resources to be defined, e.g. the number of nodes, the number of CPUs, the amount of memory, or the runtime. Defaults, such as a 1-day runtime, 1 core and 1 node, are provided for convenience, but in most cases they will not be sufficient to accommodate more intensive jobs, and an explicit request has to be made for more.

Warning

If you do not request enough resources and your job exceeds the allocated amount, it will be terminated by the scheduler.

The resources can be requested by passing additional options to the sbatch and srun commands. In most cases you will be using the following parameters to define the resources for your job:

  • partition: --partition, or -p, defines which partition (queue) your job will target
  • memory: --mem defines how much memory your job needs per allocated node
  • tasks: --ntasks, or -n, defines how many tasks your job needs
  • nodes: --nodes, or -N, defines how many nodes your job requires
  • cpus per task: --cpus-per-task defines how many cpus per task are needed
  • runtime: --time, or -t, defines how much time your job needs to run (in the D-HH:MM format)
  • reservation: --reservation asks the scheduler to allocate your job to some pre-existing reserved space

For a full list of options please see sbatch documentation.

You can provide those options as arguments to the sbatch, or srun commands, i.e.

sbatch --job-name test_job --partition cpu --reservation=cpu_introduction --ntasks 1 --mem 1G --time 0-0:2 test_job.sh

however, that can be time consuming and prone to errors. Luckily, you can also define these resource requirements in your submission script using #SBATCH tags. The sample job from the previous section would then look like this:

#!/bin/bash -l

#SBATCH --job-name=hello-world
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --reservation=cpu_introduction
#SBATCH -t 0-0:2 # time (D-HH:MM)

echo "Hello World! "`hostname`
sleep 60

Info

In bash and other shell scripting languages, # is a special character that usually marks a comment (#! is an exception, used to define the interpreter that the script will be executed with) and is ignored during execution. For information on special characters in bash, please see here.

#SBATCH is a special tag that is interpreted by SLURM (other schedulers use a similar mechanism) when the job is submitted. When the script is run outside the scheduler, the tag is ignored (because of the # comment). This is quite useful, as it means the script can be executed outside the scheduler's control and will still run successfully.

Hint

When requesting resources, try to request close to what your job actually needs rather than the maximum. Think of it as Tetris - it's easier to fit smaller blocks than larger ones.

Advanced resource requirements

In some situations you might want to request specific hardware, such as a particular chipset or a fast network interconnect. This can be achieved with the --constraint option.

To request a specific type of GPU, for example an a100, you would use:

#SBATCH --constraint=a100

or to request a specific type of processor/architecture you would use:

#SBATCH --constraint=haswell

Job log files

By default, log files will be placed in the directory you submitted from (i.e. the current working directory), named slurm-jobid.out. Both the stdout and stderr streams from the job will be redirected to that file. These log files are important as they will give you clues about the execution of your application, in particular why it has failed.
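
For example, for the hello-world job submitted earlier (job id 56543), you could inspect its log file like this - the hostname in the output depends on where the job ran:

k1234567@erc-hpc-login1:~$ cat slurm-56543.out
Hello World! erc-hpc-comp001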

You can modify this to suit your needs by explicitly defining a different path:

#SBATCH --output=/scratch/users/%u/%j.out

You can also separate stdout and stderr into two log files:

#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --error=/scratch/users/%u/%j.err

Info

%u and %j are replacement symbols (representing username and job id) that will be replaced with actual values once the job is submitted. Please see file patterns section for details.

Exercises - submitting jobs

Work through the exercises in this section to practice submitting and debugging jobs.

Parallel jobs

The main advantage of using an HPC system is the ability to utilise its large compute power to run jobs in parallel.

Hint

When considering running parallel jobs, consult your application's documentation to find out whether it can be run in a parallel environment. Nowadays most applications support some level of parallelism. However, if the application does not support it, do not request additional resources, as they will be wasted and might incur longer waiting times.

Multithreaded/multicore (SMP) jobs

These types of jobs occupy multiple cores on a single node, often using a method known as OpenMP. The program we want to run must be designed to support multithreading. You can request such a job using the following script:

#!/bin/bash -l

#SBATCH --job-name=omp_hello
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=2G
#SBATCH --reservation=cpu_introduction
#SBATCH -t 0-0:02 # time (D-HH:MM)

/datasets/hpc_training/utils/omp_hello

One possible output would be:

Hello World from OpenMP thread 2 of 4
Hello World from OpenMP thread 3 of 4
Hello World from OpenMP thread 0 of 4
Hello World from OpenMP thread 1 of 4

Note here that the lines don't come out in any particular order - each time you run the program you might end up with a different result. This is because the program doesn't make any attempt to synchronise the printing of each line, it just executes all of them in parallel.
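
Many OpenMP programs decide how many threads to start from the OMP_NUM_THREADS environment variable. A common (optional) pattern is to set it from the SLURM allocation just before launching the program, so the thread count always matches the number of CPUs you requested:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

/datasets/hpc_training/utils/omp_hello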

Hint

You can also request memory per CPU, rather than per node, using the --mem-per-cpu option. --mem and --mem-per-cpu are mutually exclusive, meaning you can use one or the other in your resource request.

Array jobs

Array jobs offer a mechanism for submitting and managing collections of similar jobs quickly and easily. All jobs will have the same initial options (e.g. memory, cpu, runtime, etc.) and will run the same commands. Using array jobs is an easy way to parallelise your workloads, as long as the following is true:

  • Each array task can run independently of the others and there are no dependencies between the different components (an embarrassingly parallel problem).
  • There is no requirement for all of the array tasks to run simultaneously.
  • You can link the array task id (SLURM_ARRAY_TASK_ID) to your data or to the execution of your application (see the sketch after the sample job below).

To define an array job you will use the --array=range[:step][%max_active] option:

  • range defines the index values and can consist of a comma-separated list and/or a range of values with a "-" separator, e.g. 1,2,3,4, or 1-4, or 1,2-4
  • step defines the increment between the index values, e.g. 0-15:4 is equivalent to 0,4,8,12
  • max_active defines the number of simultaneously running tasks at any given time, e.g. 1-10%2 means only two array tasks can run simultaneously for the given array job

A sample array job is given below:

#!/bin/bash -l
#SBATCH --job-name=array-sample
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --reservation=cpu_introduction
#SBATCH -t 0-0:02 # time (D-HH:MM)
#SBATCH --array=1-3

echo "Array job - task id: $SLURM_ARRAY_TASK_ID"

Info

When the job starts running a separate job id in the format jobid_taskid will be assigned to each of the tasks.

As a result, the array job will produce a separate log file for each of the tasks, i.e. you will see multiple files in the slurm-jobid_taskid.out format.
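
One common way to link the task id to your data - shown here as a minimal sketch, where inputs.txt is a hypothetical file listing one input file per line - is to use the id to select a line from that list:

#!/bin/bash -l
#SBATCH --job-name=array-inputs
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --reservation=cpu_introduction
#SBATCH -t 0-0:02 # time (D-HH:MM)
#SBATCH --array=1-3

# pick the line matching this task's id from the (hypothetical) inputs.txt
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)

echo "Array task $SLURM_ARRAY_TASK_ID will process: $INPUT"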

MPI jobs

Sometimes you might want to utilise resources on multiple nodes simultaneously to perform computations.

As mentioned earlier, requesting the resource by itself will not make your application run in parallel - the application has to support parallel execution.

Message Passing Interface (MPI) is a standard designed for parallel execution, and it allows programs to exploit multiple processing cores in parallel.

Although MPI programming is beyond the scope of this course, if your application uses, or supports MPI then it can be executed on multiple nodes in parallel. For example, given the following submission script:

#!/bin/bash -l
#SBATCH --job-name=multinode-test
#SBATCH --partition=cpu
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --mem=2G
#SBATCH --reservation=cpu_introduction
#SBATCH -t 0-0:05 # time (D-HH:MM)

module load openmpi/4.1.3-gcc-10.3.0-python3+-chk-version

mpirun /datasets/hpc_training/utils/mpi_hello

A sample output would be

Hello world from process 11 of 16 on host erc-hpc-comp006
Hello world from process 2 of 16 on host erc-hpc-comp005
Hello world from process 15 of 16 on host erc-hpc-comp006
Hello world from process 13 of 16 on host erc-hpc-comp006
Hello world from process 12 of 16 on host erc-hpc-comp006
Hello world from process 1 of 16 on host erc-hpc-comp005
Hello world from process 14 of 16 on host erc-hpc-comp006
Hello world from process 0 of 16 on host erc-hpc-comp005
Hello world from process 3 of 16 on host erc-hpc-comp005
Hello world from process 9 of 16 on host erc-hpc-comp006
Hello world from process 7 of 16 on host erc-hpc-comp005
Hello world from process 6 of 16 on host erc-hpc-comp005
Hello world from process 10 of 16 on host erc-hpc-comp006
Hello world from process 8 of 16 on host erc-hpc-comp006
Hello world from process 4 of 16 on host erc-hpc-comp005
Hello world from process 5 of 16 on host erc-hpc-comp005

GPU jobs

GPU jobs utilise the GPUs present in the system. You can request them using the following script - note that this time we're using the gpu partition and the gpu_introduction reservation:

#!/bin/bash -l

#SBATCH --job-name=gpu-job
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --reservation=gpu_introduction
#SBATCH -t 0-0:02 # time (D-HH:MM)
#SBATCH --gres gpu:1

nvidia-smi --id=$CUDA_VISIBLE_DEVICES

A sample output would be:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K40c          On   | 00000000:08:00.0 Off |                    0 |
| 23%   32C    P8    23W / 235W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Hint

You can request X gpus (up to 4) using --gres gpu:X
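
For example, to request two GPUs in your submission script:

#SBATCH --gres gpu:2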

Exercises - parallel jobs and benchmarking

Work through the exercises in this section to practice submitting parallel jobs, and this section to look at optimisation and benchmarking.