SLURM

A cluster is a network of resources that needs to be controlled and managed in order to operate without errors. The nodes must be able to communicate with each other; generally there are two categories - the login node (also called master or server node) and the worker nodes (also called client nodes). Commonly, users cannot access the worker nodes directly, but they can run programs on all nodes. Usually several users claim the resources for themselves. Therefore the distribution of the resources cannot be set by the users themselves but follows specific rules and strategies. All these tasks are taken over by the job and resource management system - a batch system.

Batch-System overview

A batch system is a service for managing resources and also the interface for the user. The user submits jobs - tasks with programs to execute and a description of the required resources and conditions. All jobs are managed by the batch system. The major components of a batch system are a server and the clients. The server is the main component and provides an interface for monitoring; its main task is assigning resources to the registered clients. The main task of the clients is the execution of the pending programs. A client also collects all information about the progress of the programs and the system status and provides this information to the server on request. A third, optional component of a batch system is the scheduler. Some batch systems have a built-in scheduler, but all offer the option to integrate an external scheduler into the system. The scheduler decides according to certain rules who may use which resources, when, and how many. A batch system with all the mentioned components is SLURM, which is presented in this section with background information and instructions for installation and use. Qlustar also has a SLURM integration.

SLURM Basics

SLURM (Simple Linux Utility for Resource Management) is a free batch system with an integrated job scheduler. SLURM was created in 2002 as a joint effort mainly by Lawrence Livermore National Laboratory, SchedMD, Linux Networx, Hewlett-Packard, and Groupe Bull. Soon more than 100 developers had contributed to the project. The result of these efforts is a software package that is used on many high-performance computers of the TOP500 list (including the currently fastest, Tianhe-2 [1]). SLURM is characterized by very high fault tolerance, scalability and efficiency. There are backups for the daemons (see section 5.3) and various options to respond dynamically to errors. It can manage more than 100,000 jobs, accept up to 1000 job submissions per second, and execute up to 600 jobs per second. Currently unused nodes can be shut down in order to save power. Moreover, SLURM is compatible with a variety of operating systems. Originally developed for Linux, many more platforms are supported today: AIX, *BSD (FreeBSD, NetBSD, and OpenBSD), Mac OS X, and Solaris. It is also possible to interconnect different systems and run jobs across them. For scheduling, a mature concept with a variety of options has been developed. With policy options, several levels can be defined, each of which can be managed separately. Thus a database can be integrated in which user groups and projects are recorded, each subject to their own rules. A user can also be granted rights as part of a group or project. SLURM is an active project that is still being developed. In 2010 the developers founded the company SchedMD and offer paid support for SLURM.

Setup

The heart of SLURM consists of two daemons [2] - slurmctld and slurmd. Both can have a backup. The control daemon slurmctld, as the name suggests, runs on the server. It initializes, controls and logs all activity of the resource manager. This service is divided into three parts: the Job Manager, which manages the queue of waiting jobs; the Node Manager, which holds status information about the nodes; and the Partition Manager, which allocates the nodes. The second daemon, slurmd, runs on each client. It performs all instructions coming from slurmctld and srun. With the command scontrol the client additionally reports its status information to the server. Once the connection is established, the various SLURM commands can be used from the server. Some of these can theoretically be called from a client, but they are usually run only on the server.

 
Figure 5.1: The SLURM infrastructure with the two most important services.

In Figure 5.1, some commands are shown as examples. The five most important ones are explained in detail below.

sinfo

This command displays the node and partition information. With additional options the output can be filtered and sorted.

Listing 5.1 Output of the sinfo Command.
$ sinfo
PARTITION   AVAIL   TIMELIMIT   NODES   STATE   NODELIST
debug*         up    infinite       1   down*   worker5
debug*         up    infinite       1   alloc   worker4

The column PARTITION shows the name of the partition; an asterisk means that it is the default partition. The column AVAIL refers to the partition and can show up or down. TIMELIMIT displays the user-specified time limit; unless specified, the value is assumed to be infinite. STATE indicates the status of the nodes listed under NODELIST. Possible states are allocated, completing, down, drained, draining, fail, failing, idle and unknown, with the respective abbreviations alloc, comp, down, drain, drng, fail, failg, idle and unk. An asterisk after the state means that no feedback was obtained from the node. NODELIST shows the node names set in the configuration file. The command accepts several options that can query additional information on the one hand and format the output as desired on the other. Complete list - https://computing.llnl.gov/linux/slurm/sinfo.html
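If more detail is required, the output can for example be switched to a node-oriented long format or be given a custom format string (a small sketch; the chosen format fields are only an illustration):

$ sinfo -N -l
$ sinfo -o "%P %a %l %D %t %N"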

srun

With this command you can interactively send jobs and/or allocate nodes.

Listing 5.2 Interactive srun Usage.

$ srun -N 2 -n 2 hostname

In this example, hostname is executed on 2 nodes with a total (not per node) of 2 tasks.

Listing 5.3 srun Command with some options.

$ srun -n 2 --partition=pdebug --allocate

With the option --allocate you only allocate the requested resources. Within this allocation, programs can be run that do not exceed the scope of the registered resources.
A complete list of options - https://computing.llnl.gov/linux/slurm/srun.html

scancel

This command is used to abort a job or one or more job steps. As a parameter you pass the ID of the job that is to be stopped. Which jobs a user is allowed to cancel depends on the user's rights.

Listing 5.4 scancel Command with some options.

$ scancel --state=PENDING --user=fuchs --partition=debug

In the example, all jobs are cancelled that are in the PENDING state, belong to the user fuchs and are in the partition debug. If you do not have the necessary permission, the output will indicate this accordingly.
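A single job or an individual job step can also be cancelled by passing its ID directly (a small sketch; the IDs are placeholders):

$ scancel 3
$ scancel 3.0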
Complete list of options - https://computing.llnl.gov/linux/slurm/scancel.html

squeue

This command displays job-specific information. Again, the output and the amount of information can be controlled via additional options.

Listing 5.5 Output of the squeue Command.
$ squeue
JOBID PARTITION    NAME  USER ST TIME NODES NODELIST(REASON)
    2     debug script1 fuchs PD 0:00     2 (Resources)
    3     debug script2 fuchs  R 0:03     1 worker4

JOBID indicates the identification number of the job. The column NAME shows the name belonging to the job ID; it can be modified or extended manually. ST is the status of the job, in the example PD - PENDING or R - RUNNING (there are many other statuses). Accordingly, under TIME the clock runs only for the job whose status is RUNNING. This time is not a limit but the current run time of the job. If the job is on hold, the timer remains at 0:00. Under NODES you see the number of nodes required for the job; the running job needs only one node, the waiting one two. The last column NODELIST(REASON) shows either the allocated nodes or the reason why the job is not running, such as (Resources). For more detailed information, other options must be passed.
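The output can, for instance, be limited to the jobs of one user or to jobs in a certain state (a small sketch; the user name is only an example):

$ squeue --user=fuchs --states=RUNNING
$ squeue -l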
A complete list - https://computing.llnl.gov/linux/slurm/squeue.html

scontrol

This command is used to view or modify the SLURM configuration and the state of one or more jobs. Most operations can be performed only by root. One can either write the desired options and commands directly after the call, or call scontrol alone and continue working in its interactive context.

Listing 5.6 Using the scontrol command.
$ scontrol
scontrol: show job 3
JobId=3 Name=hostname
   UserId=da(1000) GroupId=da(1000)
   Priority=2 Account=none QOS=normal WCKey=*123
   JobState=COMPLETED Reason=None Dependency=(null)
   TimeLimit=UNLIMITED Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
   SubmitTime=2013-02-18T10:58:40 EligibleTime=2013-02-18T10:58:40
   StartTime=2013-02-18T10:58:40 EndTime=2013-02-18T10:58:40
   SuspendTime=None SecsPreSuspend=0
   Partition=debug AllocNode:Sid=worker4:4702
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=snowflake0
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=1:1:1
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Reservation=(null)
scontrol: update JobId=5 TimeLimit=5:00 Priority=10

This example shows the use of the command from within its interactive context. Specific queries control how much information is displayed.
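The same operations can also be issued directly on the command line without entering the scontrol context, for example to display a node or to hold and release a job (a small sketch; the node name and job ID are placeholders):

$ scontrol show node worker4
$ scontrol hold 5
$ scontrol release 5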
A complete list of options - https://computing.llnl.gov/linux/slurm/scontrol.html

Mode of operation

There are basically two modes of operation. You can call a compiled program from the server interactively; here you can pass a number of options, such as the number of nodes and processes on which the program is to run. It is much more convenient, however, to write job scripts, in which the options are kept in a clear form and can be commented. The following sections explain the syntax and semantics of the options in interactive mode as well as job scripts.

Interactive

The key command here is srun. With it, jobs can be run interactively and resources can be allocated. The command is followed by options that are passed to the batch system. There are options whose values are stored in SLURM environment variables (see Section 5.3).
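For example, a program can be started on several nodes, and the environment variables that SLURM sets for a job can be inspected (a small sketch; which SLURM_* variables appear depends on the SLURM version):

$ srun -N 2 -n 4 hostname
$ srun -n 1 env | grep SLURM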

Jobscript

A job script is a file containing, for example, shell commands. It has no input or output parameters. In a job script, the environment variables are set directly; for this the corresponding lines need to be marked with #SBATCH. Other lines that start with a hash are comments. The main part of a job script is the program call. Optionally, additional parameters can be passed as in the interactive mode.

Listing 5.7 Example for a jobscript.
#!/bin/sh

# Time limit in minutes.
#SBATCH --time=1
# A total of 10 processes on 5 nodes
#SBATCH -N 5 -n 10
# Output to job.out, errors to job.err.
#SBATCH --error=job.err --output=job.out

srun hostname

The script is called as usual and runs all commands it contains. A job script can contain several program calls, also from different programs. Each program call can have its own options attached, which override the environment variables. If no further details are given, the options specified at the beginning of the script apply.
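A sketch of such a script with two program calls, where the second call overrides the number of processes set at the top (the program names are placeholders):

#!/bin/sh
#SBATCH -N 2 -n 4
#SBATCH --output=job.out

# Uses the 4 processes requested above.
srun ./program_a
# Overrides the process count for this step only.
srun -n 2 ./program_b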

Installation

SLURM can be installed, like other software, either via a prebuilt Ubuntu package or manually, which is significantly more complex. For the current version there is not yet a prebuilt package. If the advantages of the newer version listed below are not required, the older version is sufficient. In the following sections instructions are provided for both methods. Improvements in version 2.5.3:

  • Race conditions in job dependencies eliminated
  • Effective cleanup of terminated jobs
  • Correct handling of glib and gtk
  • Bugs fixed for newer compilers
  • Better GPU support

Package

The prebuilt package contains version 2.3.2. With

$ apt-get install slurm-llnl

you download and install the package. In the second step, a configuration file must be created. There is a website that accepts manual entries and generates the file automatically - https://computing.llnl.gov/linux/slurm/configurator.html. In this case, only the name of the machine has to be adjusted. Most importantly, this file must be identical on all clients and on the server. The daemons are started automatically. With sinfo you can quickly check whether everything went well.
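A possible sequence of steps could look like this (a sketch; the configuration path /etc/slurm-llnl/slurm.conf is the default of the Ubuntu package and may differ):

$ sudo apt-get install slurm-llnl
# Copy the generated configuration file to the server and every client.
$ sudo cp slurm.conf /etc/slurm-llnl/slurm.conf
$ sinfo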

Manually

Via the following link you download the archive with the latest version, 2.5.3, and unpack it.

$ wget http://schedmd.com/download/latest/slurm-2.5.3.tar.bz2
$ tar --bzip -x -f slurm*tar.bz2

Next, configure must be called. It would suffice to specify no options; however, one should make sure that the installation directory is the same on the server and the clients, because otherwise SLURM would have to be installed on each client individually. If gcc and make are available, SLURM can then be built and installed.

$ ./configure --prefix=/opt/slurm_2.5.3
$ make
$ make install

As with the package installation, the configuration file must also be created for a manual installation (see link above). The name of the machine has to be adjusted. The option ReturnToService should be set to 2; otherwise, after a failure a node is no longer used, even if it becomes available again. In addition,   should be selected.

Munge

Munge is a highly scalable authentication service. It is needed so that a client node responds only to requests from the "real" server and not from arbitrary hosts. For this, a key must be generated and copied to all associated client nodes and to the server. Munge can be installed in the usual way. In order for authentication requests to be processed correctly, the clocks must be synchronized on all machines. The ntp service is sufficient for this; the relative error is within Munge's tolerance range. The SLURM server should also be registered as the NTP server.
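The key distribution could look roughly as follows (a sketch; the helper create-munge-key and the path /etc/munge/munge.key correspond to the Debian/Ubuntu packaging and may differ):

$ sudo apt-get install munge
$ sudo /usr/sbin/create-munge-key
# Copy the key to every client node (as root), then restart the service.
$ sudo scp /etc/munge/munge.key worker4:/etc/munge/munge.key
$ sudo service munge restart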
After Munge has been installed, a system user must be added, matching the one specified in the configuration file (by default slurm).

$ sudo adduser --system slurm

The goal is that each user can submit jobs and execute SLURM commands from his working directory. This requires that the PATH in /etc/environment is adjusted by adding the following path:

/opt/slurm_2.5.3/sbin:

The path of course refers to the directory where SLURM has been installed and may differ. Now it is possible for the user to call SLURM commands from any directory. However, this does not apply to sudo commands, because sudo does not read the PATH environment variable. For this, the following detour is used:

$ sudo $(which <SLURM-Command>)

With which you obtain the full path of the command, which is read from the modified environment variable, and pass it to sudo. This is useful because after a manual installation both daemons must be started manually: slurmctld on the server and slurmd on the client machines. With the additional options -D -vvvv you can see the error messages in more detail if something went wrong. -D stands for debugging and -v for verbose; the more "v"s are strung together, the more detailed the output.
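Starting the daemons in debug mode with this detour could then look like this:

# On the server:
$ sudo $(which slurmctld) -D -vvvv
# On each client:
$ sudo $(which slurmd) -D -vvvv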

Scheduler

The scheduler is an arbitration logic that controls the temporal order of the jobs. This section covers the internal scheduler of SLURM with its various options. Potential external schedulers are also covered.

Internal Scheduler

In the configuration file, three methods can be defined - builtin, backfill and gang.

Builtin

This method works on the FIFO principle without further intervention.

Backfill

The backfill method is a kind of FIFO with more efficient allocation. If a job requires resources that are currently free and is queued behind other jobs whose resource claims cannot currently be satisfied, the "smaller" job is preferred. The time limit defined by the user is relevant here.

 
Figure 5.2: Effects of the FIFO and backfill strategies.


The left side of Figure 5.2 shows a starting situation with three jobs. Job1 and job3 each need only one node, while job2 needs two. Both nodes are free at the beginning. Following the FIFO strategy, job1 would be executed and block one node; the other two jobs would have to wait in the queue, although a node is still available. Following the backfill strategy, job3 would be preferred over job2, since the resources required for job3 are currently free. Very important here is the prescribed time limit: job3 finishes before job1 and thus would not delay job2's execution once both nodes become available. If job3 had a longer time limit, it would not be preferred.
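The strategy is selected in slurm.conf via the SchedulerType option, for example:

SchedulerType           = sched/backfill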

Gang

Gang scheduling is only applicable if the resources are not allocated exclusively (the option --shared must be specified). It then switches between time slices. Gang scheduling ensures that, as far as possible, related processes are handled in the same time slots, which reduces context switches. The SchedulerTimeSlice option specifies the length of a time slot in seconds. Within such a time slice, builtin or backfill scheduling can be used if multiple processes compete for resources. In older versions (below 2.1) the distribution follows the round-robin principle.
To use gang scheduling, at least three options have to be set:

PreemptMode             = GANG
SchedulerTimeSlice      = 10
SchedulerType           = sched/builtin

By default, 4 processes would be able to allocate the same resources. With the option

Shared=FORCE:xy

the number xy of jobs that may share a resource can be defined.

Policy

However, the scheduling capabilities of SLURM are not limited to these three strategies. With the help of policy options, each strategy can be refined and adjusted to the needs of users and administrators. The policy is a kind of house rules to which every job, user, group or project is subject. In large systems it quickly becomes confusing when each user has his own set of specific rules. Therefore SLURM also supports database connections via MySQL or PostgreSQL. To use a database, it must be explicitly configured for SLURM. On the SLURM side, certain options must be set so that the policy rules can be applied to the defined groups. Detailed description - https://computing.llnl.gov/linux/slurm/accounting.html.
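Once accounting is configured, clusters, accounts and users can be recorded in the database with sacctmgr, roughly as follows (a sketch; the cluster, account and user names are placeholders):

$ sacctmgr add cluster mycluster
$ sacctmgr add account myproject
$ sacctmgr add user fuchs Account=myproject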

Most options in scheduling target the priority of jobs. SLURM uses a complex calculation method to determine the priority - the multifactor job priority plugin. Five factors play a role in the calculation: age (waiting time of a pending job), fair-share (the difference between allocated and used resources), job size (number of allocated nodes), partition (a factor assigned to a node group) and QOS (a factor for the quality of service). Each of these factors also receives a weighting, which means that some factors can be made more dominant. The overall priority is the sum of the weighted factors (the factor values lie between 0.0 and 1.0):

Job_priority =
    (PriorityWeightAge) * (age_factor) +
    (PriorityWeightFairshare) * (fair-share_factor) +
    (PriorityWeightJobSize) * (job_size_factor) +
    (PriorityWeightPartition) * (partition_factor) +
    (PriorityWeightQOS) * (QOS_factor)

The detailed descriptions of the factors and their composition can be found here: https://computing.llnl.gov/linux/slurm/priority_multifactor.html.
Particularly interesting is the QOS (Quality of Service) factor. The prerequisite for its use is the multifactor priority plugin and that PriorityWeightQOS is nonzero. A user can specify a QOS for each job. This affects scheduling, context switches and limits. The allowed QOSs are specified as a comma-separated list in the database; QOSs in that list can be used by the users of the associated group. The default value normal does not affect the calculations. However, if the user knows that his job is particularly short, he could define it in his job script as follows:

#SBATCH --qos=short

This option increases the priority of the job (if configured accordingly), but cancels it after the time limit defined in this QOS. One should therefore estimate the run time realistically. The available QOSs can be displayed with the command:

$ sacctmgr show qos

The default values of the QOSs look like this:[3]

Table 5.1: Default values for QOS.
QOS                 Wall time limit   CPU time limit   Total node limit   Node limit
                    per job           per job          for the QOS        per user
short               1 hour            512 hours
normal (default)    4 hours           512 hours
medium              24 hours                           32
long                5 days                             32
long_contrib        5 days                             32
support             5 days


These values can be changed by the administrator in the configuration. Example:

MaxNodesPerJob=12

Complete list - https://computing.llnl.gov/linux/slurm/resource_limits.html
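Limits attached to a QOS can also be adjusted with the accounting tool, for instance (a sketch; the exact limit names depend on the SLURM version):

$ sacctmgr modify qos short set MaxWall=01:00:00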

External Scheduler

SLURM is compatible with various other schedulers, including Maui, Moab, LSF and Catalina. If an external scheduler is to be integrated, Builtin should be selected in the configuration file.

Maui

Maui is a freely available scheduler by Adaptive Computing.[4] The development of Maui was discontinued in 2005. The package can be downloaded after registration on the manufacturer's site. Java is required for the installation. Maui offers numerous policy and scheduling options; however, these are now offered by SLURM itself.

Moab

Moab is the successor of Maui. Since the Maui project was discontinued, Adaptive Computing has developed the package under the name Moab and under a commercial license. Moab is said to scale better than Maui. Paid support is also available for the product.

LSF

LSF (Load Sharing Facility) is a commercial software product from IBM. It is suitable not only for IBM machines, but also for systems with Windows or Linux operating systems.

Catalina

Catalina is a project that has been ongoing for years and of which there is currently only a pre-production release. It includes many features of Maui, supports grid computing and allows guaranteeing node availability after a certain time. Python is required for its use.

Conclusion

The product can be used without much effort. The development of SLURM adapts to current needs, so it can be used not only on a small scale (fewer than 100 cores) but also in leading, highly scalable architectures. This is supported by the reliability and sophisticated fault tolerance of SLURM. SLURM's scheduler options leave little to be desired, and no extensions are necessary. All in all, SLURM is a very well-made, up-to-date piece of software.

References

  1. http://www.top500.org/system/177999
  2. daemon - disk and execution monitor - on UNIX, a utility running in the background
  3. http://www.umbc.edu/hpcf/resources-tara/scheduling-policy.html
  4. http://www.adaptivecomputing.com/products/open-source/maui/