Slurm

Updated at 2016-12-21 08:26

Slurm is an open source cluster management and job scheduling system.

Slurm software consists of two daemons (a quick health check follows this list):

  • slurmd: daemon running on each compute node.
  • slurmctld: daemon running on the management node.
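
A minimal sketch of such a check, assuming the Slurm client tools are on your PATH:

# Is the controller (slurmctld) responding?
scontrol ping

# Are the compute nodes (each running slurmd) registered and responsive?
sinfo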

Slurm logical entities (each can be inspected with scontrol, as sketched below):

  • Node: a compute resource
  • Partition: a group of nodes; partitions may overlap
  • Job: an allocation of resources assigned to a user for a specified amount of time
  • Job Step: a set of possibly parallel tasks within a job
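
A sketch of inspecting each entity; the node name, job ID and step ID below are illustrative:

scontrol show node adev1        # a node
scontrol show partition batch   # a partition
scontrol show job 65646         # a job
scontrol show step 65646.0      # a job step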

Partitions can be considered job queues; each has an assortment of constraints (a submission sketch follows this list), such as:

  • job size limit
  • job time limit
  • users permitted to use it
  • TODO: etc?
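
For example, a job has to fit within the limits of the partition it is submitted
to. The partition name, time limits and script name below are illustrative;
whether a violating job is rejected outright or left pending depends on the
EnforcePartLimits configuration:

sbatch -p debug --time=15 my.script   # fits the 30-minute MaxTime of debug
sbatch -p debug --time=60 my.script   # exceeds MaxTime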

Commands:

  • sacct: reports accounting information about active or completed jobs and job steps.
  • salloc: allocates resources for a job in real time.
  • sattach: attaches STDIN, STDOUT and STDERR to a currently running job or job step.
  • sbatch: submits a job script, typically containing one or more parallel srun commands, for later execution.
  • sbcast: transfers a file from local disk to local disk on the nodes allocated to a job.
  • scancel: sends arbitrary signals to all processes associated with a job or step, usually to cancel a pending or running job.
  • scontrol: administrative tool used to view or modify Slurm state.
  • sinfo: reports the state of partitions and nodes managed by Slurm.
  • smap: reports the state of jobs, partitions and nodes as a graphical display.
  • squeue: reports the state of jobs and job steps.
  • srun: used to submit a job or step for execution in real time.
  • strigger: set, get, or view event triggers.
  • sview: graphical UI to get and update state information for jobs, partitions and nodes.

# Here we can see that we have 2 partitions, with 5 and 10 nodes respectively.
sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
debug*       up      30:00     2  down* adev[1-2]
debug*       up      30:00     3   idle adev[3-5]
batch        up      30:00     3  down* adev[6,13,15]
batch        up      30:00     3  alloc adev[7-8,14]
batch        up      30:00     4   idle adev[9-12]
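# ("*" after a node state means the node is not responding; "idle" nodes are
#  free and "alloc" nodes are allocated to running jobs.)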

# We have 3 jobs in the queue: 2 running (R) and 1 pending (PD).
squeue
JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
65646     batch  chem  mike  R 24:19     2 adev[7-8]
65647     batch   bio  joan  R  0:09     1 adev14
65648     batch  math  phil PD  0:00     6 (Resources)
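# (The "(Resources)" reason means the pending job is waiting for enough free
#  nodes to become available.)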

# scontrol shows more details
scontrol show partition
PartitionName=debug TotalNodes=5 TotalCPUs=40 RootOnly=NO
   Default=YES OverSubscribe=FORCE:4 PriorityTier=1 State=UP
   MaxTime=00:30:00 Hidden=NO
   MinNodes=1 MaxNodes=26 DisableRootJobs=NO AllowGroups=ALL
   Nodes=adev[1-5] NodeIndices=0-4

PartitionName=batch TotalNodes=10 TotalCPUs=80 RootOnly=NO
   Default=NO OverSubscribe=FORCE:4 PriorityTier=1 State=UP
   MaxTime=16:00:00 Hidden=NO
   MinNodes=1 MaxNodes=26 DisableRootJobs=NO AllowGroups=ALL
   Nodes=adev[6-15] NodeIndices=5-14

# Run /bin/hostname on three nodes (-N3); -l prefixes each output line with the task number.
srun -N3 -l /bin/hostname
0: adev3
1: adev4
2: adev5

# Run /bin/hostname as four tasks (-n4), one CPU per task by default.
srun -n4 -l /bin/hostname
0: adev3
1: adev3
2: adev3
3: adev3
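
# Node and task counts can be combined; e.g. four tasks spread over two nodes
# (output omitted, it depends on the allocation).
srun -N2 -n4 -l /bin/hostname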

# Queue a script for later execution on specific nodes.
# The script requests a one-minute time limit (#SBATCH --time=1).
cat my.script
#!/bin/sh
#SBATCH --time=1
/bin/hostname
srun -l /bin/hostname
srun -l /bin/pwd

sbatch -n4 -w "adev[9-10]" -o my.stdout my.script
sbatch: Submitted batch job 469

cat my.stdout
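# (my.stdout would contain the hostname printed by the batch script itself,
#  followed by the task-numbered hostnames and working directories from the
#  two srun steps; exact contents depend on the allocation.)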

# 1) allocate resources (1024 nodes) and spawn a shell in the allocation
# 2) transfer the executable a.out to /tmp/joe.a.out on the local storage of the allocated nodes
# 3) run it on the allocated nodes
# 4) delete it from the nodes
# 5) release the allocation by exiting the shell
salloc -N1024 bash
sbcast a.out /tmp/joe.a.out
srun /tmp/joe.a.out
srun rm /tmp/joe.a.out
exit

# Submit a batch job, check its status, and cancel it (here, the submission returned job id 473).
sbatch test
squeue
scancel 473
squeue
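
# Once a job has finished or been cancelled it no longer shows up in squeue,
# but sacct can still report it, assuming job accounting is enabled.
# The field list is just one possible selection.
sacct -j 473 --format=JobID,JobName,Partition,State,Elapsed,ExitCode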

Sources