Note: 📢 This document is based on my understanding of SLURM and is in no way a detailed guide covering every single topic. Take it as a practical guide from a noob’s perspective, written while diving into the tool for the first time.

Introduction:

This guide is designed to help you use the SLURM scheduler effectively on Rocky Linux 9.5. The cluster lets you run computational jobs in both interactive and non-interactive (batch) modes. The goal here is to build a compilation farm; although this guide focuses specifically on compiling the Linux kernel, the same setup can be used to compile any other software, provided its prerequisites and dependencies are known.

Lab Infrastructure:

The following are all virtual machines on VMware ESXi:

  1. Master:

    • CPUs 4
    • Memory 4 GB
    • Hard disk 20 GB
    • Hostname: master
  2. Node 1:

    • CPUs 4
    • Memory 4 GB
    • Hard disk 40 GB
    • Hostname: node1
  3. Node 2:

    • CPUs 8
    • Memory 8 GB
    • Hard disk 40 GB
    • Hostname: node2
  4. Network File Storage

    • Since a kernel build generates a large number of intermediate files, at least 30 GB of free space is required for a successful compilation.
    • Used the existing testing server assigned to me.
    • NFS share mounted at /mnt/slurm_share on all nodes

Every instance runs Rocky Linux 9.5 with SSH enabled, root login permitted, and the IP addresses of all four machines defined in its /etc/hosts file.
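
A sketch of the /etc/hosts entries used on every machine; the addresses and the NFS server's hostname below are placeholders from the lab's 10.10.40.0/24 network and must be replaced with your actual values:

# /etc/hosts (identical on all machines; example addresses only)
10.10.40.11  master
10.10.40.12  node1
10.10.40.13  node2
10.10.40.14  nfs-server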

Chapter 1: The installation:

  1. Install and configure dependencies

    Installing SLURM requires the EPEL repository (and CRB) to be enabled on all instances. Install and enable them, together with the build dependencies, via:

    sudo dnf config-manager --set-enabled crb
    sudo dnf install epel-release
    sudo dnf groupinstall "Development Tools"
    sudo dnf install munge munge-devel rpm-build rpmdevtools python3 gcc make openssl-devel pam-devel

    MUNGE is an authentication mechanism for secure communication between Slurm components. Configure it on all instances using:

    sudo useradd munge
    sudo mkdir -p /etc/munge /var/log/munge /var/run/munge
    sudo chown -R munge:munge /etc/munge /var/log/munge /var/run/munge
    sudo chmod 0755 /var/run/munge

    On Master:

    sudo /usr/sbin/create-munge-key
    sudo chown munge:munge /etc/munge/munge.key
    sudo chmod 0400 /etc/munge/munge.key

    Copy the key to both nodes:

    scp /etc/munge/munge.key root@node1:/etc/munge/
    scp /etc/munge/munge.key root@node2:/etc/munge/
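
    On each node, make sure the copied key keeps the munge ownership and restrictive permissions, since munged refuses to start with an insecure key file. One way to do it from the master:

    ssh root@node1 "chown munge:munge /etc/munge/munge.key && chmod 0400 /etc/munge/munge.key"
    ssh root@node2 "chown munge:munge /etc/munge/munge.key && chmod 0400 /etc/munge/munge.key"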

    Start and enable the service:

    sudo systemctl enable --now munge
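
    A quick way to confirm MUNGE works across the cluster is to generate a credential on the master and decode it on each node:

    munge -n | unmunge              # local check on the master
    munge -n | ssh node1 unmunge    # should report STATUS: Success (0)
    munge -n | ssh node2 unmunge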
  2. Installation of SLURM

    Slurm is available in the EPEL repo. Install on all 3 instances:

    sudo dnf install slurm slurm-slurmd slurm-slurmctld slurm-perlapi

    If by any chance the packages are not available, download the tarball from SchedMD Downloads, extract it, then compile and install it from inside the extracted source directory:

    ./configure
    make -j$(nproc)
    sudo make install

Chapter 2: The Configuration:

  1. Slurm configuration

    On all 3 instances:

    sudo useradd slurm
    sudo mkdir -p /etc/slurm /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
    sudo chown slurm:slurm /var/spool/slurmctld /var/spool/slurmd /var/log/slurm

    Edit the configuration on master:

    sudo nano /etc/slurm/slurm.conf

    Ensure the following key lines are present and correctly configured:

    ClusterName=debug
    SlurmUser=slurm
    SlurmctldHost=master
    SlurmctldPort=6817
    SlurmdPort=6818
    AuthType=auth/munge
    StateSaveLocation=/var/spool/slurmctld
    SlurmdSpoolDir=/var/spool/slurmd
    SwitchType=switch/none
    MpiDefault=none
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmdPidFile=/var/run/slurmd.pid
    ProctrackType=proctrack/pgid
    ReturnToService=1
    SchedulerType=sched/backfill
    SlurmctldTimeout=300
    SlurmdTimeout=30
    NodeName=node1 CPUs=4 RealMemory=3657 State=UNKNOWN
    NodeName=node2 CPUs=8 RealMemory=7682 State=UNKNOWN
    PartitionName=debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP
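
    The CPUs and RealMemory values in the NodeName lines above must match what each node reports; you can print the values Slurm detects by running the following on each compute node and copying the numbers into slurm.conf:

    # run on each compute node; prints a NodeName= line with the detected CPUs, sockets, and RealMemory
    slurmd -C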

    Copy configuration to nodes:

    scp /etc/slurm/slurm.conf root@node1:/etc/slurm/slurm.conf
    scp /etc/slurm/slurm.conf root@node2:/etc/slurm/slurm.conf

    Start and enable the services. On the master:

    sudo systemctl enable --now slurmctld

    On node1 and node2:

    sudo systemctl enable --now slurmd
  2. Firewall Configuration:

    Open required ports:

    sudo firewall-cmd --permanent --add-port=6817/tcp
    sudo firewall-cmd --permanent --add-port=6818/tcp
    sudo firewall-cmd --permanent --add-port=6819/tcp
    sudo firewall-cmd --reload
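
    At this point the cluster should come up. A quick sanity check from the master (the commands themselves are covered in the next chapter):

    scontrol ping    # reports whether slurmctld on the master is UP
    sinfo            # both nodes should eventually appear in the debug partition as idle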

Chapter 3: Testing and Introduction to the commands:

(While this is only a short overview of the commands and their flags, you can always consult the man pages to understand them more deeply.)

  1. sinfo:

    Displays node and partition information:

    sinfo
  2. srun:

    Runs commands interactively on compute nodes:

    srun -N2 -n2 nproc
  3. sbatch:

    Submits a job script for batch execution (a minimal example script is shown after this list):

    sbatch testjob.sh
  4. squeue:

    Displays the job queue, including pending and running jobs:

    squeue
  5. scancel:

    Cancels a submitted job:

    scancel 1
  6. scontrol:

    Displays detailed job, node, and partition information:

    scontrol show job 1
    scontrol show partition
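
A minimal sketch of what the testjob.sh referenced above might contain; the filename and contents are just an example:

#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --output=testjob_%j.out
#SBATCH --nodes=1
#SBATCH --ntasks=1

# trivial payload: report which node ran the job and how many CPUs it sees
hostname
nproc
sleep 30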

Chapter 4: Setting up the NFS storage.

It is a good idea to have shared storage so that all nodes see the same source tree and build artifacts. Install nfs-utils on every machine:

sudo dnf install nfs-utils

On the NFS server:

mkdir /srv/slurm_share
nano /etc/exports

Add the following line:

/srv/slurm_share 10.10.40.0/24(rw,sync,no_subtree_check,no_root_squash)

Open the necessary NFS-related services in the firewall on the NFS server:

firewall-cmd --permanent --add-service=nfs
firewall-cmd --permanent --add-service=rpc-bind
firewall-cmd --permanent --add-service=mountd
firewall-cmd --reload

Enable the NFS service and export the share:

systemctl enable --now nfs-server
exportfs -rav

On the master and compute nodes:

sudo mkdir /mnt/slurm_share

Add the mount in /etc/fstab, replacing <nfs-server-ip> with the actual IP address (or /etc/hosts name) of your NFS server:

<nfs-server-ip>:/srv/slurm_share /mnt/slurm_share nfs defaults 0 0

Reboot the machines and verify that the share mounts properly on each one; the quick check below avoids waiting for a full reboot.
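
A minimal verification on each machine, assuming the fstab entry above is in place:

sudo mount -a                 # mounts everything in /etc/fstab, including the new NFS entry
df -h /mnt/slurm_share        # should show the NFS export and its free space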

Chapter 5: Setting up the Compile/Build Environment

Install the kernel build dependencies on both compute nodes. Run these as two separate srun commands; chaining them with && would execute the second dnf on the submitting host instead of on the nodes:

srun -N2 -n2 sudo dnf groupinstall -y "Development Tools"
srun -N2 -n2 sudo dnf install -y ncurses-devel bison flex elfutils-libelf-devel openssl-devel wget bc dwarves

Download and extract the Linux kernel source from kernel.org into the shared directory:

cd /mnt/slurm_share
wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.14.8.tar.xz
tar xvf linux-6.14.8.tar.xz

Generate the default architecture-specific configuration from inside the extracted source tree:

cd linux-6.14.8
make defconfig

Create compile_kernel.sh in the shared directory:

#!/bin/bash
#SBATCH --job-name=kernel_build
#SBATCH --output=kernel_build_%j.out
#SBATCH --error=kernel_build_%j.err
#SBATCH --time=03:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=7G

KERNEL_SOURCE_PATH="/mnt/slurm_share/linux-6.14.8"
BUILD_OUTPUT_DIR="/mnt/slurm_share/kernel_builds/${SLURM_JOB_ID}"

mkdir -p "$BUILD_OUTPUT_DIR"
cd "$KERNEL_SOURCE_PATH"

NUM_MAKE_JOBS=${SLURM_CPUS_PER_TASK}
make -j"${NUM_MAKE_JOBS}" ARCH=x86_64 bzImage modules

if [ $? -eq 0 ]; then
    cp "$KERNEL_SOURCE_PATH/arch/x86/boot/bzImage" "$BUILD_OUTPUT_DIR/"
    echo "Kernel compiled successfully; bzImage copied to $BUILD_OUTPUT_DIR"
else
    echo "Kernel compilation failed."
fi
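
With the script saved on the share, submit it from the master and keep an eye on it; <jobid> below is a placeholder for whatever ID sbatch prints back:

cd /mnt/slurm_share
sbatch compile_kernel.sh              # prints "Submitted batch job <jobid>"
squeue                                # the job should appear, then start running on node2
tail -f kernel_build_<jobid>.out      # follow the build log (replace <jobid>)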