Note: 📢 This document is based on my understanding of SLURM and is in no way a detailed guide covering every single topic. Take this as a practical guide from a noob’s perspective diving into it.
Introduction:
This guide is designed to help you effectively use the SLURM scheduler on a Rocky Linux 9.5 server. The server allows you to run computational jobs in both interactive and non-interactive modes. The goal here is to build a compilation farm. While this guide focuses specifically on compiling the Linux kernel, the same setup can be used to compile any other tool, provided its prerequisites and dependencies are known.
Lab Infrastructure:
The following VMs all run on VMware ESXi.
Master:
- CPUs: 4
- Memory: 4 GB
- Hard disk: 20 GB
- Hostname: master
Node 1:
- CPUs: 4
- Memory: 4 GB
- Hard disk: 40 GB
- Hostname: node1
Node 2:
- CPUs: 8
- Memory: 8 GB
- Hard disk: 40 GB
- Hostname: node2
Network File Storage
- Since compiling generates a very large number of files, at least 30 GB of free space is required for a successful compilation.
- Used the existing testing server assigned to me.
- NFS share mounted on the cluster at /mnt/slurm_share
Every instance has Rocky Linux 9.5 installed with SSH and root login enabled, and the IPs of all four machines are defined in the /etc/hosts file.
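For reference, a minimal /etc/hosts might look like the following. The addresses are placeholders taken from the 10.10.40.0/24 range used later for NFS; substitute your own:
10.10.40.10  master
10.10.40.11  node1
10.10.40.12  node2
10.10.40.13  nfs-server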
Chapter 1: The Installation:
Install and configure dependencies
Installing Slurm requires the EPEL repository on all instances. Install and enable it, along with the required build dependencies, via:
dnf config-manager --set-enabled crb
dnf install epel-release
sudo dnf groupinstall "Development Tools"
sudo dnf install munge munge-devel rpm-build rpmdevtools python3 gcc make openssl-devel pam-devel
MUNGE is an authentication mechanism for secure communication between Slurm components. Configure it on all instances using:
sudo useradd munge
sudo mkdir -p /etc/munge /var/log/munge /var/run/munge
sudo chown munge:munge /etc/munge /var/log/munge /var/run/munge
sudo chmod 0755 /var/run/munge
On Master:
sudo /usr/sbin/create-munge-key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 0400 /etc/munge/munge.key
Copy the key to both nodes:
scp /etc/munge/munge.key root@node1:/etc/munge/
scp /etc/munge/munge.key root@node2:/etc/munge/
Start and enable the service:
sudo systemctl enable --now munge
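To confirm MUNGE authentication works between the master and the nodes, you can decode a credential locally and across SSH (a quick sanity check, assuming SSH access to the nodes as set up earlier):
munge -n | unmunge
munge -n | ssh node1 unmunge
munge -n | ssh node2 unmunge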
Installation of SLURM
Slurm is available in the EPEL repo. Install it on all three instances:
sudo dnf install slurm slurm-slurmd slurm-slurmctld slurm-perlapi
If the packages are not available, download the tarball from SchedMD Downloads, extract it, then compile and install with:
./configure
make -j$(nproc)
sudo make install
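Alternatively, since rpm-build and rpmdevtools were installed earlier, you can build RPM packages straight from the tarball and install those instead. This is just a sketch; the exact tarball name depends on the version you downloaded, and the packages typically land under ~/rpmbuild/RPMS/x86_64/:
rpmbuild -ta slurm-*.tar.bz2
sudo dnf install ~/rpmbuild/RPMS/x86_64/slurm-*.rpm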
Chapter 2: The Configuration:
Slurm configuration
On all 3 instances:
sudo useradd slurm
sudo mkdir -p /etc/slurm /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
sudo chown slurm:slurm /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
Edit the configuration on master:
sudo nano /etc/slurm/slurm.conf
Ensure the following key lines are present and correctly configured:
ClusterName=debug
SlurmUser=slurm
ControlMachine=master
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
ReturnToService=1
SchedulerType=sched/backfill
SlurmctldTimeout=300
SlurmdTimeout=30
NodeName=node1 CPUs=4 RealMemory=3657 State=UNKNOWN
NodeName=node2 CPUs=8 RealMemory=7682 State=UNKNOWN
PartitionName=debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP
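The CPUs and RealMemory values in the NodeName lines should match what each node actually reports. Running the following on each compute node prints a ready-made NodeName line you can paste into slurm.conf:
slurmd -C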
Copy the configuration to the nodes:
scp /etc/slurm/slurm.conf root@node1:/etc/slurm/slurm.conf
scp /etc/slurm/slurm.conf root@node2:/etc/slurm/slurm.conf
Start and enable the services (slurmctld on the master, slurmd on the compute nodes):
sudo systemctl enable --now slurmctld
sudo systemctl enable --now slurmd
Firewall Configuration:
Open required ports:
sudo firewall-cmd --permanent --add-port=6817/tcp
sudo firewall-cmd --permanent --add-port=6818/tcp
sudo firewall-cmd --permanent --add-port=6819/tcp
sudo firewall-cmd --reload
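Before moving on to the commands, a quick health check from the master confirms the controller is up and both nodes are responding (assuming the services above started cleanly):
scontrol ping
sinfo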
Chapter 3: Testing and Introduction to the commands:
(While this is a short overview of the commands and their flags, you can always use the man pages to understand them more deeply.)
sinfo: Displays node and partition information:
sinfo
srun: Runs commands interactively on compute nodes:
srun -N2 -n2 nproc
sbatch: Submits a job script:
sbatch testjob.sh
squeue: Displays details of currently running jobs:
squeue
scancel: Cancels a submitted job:
scancel 1
scontrol: Displays detailed job and node information:
scontrol show job 1
scontrol show partition
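The testjob.sh referenced in the sbatch example above could be as simple as the following sketch, which just reports where the job ran and how many CPUs it saw:
#!/bin/bash
#SBATCH --job-name=testjob
#SBATCH --output=testjob_%j.out
#SBATCH --nodes=1
#SBATCH --ntasks=1

# Report which node picked up the job and how many CPUs are visible
hostname
nproc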
Chapter 4: Setting up the NFS storage.
It is a good idea to have shared storage for SLURM. Install nfs-utils on every machine (server and clients):
sudo dnf install nfs-utils
On the NFS server:
mkdir /srv/slurm_share
nano /etc/exports
Add the following line:
/srv/slurm_share 10.10.40.0/24(rw,sync,no_subtree_check,no_root_squash)
Open necessary ports:
firewall-cmd --permanent --add-service={nfs,rpc-bind,mountd}
firewall-cmd --permanent --add-port={5555/tcp,5555/udp,6666/tcp,6666/udp}
firewall-cmd --reload
Export the share and enable the service:
exportfs -rav
systemctl enable --now nfs-server
On the master and compute nodes:
sudo mkdir /mnt/slurm_share
Add the mount in /etc/fstab:
<nfs-server-ip>:/srv/slurm_share /mnt/slurm_share nfs defaults 0 0
(Replace <nfs-server-ip> with the address of the NFS server itself, not the 10.10.40.0/24 network address.)
Reboot the machines and verify that the share mounts properly.
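Alternatively, you can test the fstab entry without rebooting; the <nfs-server-ip> placeholder is the same one used in the fstab line above:
showmount -e <nfs-server-ip>
sudo mount -a
df -h /mnt/slurm_share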
Chapter 5: Setting up the Compile/Build Environment
Install kernel build dependencies:
srun -N2 -n2 sudo dnf groupinstall "Development Tools" -y
srun -N2 -n2 sudo dnf install ncurses-devel bison flex elfutils-libelf-devel openssl-devel wget bc dwarves -y
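A quick way to confirm the toolchain landed on both nodes (the --label flag simply prefixes each output line with the task that produced it):
srun -N2 -n2 --label gcc --version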
Download the Linux kernel source from kernel.org into the shared directory, so every node can reach it:
cd /mnt/slurm_share
wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.14.8.tar.xz
tar xvf linux-6.14.8.tar.xz
Define the architecture-specific default config:
cd linux-6.14.8
make defconfig
Create compile_kernel.sh in the shared directory:
#!/bin/bash
#SBATCH --job-name=kernel_build
#SBATCH --output=kernel_build_%j.out
#SBATCH --error=kernel_build_%j.err
#SBATCH --time=03:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=8G
# Kernel source extracted on the shared mount (must match the version downloaded above)
KERNEL_SOURCE_PATH="/mnt/slurm_share/linux-6.14.8"
BUILD_OUTPUT_DIR="/mnt/slurm_share/kernel_builds/${SLURM_JOB_ID}"
mkdir -p "$BUILD_OUTPUT_DIR"
cd "$KERNEL_SOURCE_PATH"
NUM_MAKE_JOBS=${SLURM_CPUS_PER_TASK}
make -j"${NUM_MAKE_JOBS}" ARCH=x86_64 bzImage modules
if [ $? -eq 0 ]; then
    cp "$KERNEL_SOURCE_PATH/arch/x86/boot/bzImage" "$BUILD_OUTPUT_DIR/"
else
    echo "Kernel compilation failed."
    exit 1
fi
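Submit the script from the shared directory and monitor it with the commands from Chapter 3; the output and error files land next to the script, and the job ID in the tail command is whatever sbatch reports:
cd /mnt/slurm_share
sbatch compile_kernel.sh
squeue
tail -f kernel_build_<jobid>.out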