Introduction
I wanted to understand how LSF actually works, not just at a surface “submit job, job runs” level, but the real mechanics. Most places that use LSF don’t document it well, and when they do, it’s written assuming you already run a production cluster.
So I decided to build a small LSF Community Edition cluster at home and treat it like an EDA-style environment: master node, execution nodes, login-style submission, interactive jobs, X11, the whole thing.
This post is not a clean guide. It’s more like notes from what actually happened.
Chapter 1: Why LSF and Not Slurm
I already have experience with Slurm, and I even wrote about using Slurm as a compilation farm. Slurm is straightforward. LSF is not.
But that’s exactly why I wanted to learn it.
LSF still shows up in EDA, semiconductor companies, and older HPC environments. And the way it thinks about hosts, submission, and execution is very different from Slurm.
In Slurm, a login node is mostly just a machine that can talk to slurmctld.
In LSF, every host must be known, typed, and classified. Even submit-only machines.
That difference alone is worth learning.
Chapter 2: The Cluster Layout I Used
I kept the setup intentionally small.
- lsf-master: Runs LIM and mbatchd, acts as master candidate.
- lsf-node1: Execution node.
- fedora: My Fedora workstation, acting as a login / submission host.
All machines could resolve each other via /etc/hosts. No DNS magic.
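In case it helps, a minimal sketch of what that looks like; the IP addresses here are placeholders, not my real ones:
192.168.1.10   lsf-master
192.168.1.11   lsf-node1
192.168.1.20   fedora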
I used Rocky Linux for lsf-master and lsf-node1, and Fedora for my submission machine.
Chapter 3: Getting LSF
This part deserves its own chapter because this is where things already got weird.
There is no obvious “download LSF” button.
I actually found the IBM Spectrum LSF Community Edition through a Reddit post. That post linked to IBM’s site, which then required me to:
- Create an IBM account
- Log in
- Accept licensing terms
- Navigate through IBM’s download portal
Only after all that was I able to download the tarball.
What I ended up with was something like:
lsfsce10.2.0.15-x86_64.tar.Z
This already tells you something important:
- IBM ships prebuilt binaries
- They are tied to a specific glibc and kernel baseline
That becomes important later when Fedora starts complaining.
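If you want to sanity-check your own baseline before installing, two standard commands (nothing LSF-specific) are enough:
ldd --version | head -n 1   # glibc version on this host
uname -r                    # running kernel version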
Chapter 3.1: Extracting the Tarball
On the master node (lsf-master), I extracted it under /tmp first:
tar -xvf lsfsce10.2.0.15-x86_64.tar.Z
cd lsfsce10.2.0.15-x86_64
Inside, you don’t get a fancy installer. You get scripts.
The real entry point is:
./lsfinstall
But do not run it blindly.
Chapter 3.2: install.config Is the Real Installer
LSF installation is driven almost entirely by a file called:
install.config
If you mess this up, the installer will still run, but you’ll regret it later.
This is roughly what I used (simplified to the important parts):
LSF_TOP="/usr/local/lsf"
LSF_ADMINS="edauser"
LSF_CLUSTER_NAME="lsfcluster"
LSF_MASTER_LIST="lsf-master"
LSF_SERVER_HOSTS="lsf-master lsf-node1"
ENABLE_EGO="Y"
EGO_DAEMON_CONTROL="Y"
LSF_LOCAL_RESOURCES="[resource mg]"
Things that matter here:
- LSF_TOP: This decides everything. Binaries, configs, logs. I kept it simple.
- LSF_ADMINS: This user must exist. I created edauser beforehand (see the sketch after this list).
- LSF_CLUSTER_NAME: This name shows up everywhere later. Pick once, don't change it.
- LSF_MASTER_LIST: This is not optional. If this is wrong, LIM will never behave.
- LSF_SERVER_HOSTS: These are execution-capable hosts. Submission-only hosts do NOT go here.
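Creating the admin user is nothing special; a minimal sketch (repeat on every host, and keep the UID consistent if you ever share filesystems between them):
sudo useradd -m edauser
sudo passwd edauser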
At this stage, Fedora was not included anywhere.
Chapter 3.3: Running the Installer
Only after editing install.config did I run:
./lsfinstall -f install.config
The installer:
- creates /usr/local/lsf
- drops binaries under versioned directories
- writes configs under /usr/local/lsf/conf
When it finished, I had:
/usr/local/lsf/10.1/
/usr/local/lsf/conf/
/usr/local/lsf/work/
At this point, LSF was technically “installed”.
That does not mean usable.
Chapter 3.4: Environment Setup
LSF does nothing unless you source its environment and start the daemons.
On the master and nodes:
source /usr/local/lsf/conf/profile.lsf
Without this:
- bhosts won't work
- lsid won't work
- errors will be extremely misleading
I added this to the admin user’s shell profile later, but initially I sourced it manually.
To start the daemons I used the command:
lsf_daemons start
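A few standard LSF commands are worth running right after this, just to confirm the cluster is actually alive (output will obviously depend on your setup):
lsid      # prints the LSF version, cluster name, and current master
lshosts   # hosts LIM knows about, with their type and model
bhosts    # batch server hosts and their status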
Chapter 4: The First Big Reality Check — Hosts Must Be Known
One mistake I made early was assuming that copying the tarball to another machine and sourcing profile.lsf was enough.
It is not.
LSF does not work like that.
When I ran:
bhosts
on my Fedora machine, I kept getting:
Failed in an LSF library call: LIM is down; try later
Even though:
- I could ping the master
- LIM ports were reachable
- SSH worked
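Checking that kind of reachability is quick; assuming the default LSF_LIM_PORT of 7869 (it can be overridden in lsf.conf), something like this is enough:
ping -c 1 lsf-master
nc -zv lsf-master 7869   # LIM's default TCP port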
The problem wasn’t networking.
The problem was that LSF didn’t know my Fedora machine existed.
Chapter 5: Teaching LSF That Fedora Exists (and What It Is)
This was the first major “oh” moment.
I kept assuming that since Fedora was only a submit host, LSF wouldn’t really care about it. That assumption was wrong.
If a machine is going to submit jobs, it must appear in the cluster definition file:
/usr/local/lsf/conf/lsf.cluster.<clustername>
Even if:
- it never runs jobs
- it never runs RES
- it is just a login box
LSF is very literal about this.
My Host section eventually looked like this:
Begin Host
HOSTNAME model type server RESOURCES
lsf-master ! ! 1 (mg)
lsf-node1 ! ! 1 ()
fedora ! ! 0 ()
End Host
That server 0 is important. Fedora is not an execution host. It only submits jobs.
Once I added Fedora here and ran:
lsadmin reconfig
the errors changed. They didn’t disappear, but they changed, which is usually how you know you’re moving forward with LSF.
At this point, I thought I was done.
I wasn’t.
Even after Fedora was listed in lsf.cluster, I kept getting this error:
Cannot find restarted or newly submitted job's submission host and host type
This one took time to understand.
LSF doesn’t just want to know that a host exists.
It also wants to know what kind of host it is:
- architecture
- operating system
That information does not live in lsf.cluster.
It lives in:
/usr/local/lsf/conf/lsf.shared
This is the file almost everyone forgets.
I had to explicitly define:
- the host model (X86_64)
- the host type (LINUX)
- and map every host, including Fedora
This is what finally fixed it:
Begin Host
HOSTNAME model type
lsf-master X86_64 LINUX
lsf-node1 X86_64 LINUX
fedora X86_64 LINUX
End Host
After saving this file, I ran:
lsadmin reconfig
Only after this did bhosts stop rejecting Fedora and job submission start behaving like it should.
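As a quick end-to-end check from the Fedora side, an interactive one-liner is the least ambiguous test, since the output comes straight back to your terminal:
bsub -I hostname   # should print lsf-master or lsf-node1, not fedora
bjobs -a           # recent jobs and the hosts they ran on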
This was a big lesson for me:
LSF will happily run with half the information missing, but it will fail in ways that make it feel like something much deeper is broken.
Chapter 6: Interactive Jobs Work… Until X11 Shows Up
Once submission worked, I tried:
bsub -Is -XF xterm
And hit a whole new class of errors:
X11 connection rejected because of wrong authentication
At this point, LSF was fine.
The cluster was fine.
This was pure SSH + X11 pain.
What Was Actually Missing
The main issues were:
- xauth not installed on all nodes
- X11 forwarding not consistently enabled
- SSH key-based auth not fully set up
Installing xauth everywhere and making sure:
X11Forwarding yes
X11UseLocalhost no
were set in sshd_config on all machines finally fixed it.
Only after this did:
bsub -Is -XF -m lsf-master xterm
actually open a window.
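For concreteness, a sketch of that fix on Rocky / Fedora style hosts; the package name is what Rocky uses, so adjust it if your distro differs:
sudo dnf install -y xorg-x11-xauth       # provides the xauth binary
# in /etc/ssh/sshd_config on every host:
#   X11Forwarding yes
#   X11UseLocalhost no
sudo systemctl restart sshd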
Closing Thoughts
At this point, the cluster was stable enough for what I wanted to test.
I initially planned to go one step further and add a proper license server using FlexLM, similar to how it’s done in real EDA environments. The idea was to tie license availability into scheduling and see how LSF behaves under those constraints.
But I decided to stop here.
The goal of this setup was to understand:
- how LSF components talk to each other
- how submit hosts differ from execution hosts
- and what minimum configuration is actually required for things to work
That part was done.
License servers can come later as a separate iteration.