Building a SLURM Cluster on Ubuntu
### How to Set Up a SLURM Cluster on Ubuntu
#### Install Dependencies
Before installing and configuring SLURM, update the package lists and install the required dependencies on every node (the control node and all compute nodes):
```bash
sudo apt update && sudo apt upgrade -y
sudo apt install wget munge libmunge-dev gcc make perl python3 tmux vim git -y
```
#### Download and Build from Source
Download the SLURM source code (release 21.08.7 in this example) and build it following the official instructions. If SLURM is not being installed to its default location, specify the actual installation prefix[^1]:
```bash
wget https://download.schedmd.com/slurm/slurm-21.08.7.tar.bz2
tar xf slurm-*.tar.bz2
cd slurm-*/
./configure --prefix=/opt/slurm --sysconfdir=/etc/slurm
make -j$(nproc)
sudo make install
```
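A plain `make install` does not necessarily install systemd unit files or put the Slurm tools on your PATH. The sketch below assumes the usual source layout, in which `./configure` generates `slurmctld.service` and `slurmd.service` under the build tree's `etc/` directory; adjust the paths if your tree differs:
```bash
# Install the generated unit files (run from the top of the build tree).
sudo cp etc/slurmctld.service etc/slurmd.service /etc/systemd/system/
sudo systemctl daemon-reload

# Make the Slurm binaries installed under /opt/slurm visible to all users.
echo 'export PATH=/opt/slurm/bin:/opt/slurm/sbin:$PATH' | sudo tee /etc/profile.d/slurm.sh
```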
#### Set Up the MUNGE Authentication Service
Make sure the `munge` system user exists (the Ubuntu `munge` package normally creates it; create it manually only if it is missing):
```bash
sudo adduser --system --group --no-create-home munge
```
Create the key file and set its permissions so that only root and the `munge` user can access it:
```bash
sudo mkdir -p /etc/munge
sudo chown munge:munge /etc/munge/
sudo chmod 700 /etc/munge/
# Generate key and copy it across all nodes.
sudo dd if=/dev/urandom bs=1 count=1024 > ~/munge.key
sudo mv ~/munge.key /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
```
Start and enable the MUNGE service (the unit installed by the Ubuntu package is `munge.service`):
```bash
sudo systemctl enable munge
sudo systemctl start munge
```
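The same `munge.key` must be present on every node. Below is a minimal sketch of distributing the key and checking that credentials decode across nodes; the hostname `compute-node` and the SSH user are placeholders for your environment:
```bash
# Copy the key to each compute node and fix ownership/permissions there.
sudo cat /etc/munge/munge.key | ssh user@compute-node "sudo tee /etc/munge/munge.key > /dev/null"
ssh user@compute-node "sudo chown munge:munge /etc/munge/munge.key && sudo chmod 400 /etc/munge/munge.key && sudo systemctl restart munge"

# Local round-trip test: the credential should decode with STATUS: Success.
munge -n | unmunge
# Cross-node test: a credential created here must decode on the compute node.
munge -n | ssh user@compute-node unmunge
```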
#### Configure the SLURM Controller (Control Node)
Edit `/etc/slurm/slurm.conf` and add the following to describe the controller and the compute nodes:
```plaintext
#
# Example configuration for a small Slurm cluster with one controller
# and two compute nodes, each with a single 8-core socket (one thread per core).
ClusterName=clustername
SlurmctldHost=localhost
SlurmUser=slurm
AuthType=auth/munge
SlurmctldPort=6817
SlurmdPort=6818
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity
ReturnToService=2
MpiDefault=none
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
JobAcctGatherType=jobacct_gather/none
AccountingStorageType=accounting_storage/none
NodeName=node[01-02] RealMemory=8192 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=node[01-02] Default=YES MaxTime=INFINITE State=UP
```
Replace ClusterName, SlurmctldHost, the node names, memory sizes, and any other environment-specific values in the configuration above to match your actual cluster; the same slurm.conf should be present, and identical, on the controller and on every compute node.
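Before starting the daemons, the `SlurmUser` account, the spool and log directories referenced above, and a `cgroup.conf` (required because the example uses `ProctrackType=proctrack/cgroup`) must exist. A minimal sketch, assuming `SlurmUser=slurm` and the paths from the example configuration:
```bash
# Create the account slurmctld runs as (skip if it already exists).
sudo useradd --system --no-create-home --shell /usr/sbin/nologin slurm

# State, spool, and log directories from slurm.conf.
sudo mkdir -p /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
sudo chown slurm:slurm /var/spool/slurmctld /var/log/slurm

# Minimal cgroup.conf next to slurm.conf.
cat <<'EOF' | sudo tee /etc/slurm/cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
EOF
```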
#### Start the SLURM Controller Daemon
Enable and start the `slurmctld` service on the control node, then confirm it is reachable:
```bash
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
# Query the running controller to confirm it picked up the configuration.
scontrol show config | grep ^SlurmctldHost
```
#### Deploy the SLURM Stack on the Compute Nodes
Copy the Slurm binaries and configuration files from the control node to every compute node, and keep the clocks on all nodes synchronized so that subsequent steps (MUNGE authentication in particular) succeed:
```bash
rsync -avz --delete /opt/slurm/ user@compute-node:/opt/slurm/
rsync -avz --delete /etc/slurm/ user@compute-node:/etc/slurm/
ssh user@compute-node sudo systemctl restart munge
```
Finally, enable and start `slurmd` on each compute node so that it registers itself with the controller:
```bash
sudo systemctl enable slurmd
sudo systemctl start slurmd
# For one-off debugging, slurmd can instead be run in the foreground:
#   sudo slurmd -D -N $(hostname)
```
After completing the steps above, you should have a basic working small SLURM cluster; verify that it functions correctly by submitting a test job.
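For example, assuming the Slurm binaries are on PATH and the two nodes from the sample configuration are up, a quick functional check might look like this:
```bash
# The nodes should show up as idle in the debug partition.
sinfo

# Run a trivial command on both nodes.
srun -N2 -l hostname

# Submit a small batch job and watch the queue.
sbatch --wrap="sleep 30 && hostname"
squeue
```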