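There is plenty of material online about this topic, so here I only record the problems I ran into and how I solved them. I used Ubuntu 22.04 and followed the Zhihu article "Ubuntu服务器安装配置slurm" (installing and configuring Slurm on an Ubuntu server). The installation itself went smoothly, and the main steps are copied below. Errors still showed up once I started using Slurm, though; for those, see the article "Local SLURM cluster setup".
That article also lists the installation steps, and its step 8 makes one important point: many people, like me, may not have Cgroup available, in which case select LinuxProc instead. The relevant passage reads: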
Fill in the text fields according to the requirements and click the submit button. Please note in the ProcessTracking section, there is an option Cgroup (which stands for control groups). Control groups are a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes. However, the control groups feature was not set up on the system I used. Instead, I had to select LinuxProc.
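To see what the kernel exposes before choosing between Cgroup and LinuxProc, a quick check along these lines may help. This is only a sketch: the function name `detect_cgroup` is made up for illustration, and the check looks at the cgroup mount point only, so it says nothing about whether Slurm's own cgroup plugin is configured. If you end up selecting LinuxProc in the configurator, the corresponding line in slurm.conf is `ProctrackType=proctrack/linuxproc`.

```shell
# Report which cgroup layout is visible under a mount point.
# $1: cgroup mount point (defaults to /sys/fs/cgroup).
# detect_cgroup is a hypothetical helper name, not part of Slurm.
detect_cgroup() {
    root="${1:-/sys/fs/cgroup}"
    if [ -f "$root/cgroup.controllers" ]; then
        # The unified (v2) hierarchy has a cgroup.controllers file at its root.
        echo "v2"
    elif [ -d "$root/memory" ]; then
        # The legacy (v1) hierarchy has one directory per controller.
        echo "v1"
    else
        echo "none"
    fi
}

# Prints v2, v1, or none for the current machine.
detect_cgroup
```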
$ sudo apt update
$ sudo apt install slurm-wlm
# `slurmd`: compute node daemon
$ sudo apt install slurmd
# `slurmctld`: central management daemon
$ sudo apt install slurmctld
# Run the following commands
$ dpkg -L slurmctld | grep slurm-wlm-configurator.html
/usr/share/doc/slurmctld/slurm-wlm-configurator.html
$ cd /usr/share/doc/slurmctld
$ chmod +r slurm-wlm-configurator.html
$ python3 -m http.server
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
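Open a browser and go to http://<your_ip>:8000/ to reach the configuration page (see the figure below). Click slurm-wlm-configurator.html and fill in the settings according to your needs.
Figure: generating slurm.conf from the web page
When you have filled everything in, click submit and copy the generated content into /etc/slurm/slurm.conf (the Slurm configuration file).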
# Create the file
$ sudo touch /etc/slurm/slurm.conf
# Paste in the content generated by the web page
$ sudo vim /etc/slurm/slurm.conf
# ctrl + v
$ sudo mkdir /var/spool/slurmd
$ sudo mkdir /var/spool/slurmctld
# Start slurmd; its log file is `/var/log/slurmd.log`
$ sudo systemctl start slurmd
# Start slurmctld; its log file is `/var/log/slurmctld.log`
$ sudo systemctl start slurmctld
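The spool directories created by hand above presumably have to match the `SlurmdSpoolDir` and `StateSaveLocation` entries in the generated slurm.conf, otherwise the daemons look somewhere else. A small sketch for reading those values straight from the file is shown here. The helper name `conf_value` is hypothetical, and it assumes the keys are written exactly as the configurator emits them, one `KEY=VALUE` per line.

```shell
# Print the value of KEY from a slurm.conf-style file (lines of the form KEY=VALUE).
# $1: key name, $2: path to the config file.
# conf_value is a hypothetical helper name, not a Slurm command.
conf_value() {
    # Keep only the text between "KEY=" and the first whitespace or '#'.
    sed -n "s/^$1=\([^[:space:]#]*\).*/\1/p" "$2" | head -n 1
}

# Usage on the real file (commented out because it needs root and an existing slurm.conf):
# sudo mkdir -p "$(conf_value SlurmdSpoolDir /etc/slurm/slurm.conf)"
# sudo mkdir -p "$(conf_value StateSaveLocation /etc/slurm/slurm.conf)"
```

Creating the directories from the configured values avoids a mismatch between what was typed into the web form and what was typed into `mkdir`.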
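If Slurm does not work properly after starting, first check the status of slurmd and slurmctld, then open the logs and look for errors.
# Check the status of slurmd
$ sudo systemctl status slurmd
# Check the status of slurmctld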
$ sudo systemctl status slurmctld
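When the status output is not enough, the log files mentioned earlier (`/var/log/slurmd.log` and `/var/log/slurmctld.log`) are the next place to look. The following sketch pulls out the most recent lines containing "error" or "fatal". The helper name `recent_errors` is made up for illustration, and the keyword filter is an assumption about what is worth reading first, not a complete list of what Slurm may log.

```shell
# Show the most recent log lines that mention "error" or "fatal" (case-insensitive).
# $1: log file, $2: maximum number of lines to show (defaults to 20).
# recent_errors is a hypothetical helper name, not a Slurm command.
recent_errors() {
    grep -iE 'error|fatal' "$1" | tail -n "${2:-20}"
}

# Usage (commented out because the log files exist only on a machine running Slurm):
# recent_errors /var/log/slurmd.log
# recent_errors /var/log/slurmctld.log 50
```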
Cluster name: anything you like.
Hostname of the management node:
# Get the hostname
$ hostname -s
mu01
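Preferably set `SlurmUser=root`: root has the highest privileges, so writing the log files will not fail because of permission problems.
The example here is a single-node cluster (one node acts as both the management node and the compute node).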
EnforcePartLimits=ALL
NodeName=mu01 CPUs=36 State=UNKNOWN # this line can be obtained from `slurmd -C`
PartitionName=compute Nodes=mu01 Default=YES MaxTime=INFINITE State=UP # creates a partition (queue) named compute
Output of `slurmd -C`:
$ slurmd -C
NodeName=mu01 CPUs=36 Boards=1 SocketsPerBoard=1 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=63962
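Turning that output into the `NodeName=...` line of slurm.conf can be scripted instead of copied by hand. The sketch below keeps the line starting with `NodeName=` and appends `State=UNKNOWN`, as in the example config above. The helper name `node_line` is hypothetical, and it assumes the hardware description sits on a single line beginning with `NodeName=`.

```shell
# Read `slurmd -C` output on stdin and print a node definition for slurm.conf.
# Keeps the first line starting with "NodeName=" and appends State=UNKNOWN.
# node_line is a hypothetical helper name, not a Slurm command.
node_line() {
    grep '^NodeName=' | head -n 1 | sed 's/$/ State=UNKNOWN/'
}

# Usage (commented out because it needs slurmd installed):
# slurmd -C | node_line
```

The result can be pasted over the hand-written `NodeName` line; trimming fields such as `Boards` or `RealMemory` afterwards is a matter of preference.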