# Install with Docker
sudo docker run -d --name=dcgm-exporter --restart=always --gpus all -p 9400:9400 docker.tupu.ai/nvidia/k8s/dcgm-exporter:3.1.3-3.1.2-ubuntu20.04
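Once the container is up, the exporter serves Prometheus text on the mapped port 9400. The helper below is a minimal sketch for pulling the values of one metric out of that output. `metric_values` is a hypothetical name, and `DCGM_FI_DEV_GPU_UTIL` is assumed to be enabled in the counters CSV in use.

```shell
# metric_values NAME: read Prometheus text format on stdin and print the
# value (last field) of every sample line whose metric name equals NAME.
metric_values() {
    awk -v name="$1" '
        /^#/ { next }                  # skip "# HELP" and "# TYPE" lines
        {
            metric = $1
            sub(/[{].*/, "", metric)   # drop the {label="..."} part
            if (metric == name) print $NF
        }'
}

# Usage, assuming the container above is running on this host:
#   curl -s http://localhost:9400/metrics | metric_values DCGM_FI_DEV_GPU_UTIL
```

An empty result usually means either the exporter is not reachable or the metric is not listed in the CSV it was started with.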
# Reinstall an older version of docker-ce
sudo yum remove docker-ce containerd.io
sudo /usr/local/proxychains-ng-master/bin/proxychains4 yum install -y yum-utils
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo  # official repo
sudo yum repolist -v
sudo /usr/local/proxychains-ng-master/bin/proxychains4 yum install -y https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.4.3-3.1.el7.x86_64.rpm
sudo /usr/local/proxychains-ng-master/bin/proxychains4 yum install -y docker-ce-19.03.1-3.el7.x86_64
sudo systemctl --now enable docker
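To pin a different older release, Docker's install guide describes listing the available builds and forming the package name as `docker-ce-` plus the version string between the first colon and the first hyphen. The sketch below encodes that rule. `docker_pkg_spec` is a hypothetical helper, and the sample version string is only an illustration of the format.

```shell
# docker_pkg_spec VERSION_STRING: turn the second column of
#   yum list docker-ce --showduplicates | sort -r
# (for example "3:19.03.1-3.el7") into a package spec such as "docker-ce-19.03.1".
docker_pkg_spec() {
    v=${1#*:}      # strip the epoch, everything up to the first colon
    v=${v%%-*}     # strip the release, everything from the first hyphen
    echo "docker-ce-$v"
}

# Usage:
#   sudo yum install -y "$(docker_pkg_spec 3:19.03.1-3.el7)"
```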
# Install nvidia-docker2
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum clean expire-cache
sudo yum install -y nvidia-docker2
sudo systemctl restart docker
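The `distribution=...` line above sources /etc/os-release and joins `ID` and `VERSION_ID`, which selects the matching repo path on nvidia.github.io. A small sketch to check what it resolves to before running curl. `distribution_id` is a hypothetical helper name.

```shell
# distribution_id [FILE]: print ID followed by VERSION_ID from an os-release
# file (default /etc/os-release), for example "centos7".
# The body runs in a subshell so the sourced variables do not leak out.
distribution_id() (
    . "${1:-/etc/os-release}"
    echo "$ID$VERSION_ID"
)

# Usage:
#   distribution_id
```

If the printed value has no matching directory under the libnvidia-container repo URL, the curl step writes an error page into the .repo file instead of a repo definition, so it is worth checking first.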
# Build from source
# Install Go:
# Official site:
wget https://go.dev/dl/go1.19.5.linux-amd64.tar.gz
# Internal mirror:
wget http://xxx/pkg/go1.19.5.linux-amd64.tar.gz
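The notes stop at downloading the tarball. Go's own install instructions continue by unpacking it under /usr/local and adding /usr/local/go/bin to PATH. A minimal sketch of those two steps follows. `install_go` is a hypothetical helper, and the optional prefix argument is an addition for testing on a scratch directory.

```shell
# install_go TARBALL [PREFIX]: unpack a Go release tarball under PREFIX
# (default /usr/local) and print the PATH line to add to the shell profile.
install_go() {
    tarball=$1
    prefix=${2:-/usr/local}
    # Remove any previous tree first so files from an old release do not linger.
    rm -rf "$prefix/go"
    tar -C "$prefix" -xzf "$tarball" || return 1
    echo "export PATH=\$PATH:$prefix/go/bin"
}

# Usage (run as root when the prefix is /usr/local):
#   install_go go1.19.5.linux-amd64.tar.gz >> /etc/profile
#   . /etc/profile && go version
```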
# Add the DCGM yum repo
yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
# Install DCGM: download the packages locally, then install them
yum install -y datacenter-gpu-manager --downloadonly --downloaddir=dcgmdir
yum localinstall -y dcgmdir/*.rpm
# Internal mirror:
wget http://xxx/pkg/datacenter-gpu-manager-3.1.6-1-x86_64.rpm
yum localinstall *.rpm -y
# systemctl enable dcgm.service
# systemctl start dcgm.service
# Fetch the dcgm-exporter source
wget http://xxx/dcgm-exporter3.1.3-3.1.2.tar.gz
tar -xf dcgm-exporter3.1.3-3.1.2.tar.gz && cd dcgm-exporter3.1.3-3.1.2
make binary
make install
dcgm-exporter &
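Starting the exporter with a bare `&` gives no signal about whether it actually came up. The sketch below is one way to wait for it, assuming the exporter listens on port 9400 as in the Docker command earlier. `wait_until` is a hypothetical helper, and the log path in the usage comment is arbitrary.

```shell
# wait_until TRIES DELAY CMD...: rerun CMD until it succeeds, at most TRIES
# times, sleeping DELAY seconds after each failed attempt.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        "$@" >/dev/null 2>&1 && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage:
#   nohup dcgm-exporter > /tmp/dcgm-exporter.log 2>&1 &
#   wait_until 10 1 curl -sf http://localhost:9400/metrics && echo ready
```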
# If go mod downloads time out:
go env -w GOPROXY=https://goproxy.cn,direct
# If the ld version is too old:
yum -y install binutils
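Once built, the dcgm-exporter binary can be copied to other machines and used there directly, but with two prerequisites:
Install datacenter-gpu-manager-3.1.6-1-x86_64.rpm on the target machine
Package the csv files under /etc/dcgm-exporter/ and copy them over as well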
$ ls /etc/dcgm-exporter/
dcgm-exporter-conf.tar.gz dcp-metrics-included.csv default-counters.csv
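The listing above shows the CSV files together with an archive of them. One way such an archive could be produced is sketched here. `pack_exporter_conf` is a hypothetical helper, and the default output name simply mirrors the file in the listing.

```shell
# pack_exporter_conf [CONFDIR] [OUT]: archive the *.csv counter files in
# CONFDIR (default /etc/dcgm-exporter) into OUT.
pack_exporter_conf() {
    confdir=${1:-/etc/dcgm-exporter}
    out=${2:-dcgm-exporter-conf.tar.gz}
    # cd first so the archive holds bare file names rather than absolute paths.
    (cd "$confdir" && tar -czf - ./*.csv) > "$out"
}

# Usage on the build machine:
#   pack_exporter_conf /etc/dcgm-exporter /tmp/dcgm-exporter-conf.tar.gz
# On the target machine, after copying the archive and the binary:
#   mkdir -p /etc/dcgm-exporter
#   tar -xzf dcgm-exporter-conf.tar.gz -C /etc/dcgm-exporter
```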
References:
NVIDIA DCGM Exporter Dashboard | Grafana Labs
https://github.com/NVIDIA/dcgm-exporter#building-from-source
https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html