Configuring the NVIDIA driver on Kubernetes nodes

First, identify which nodes need the NVIDIA driver installed.

Check the VGA controller model:

lspci -vnn | grep VGA

Look up the hexadecimal vendor:device ID from the output in the PCI ID database (the page below lists devices under NVIDIA's vendor ID, 10de):

https://admin.pci-ids.ucw.cz/mods/PC/10de
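The ID the lookup site needs is the `[vendor:device]` pair in the lspci output. A minimal sketch of extracting it; the sample line below is illustrative, not from a real node:

```shell
# Illustrative `lspci -vnn` output line (hypothetical device)
line='01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)'

# Pull out the hex vendor:device pair from the bracketed ID
id=$(echo "$line" | grep -o '\[[0-9a-f]\{4\}:[0-9a-f]\{4\}\]' | tr -d '[]')
echo "$id"    # prints 10de:2204 for this sample
```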

Remove any existing NVIDIA driver

apt remove --purge 'nvidia-*'

Disable nouveau and install the binary driver downloaded from NVIDIA's website

NVIDIA_DRIVER_VERSION=    # set to the version of the .run file you downloaded
cp /etc/modprobe.d/blacklist.conf /etc/modprobe.d/blacklist.conf.bak
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist lbm-nouveau" >> /etc/modprobe.d/blacklist.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist.conf
echo "alias nouveau off" >> /etc/modprobe.d/blacklist.conf
echo "alias lbm-nouveau off" >> /etc/modprobe.d/blacklist.conf

echo options nouveau modeset=0 | tee -a /etc/modprobe.d/nouveau-kms.conf
update-initramfs -u
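The `echo … >>` lines above append unconditionally, so re-running the script duplicates entries in the blacklist files. A hedged sketch of an idempotent variant, demonstrated on a temporary file rather than the real /etc/modprobe.d path:

```shell
# Append a line to a file only if it is not already present verbatim
append_once() {
    grep -qxF "$1" "$2" 2>/dev/null || echo "$1" >> "$2"
}

conf=$(mktemp)    # stand-in for /etc/modprobe.d/blacklist.conf
append_once "blacklist nouveau" "$conf"
append_once "blacklist nouveau" "$conf"    # second call is a no-op
grep -c "blacklist nouveau" "$conf"        # prints 1, not 2
```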

service lightdm stop    # stop the display manager so X is not running
init 3                  # drop to a text-only runlevel
chmod 755 NVIDIA-Linux-x86_64-${NVIDIA_DRIVER_VERSION}.run
./NVIDIA-Linux-x86_64-${NVIDIA_DRIVER_VERSION}.run --no-x-check --no-nouveau-check
reboot

Install the NVIDIA Container Toolkit (this assumes NVIDIA's apt repository has already been added on the node):

apt install -y nvidia-container-toolkit

If the container runtime is Docker:

nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi
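For reference, `nvidia-ctk runtime configure --runtime=docker` edits /etc/docker/daemon.json; the registered runtime should end up looking roughly like this (the binary path can differ by install):

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```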

If the container runtime is containerd:

mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/g' /etc/containerd/config.toml
# For reference, /etc/containerd/config.toml should end up with a section like the following (shown here in the legacy v1 runtime form):
# [plugins."io.containerd.grpc.v1.cri"]
#  [plugins."io.containerd.grpc.v1.cri".containerd]
#    default_runtime_name = "nvidia-container-runtime"

#  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
#    runtime_type = "io.containerd.runc.v2"

#  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
#    runtime_type = "io.containerd.runtime.v1.linux"
#    runtime_engine = "/usr/bin/nvidia-container-runtime"


nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd
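On current containerd versions, `nvidia-ctk` writes a runc-v2-style runtime entry rather than the legacy v1 form shown in the comments above; the resulting section should look roughly like:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
```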

If the cluster runs k3s:

nvidia-ctk runtime configure --runtime=containerd --set-as-default --config /var/lib/rancher/k3s/agent/etc/containerd/config.toml
# k3s regenerates config.toml on restart, so save the result as config.toml.tmpl to make it persistent
sudo cp /var/lib/rancher/k3s/agent/etc/containerd/config.toml /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl

sudo systemctl restart k3s
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
kubectl get pods -n kube-system | grep nvidia-device-plugin
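Once the device plugin pod is running, the node should advertise nvidia.com/gpu among its allocatable resources (visible via `kubectl get node <node> -o json`). A minimal sketch of pulling the count out of that JSON, run here against a hypothetical sample so it works without a cluster:

```shell
# Hypothetical fragment of `kubectl get node <node> -o json`
node_json='{"status":{"allocatable":{"cpu":"8","memory":"32Gi","nvidia.com/gpu":"1"}}}'

# Extract the advertised GPU count
gpus=$(echo "$node_json" | sed -n 's/.*"nvidia.com\/gpu":"\([0-9]*\)".*/\1/p')
echo "$gpus"    # prints 1 for this sample
```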

Pod manifest format for requesting a GPU from Kubernetes

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.0.0-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

Label the GPU nodes

kubectl label nodes <node-name> nvidia.com/gpu=true

Install gpu-operator (this assumes the NVIDIA Helm repository has been added, e.g. with helm repo add nvidia https://helm.ngc.nvidia.com/nvidia):

helm install gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace

After the Helm install, export the generated ClusterPolicy for inspection (ClusterPolicy is cluster-scoped, so the namespace flag is not needed):

kubectl get clusterpolicies.nvidia.com cluster-policy -o yaml > cluster-policy.yaml

Create a ConfigMap that enables GPU time-slicing (it must live in the gpu-operator namespace):

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-all
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

Then point the device plugin at it by patching the ClusterPolicy:

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-all", "default": "any"}}}}'
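With replicas: 4, each physical GPU is advertised as four schedulable nvidia.com/gpu resources; the pods still share the same hardware, time-sliced. The advertised count is simply:

```shell
physical_gpus=1   # GPUs on the node (example value)
replicas=4        # from the time-slicing config above
echo $((physical_gpus * replicas))    # node now advertises 4 nvidia.com/gpu
```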