# Volcano vgpu device plugin for Kubernetes Example
## Prerequisites
1. The NVIDIA GPU driver has been successfully installed.
2. The NVIDIA Container Toolkit has been installed, and `default-runtime` is set to `nvidia` in `/etc/docker/daemon.json` (remember to restart the docker service after the change).
3. Kubernetes has been properly installed and is functioning normally.
## Volcano Installation
1. Make sure the Volcano version is v1.9.0 or later.
2. You can follow the Volcano installation documentation: https://volcano.sh/en/docs/v1-9-0/installation/
```bash
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano --version 1.9.0 -n volcano-system --create-namespace
```
3. Check that all pods are in the Running state.
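For example:

```bash
kubectl get pods -n volcano-system
```

All pods in the `volcano-system` namespace should report `Running`.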
## Volcano-vgpu-device-plugin Installation
1. You can follow the volcano-vgpu-device-plugin installation documentation: https://github.com/Project-HAMi/volcano-vgpu-device-plugin?tab=readme-ov-file#enabling-gpu-support-in-kubernetes
Enable vgpu support in the Volcano scheduler by editing its configmap:

```bash
kubectl edit cm -n volcano-system volcano-scheduler-configmap
```
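The key change is enabling vgpu support in the `deviceshare` plugin. A sketch of what the relevant part of `volcano-scheduler.conf` should look like after the edit (the plugin list follows the upstream README and may differ in your installation):

```yaml
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: deviceshare
        arguments:
          deviceshare.VGPUEnable: true # enable vgpu support
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
```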
Save the following manifest to a local file named `volcano-vgpu-device-plugin.yml`:

```yaml
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: volcano-device-plugin
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: volcano-device-plugin
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["nodes/status"]
    verbs: ["patch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "update", "patch", "watch"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: volcano-device-plugin
subjects:
  - kind: ServiceAccount
    name: volcano-device-plugin
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: volcano-device-plugin
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: volcano-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: volcano-device-plugin
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: volcano-device-plugin
    spec:
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility.
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: volcano.sh/gpu-memory
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      serviceAccount: volcano-device-plugin
      containers:
        - name: volcano-device-plugin
          image: docker.io/projecthami/volcano-vgpu-device-plugin:v1.9.4
          args: ["--device-split-count=10"]
          lifecycle:
            postStart:
              exec:
                command: ["/bin/sh", "-c", "cp -f /k8s-vgpu/lib/nvidia/* /usr/local/vgpu/"]
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: HOOK_PATH
              value: "/usr/local/vgpu"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
              add: ["SYS_ADMIN"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: lib
              mountPath: /usr/local/vgpu
            - name: hosttmp
              mountPath: /tmp
        - name: monitor
          image: docker.io/projecthami/volcano-vgpu-device-plugin:v1.9.4
          command:
            - /bin/bash
            - -c
            - volcano-vgpu-monitor
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: "all"
            - name: HOOK_PATH
              value: "/tmp/vgpu"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
              add: ["SYS_ADMIN"]
          volumeMounts:
            - name: dockers
              mountPath: /run/docker
            - name: containerds
              mountPath: /run/containerd
            - name: sysinfo
              mountPath: /sysinfo
            - name: hostvar
              mountPath: /hostvar
            - name: hosttmp
              mountPath: /tmp
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
            type: Directory
        - name: lib
          hostPath:
            path: /usr/local/vgpu
            type: DirectoryOrCreate
        - name: hosttmp
          hostPath:
            path: /tmp
            type: DirectoryOrCreate
        - name: dockers
          hostPath:
            path: /run/docker
            type: DirectoryOrCreate
        - name: containerds
          hostPath:
            path: /run/containerd
            type: DirectoryOrCreate
        - name: usrbin
          hostPath:
            path: /usr/bin
            type: Directory
        - name: sysinfo
          hostPath:
            path: /sys
            type: Directory
        - name: hostvar
          hostPath:
            path: /var
            type: Directory
```
Then apply it:

```bash
kubectl create -f volcano-vgpu-device-plugin.yml
```
2. Check that the volcano-device-plugin pods are in the Running state.
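For example (the DaemonSet above labels its pods with `name: volcano-device-plugin`):

```bash
kubectl get pods -n kube-system -l name=volcano-device-plugin
```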
3. Check the node status.
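Once the plugin has registered, the GPU node should advertise vgpu resources in its allocatable list. A quick way to check (the `volcano.sh/vgpu-*` resource names follow the upstream README; the reported number depends on `--device-split-count` and the number of physical GPUs):

```bash
kubectl describe node <your-gpu-node> | grep volcano.sh
```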
## Running VGPU Jobs
1. Run a demo vgpu job (the container spec below, including the image and the `volcano.sh/vgpu-*` resource limits, is an illustrative example based on the upstream README):

```bash
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
spec:
  schedulerName: volcano
  containers:
    - name: cuda-container                        # illustrative container name
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # any CUDA-capable image works
      command: ["sleep", "infinity"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1    # number of vgpus requested
          volcano.sh/vgpu-memory: 3000 # device memory limit in MiB
EOF
```
2. Check the pod status.
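For example:

```bash
kubectl get pod gpu-pod1
```

The pod should reach the `Running` state once the Volcano scheduler has allocated a vgpu for it.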
3. Run a single command to check that it is working.
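One way to verify from inside the pod (assuming the container image ships `nvidia-smi`; the memory reported should match the `volcano.sh/vgpu-memory` limit rather than the full device memory):

```bash
kubectl exec -it gpu-pod1 -- nvidia-smi
```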
## Monitor
1. You can access the metrics endpoint of the Volcano scheduler from inside the cluster. For example:

```bash
curl -vvv volcano-scheduler-service.volcano-system:8080/metrics
```
2. You can also change the Volcano scheduler service from `ClusterIP` to `NodePort`, which allows external access to the metrics endpoint.
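A minimal sketch, assuming the service name from the example above:

```bash
kubectl patch svc volcano-scheduler-service -n volcano-system -p '{"spec": {"type": "NodePort"}}'
```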