GPU Kubernetes Role
This document describes how to use the gpu-k8s role to deploy a simple Kubernetes cluster with NVIDIA GPU support.
Overview
The role performs four main tasks:
- Create the Kubernetes cluster using sealos. It runs the provided sealos run command to bootstrap the master and worker nodes.
- Install NVIDIA drivers and the NVIDIA container toolkit on the target hosts so that Kubernetes can access GPU resources.
- Verify the cluster state after initialization, displaying the sealos version and the current Kubernetes nodes.
- Verify GPU access by deploying the official NVIDIA device plugin and running a small CUDA workload.
When sealos_version is set to latest (the default), the role automatically
fetches the most recent stable release from GitHub. The Kubernetes image tag is
controlled separately via kubernetes_version, which defaults to v1.25.16 but
can be overridden to any compatible release.
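The version lookup described above can be sketched in shell. This is a minimal illustration, not the role's actual implementation: the function names parse_release_tag and resolve_sealos_version are hypothetical, and the sketch assumes the GitHub REST endpoint for the latest release of the labring/sealos repository, which returns the most recent release that is neither a draft nor a prerelease.

```shell
# Hypothetical helper: read a GitHub release JSON document on stdin
# and print the value of its "tag_name" field.
parse_release_tag() {
  sed -n 's/.*"tag_name"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: print the version to install.
# "latest" (or no argument) triggers a lookup on GitHub; anything else
# is treated as a pinned version and printed unchanged.
resolve_sealos_version() {
  requested="${1:-latest}"
  if [ "$requested" = "latest" ]; then
    # Assumption: the role queries this endpoint or an equivalent one.
    curl -fsSL https://api.github.com/repos/labring/sealos/releases/latest \
      | parse_release_tag
  else
    printf '%s\n' "$requested"
  fi
}
```

Pinning sealos_version to an explicit tag avoids the network lookup and keeps repeated runs reproducible.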
The following command is used to create the cluster (example with one master and one worker):
REGISTRY=$(playbooks/roles/vhosts/gpu-k8s/files/get_labring_registry.sh)
sealos run \
  ${REGISTRY}/kubernetes:<kubernetes_version> \
  ${REGISTRY}/cilium:<cilium_version> \
  ${REGISTRY}/helm:<helm_version> \
  --masters 172.16.11.120 \
  --nodes 172.16.11.152 \
  --env '{}' \
  --cmd "kubeadm init --skip-phases=addon/kube-proxy"
If deploying with a non-root user, the command also requires the --user option
and a --pk option pointing to that user's SSH private key. The host running
Sealos must have newuidmap and newgidmap installed (typically provided by the
uidmap package) along with the fuse-overlayfs binary to enable user namespaces.
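Before a non-root deployment, it can help to confirm that the required binaries are present on the host that runs Sealos. The following is a minimal sketch; the function names missing_tools and check_rootless_prereqs are hypothetical and are not part of the role.

```shell
# Print every name from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the three binaries the text lists for non-root deployments.
# Returns non-zero and reports the missing names on stderr if any are absent.
check_rootless_prereqs() {
  missing="$(missing_tools newuidmap newgidmap fuse-overlayfs)"
  if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "missing on this host:" $missing >&2
    return 1
  fi
}
```

Running check_rootless_prereqs on the ops host before the playbook gives an early, readable failure instead of an error in the middle of sealos run.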
After the cluster is running, the role installs the NVIDIA device plugin and runs a test pod to confirm that nvidia-smi works inside the cluster.
Usage
Add the role to your playbook along with the ssh-trust role which configures passwordless access from the ops host to the cluster nodes:
- hosts: all
  roles:
    - ssh-trust
    - gpu-k8s
By default, the SSH key is created for the user Ansible connects as:
ansible_user when it is defined, otherwise root. You can override this by
setting ssh_user. The role also lets you specify the private key path via
ssh_private_key:
- hosts: all
  vars:
    ssh_user: ubuntu
    ssh_private_key: /home/ubuntu/.ssh/myuser_id_rsa
  roles:
    - ssh-trust
    - gpu-k8s
The specified user must be able to log in without a password and have sudo access on the target hosts.
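The user selection and the access requirement can be checked by hand before running the playbook. This is a sketch under assumptions: resolve_ssh_user and check_node_access are hypothetical helpers that mirror the described behaviour (ssh_user first, then ansible_user, then root) and are not taken from the role.

```shell
# Print the effective SSH user: an explicit ssh_user wins,
# then ansible_user, then root as the final fallback.
resolve_ssh_user() {
  ssh_user="$1"; ansible_user="$2"
  printf '%s\n' "${ssh_user:-${ansible_user:-root}}"
}

# Succeed only if key-based login works without any prompt
# and sudo does not ask for a password on the target host.
check_node_access() {
  user="$1"; host="$2"; key="$3"
  ssh -i "$key" -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" 'sudo -n true'
}
```

For example, check_node_access ubuntu 172.16.11.152 /home/ubuntu/.ssh/myuser_id_rsa exits non-zero if either the passwordless login or the passwordless sudo requirement is not met.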
Example playbook snippet defining the IP lists:
- hosts: all
  vars:
    master_ips:
      - "172.16.11.120"
    node_ips:
      - "172.16.11.152"
  roles:
    - ssh-trust
    - gpu-k8s
You can also specify hostnames and let the role look up the IPs:
- hosts: all
  vars:
    masters:
      - "k8s-1"
    nodes:
      - "k8s-2"
      - "k8s-3"
  roles:
    - ssh-trust
    - gpu-k8s
The playbook expects at least one master and one node. You can provide the
addresses directly via master_ips and node_ips, or give hostnames in the
masters and nodes variables. When hostnames are used, the role will look up
their ansible_host values from the inventory to obtain the IPs. Up to three
masters can be specified.
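The limits above (one to three masters, at least one node) and the comma-separated form that sealos accepts for --masters and --nodes can be illustrated with a short shell sketch. The helper names join_ips and validate_topology are hypothetical; the role may enforce these limits differently.

```shell
# Join all arguments with commas, e.g. for --masters or --nodes.
join_ips() {
  out=""
  for ip in "$@"; do
    out="${out:+$out,}$ip"
  done
  printf '%s\n' "$out"
}

# Succeed only for 1 to 3 masters and at least 1 node.
validate_topology() {
  master_count="$1"; node_count="$2"
  [ "$master_count" -ge 1 ] && [ "$master_count" -le 3 ] && [ "$node_count" -ge 1 ]
}
```

With the addresses from the examples, join_ips 172.16.11.120 prints the value for --masters and join_ips 172.16.11.152 the value for --nodes.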
Run the playbook with an inventory that contains the master and node IP addresses:
ansible-playbook -i inventory/hosts/all playbooks/demo_gpu_k8s.yml
The final step prints the output of nvidia-smi from inside a Kubernetes pod, confirming that the GPU is available.
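To repeat the GPU check by hand, a pod that requests one GPU and runs nvidia-smi can be submitted with kubectl. The sketch below is an assumption about what such a test pod could look like, not the manifest the role uses: the function name gpu_test_manifest, the pod name gpu-smoke-test, and the default image tag are all illustrative.

```shell
# Print a Pod manifest that requests one GPU and runs nvidia-smi once.
# The pod name and the default image are assumptions for illustration;
# pass a different image as the first argument to override it.
gpu_test_manifest() {
  image="${1:-nvidia/cuda:12.4.1-base-ubuntu22.04}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: ${image}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Usage against a running cluster (not executed here):
#   gpu_test_manifest | kubectl apply -f -
#   kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-smoke-test --timeout=120s
#   kubectl logs gpu-smoke-test
```

The nvidia.com/gpu resource is only schedulable once the NVIDIA device plugin is running, so a pod that stays Pending usually points to a problem with the plugin or the driver installation.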