# GPU Kubernetes Role

This document describes how to use the `gpu-k8s` role to deploy a simple Kubernetes cluster with NVIDIA GPU support.

## Overview

The role performs four main tasks:

1. **Create the Kubernetes cluster** using [sealos](https://github.com/labring/sealos). It runs the provided `sealos run` command to bootstrap the master and worker nodes.
2. **Install NVIDIA drivers and the NVIDIA container toolkit** on the target hosts so that Kubernetes can access GPU resources.
3. **Verify the cluster state** after initialization, displaying the `sealos` version and the current Kubernetes nodes.
4. **Verify GPU access** by deploying the official NVIDIA device plugin and running a small CUDA workload.

When `sealos_version` is set to `latest` (the default), the role automatically
fetches the most recent stable release from GitHub. The Kubernetes image tag is
controlled separately via `kubernetes_version`, which defaults to `v1.25.16` but
can be overridden to any compatible release.
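The `latest` lookup can be pictured roughly as follows. This is a minimal sketch, not the role's actual task: it assumes the version is read from the `tag_name` field of GitHub's "latest release" REST endpoint for `labring/sealos`, and the helper names `extract_tag_name` and `resolve_sealos_version` are hypothetical.

```bash
# Hypothetical helper: read a GitHub release JSON document on stdin
# and print the value of its "tag_name" field.
extract_tag_name() {
  sed -n 's/.*"tag_name"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: pass an explicit version through unchanged and
# only query GitHub when the requested version is "latest".
resolve_sealos_version() {
  local requested="${1:-latest}"
  if [ "$requested" != "latest" ]; then
    printf '%s\n' "$requested"
    return 0
  fi
  # Assumption: the role uses the "latest release" endpoint, which skips
  # drafts and pre-releases.
  curl -fsSL https://api.github.com/repos/labring/sealos/releases/latest \
    | extract_tag_name
}
```

Calling `resolve_sealos_version latest` requires network access, while an explicit version string is returned as given, which is why pinning `sealos_version` may be preferable for reproducible deployments.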


The role creates the cluster with a command of the following form (shown here with one master and one worker; values in angle brackets are placeholders for the corresponding versions):

```bash
REGISTRY=$(playbooks/roles/vhosts/gpu-k8s/files/get_labring_registry.sh)
sealos run \
  ${REGISTRY}/kubernetes:<kubernetes_version> \
  ${REGISTRY}/cilium:<cilium_version> \
  ${REGISTRY}/helm:<helm_version> \
  --masters 172.16.11.120 \
  --nodes 172.16.11.152 \
  --env '{}' \
  --cmd "kubeadm init --skip-phases=addon/kube-proxy"
```

When deploying as a non-root user, the command also requires the `--user` and
`--pk` options, with `--pk` pointing to that user's SSH private key. The host
running Sealos must also have `newuidmap` and `newgidmap` (typically provided by
the `uidmap` package) and the `fuse-overlayfs` binary installed, which are
needed for user namespaces.
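Before a non-root deployment, it can help to confirm that these binaries are on the `PATH` of the host running Sealos. The following is a minimal sketch; the `missing_commands` helper is hypothetical and not part of the role.

```bash
# Hypothetical helper: print every command from the argument list
# that cannot be found on PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Check the three binaries named above and report the result.
missing="$(missing_commands newuidmap newgidmap fuse-overlayfs)"
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing rootless prerequisites:" $missing
else
  echo "All rootless prerequisites found"
fi
```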

After the cluster is running, the role installs the NVIDIA device plugin and runs a test pod to confirm that `nvidia-smi` works inside the cluster.
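To check by hand that the device plugin has registered the GPUs, you can inspect each node's allocatable `nvidia.com/gpu` resource. This is a sketch under the assumption that `kubectl` is configured for the new cluster; the function names `list_gpu_allocatable` and `total_gpus` are hypothetical and not provided by the role.

```bash
# Hypothetical helper: print "node<TAB>gpu-count" for every node.
# The count is empty for nodes that advertise no GPUs.
list_gpu_allocatable() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
}

# Hypothetical helper: sum the second column of the lines read on stdin,
# treating an empty count as zero.
total_gpus() {
  awk -F'\t' '{ sum += $2 } END { print sum + 0 }'
}
```

Running `list_gpu_allocatable | total_gpus` on the ops host prints the number of GPUs the scheduler can allocate; a result of `0` suggests that the drivers, the container toolkit, or the device plugin are not working yet.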

## Usage

Add the role to your playbook along with the `ssh-trust` role, which configures passwordless SSH access from the ops host to the cluster nodes:

```yaml
- hosts: all
  roles:
    - ssh-trust
    - gpu-k8s
```

By default, the SSH key is created for the user Ansible connects as:
`ansible_user` when it is defined, otherwise `root`. Set `ssh_user` to override
this user, and set `ssh_private_key` to specify the path to the private key:

```yaml
- hosts: all
  vars:
    ssh_user: ubuntu
    ssh_private_key: /home/ubuntu/.ssh/myuser_id_rsa
  roles:
    - ssh-trust
    - gpu-k8s
```

The specified user must be able to log in without a password and have sudo
access on the target hosts.


The following playbook snippet defines the master and node IP lists directly:

```yaml
- hosts: all
  vars:
    master_ips:
      - "172.16.11.120"
    node_ips:
      - "172.16.11.152"
  roles:
    - ssh-trust
    - gpu-k8s
```

You can also specify hostnames and let the role look up the IPs:

```yaml
- hosts: all
  vars:
    masters:
      - "k8s-1"
    nodes:
      - "k8s-2"
      - "k8s-3"
  roles:
    - ssh-trust
    - gpu-k8s
```

The playbook expects at least one master and one node. You can provide the
addresses directly via `master_ips` and `node_ips`, or give hostnames in the
`masters` and `nodes` variables. When hostnames are used, the role will look up
their `ansible_host` values from the inventory to obtain the IPs. Up to three
masters can be specified.
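If you use hostnames, you can preview the address the inventory records for a host before running the role. The sketch below relies on `ansible-inventory --host`, which prints a host's variables as JSON; the helper names `extract_ansible_host` and `lookup_host_ip` are hypothetical, and the role's own lookup may be implemented differently.

```bash
# Hypothetical helper: read host-variable JSON on stdin and print the
# value of "ansible_host".
extract_ansible_host() {
  sed -n 's/.*"ansible_host"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: print the address the inventory records for one host.
#   $1 = inventory path, $2 = inventory hostname
lookup_host_ip() {
  local inventory="$1" host="$2"
  ansible-inventory -i "$inventory" --host "$host" | extract_ansible_host
}
```

For example, `lookup_host_ip inventory/hosts/all k8s-1` prints the `ansible_host` value for `k8s-1`, or nothing if the host defines none.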


Run the playbook against an inventory that contains the master and node hosts:


```bash
ansible-playbook -i inventory/hosts/all playbooks/demo_gpu_k8s.yml
```

The final step prints the output of `nvidia-smi` from inside a Kubernetes pod, confirming that the GPU is available.