Add SSH trust role and enhance gpu-k8s setup

This commit is contained in:
shenlan 2025-06-25 21:17:30 +08:00
parent 524aaf6d0a
commit b6bbd93cea
6 changed files with 49 additions and 5 deletions

View File

@ -51,6 +51,7 @@ Additional documentation is stored under the `docs/` folder.
| `k3s-addon` | 用于安装 Kubernetes 集群插件。 | | | ✔ | | | |
| `secret-manger` | 密钥管理角色,用于管理密钥。 | | | ✔ | | | |
| `cert-manager` | 证书管理角色,用于管理证书。 | | | ✔ | | | |
| `ssh-trust` | 配置 ops 主机与节点的 SSH 互信。 | | | ✔ | | | |
表格说明
- Docker是否属于 Docker 角色。

View File

@ -4,11 +4,12 @@ This document describes how to use the `gpu-k8s` role to deploy a simple Kuberne
## Overview
The role performs three main tasks:
The role performs four main tasks:
1. **Create the Kubernetes cluster** using [sealos](https://github.com/labring/sealos). It runs the provided `sealos run` command to bootstrap the master and worker nodes.
2. **Install NVIDIA drivers and the NVIDIA container toolkit** on the target hosts so that Kubernetes can access GPU resources.
3. **Verify GPU access** by deploying the official NVIDIA device plugin and running a small CUDA workload.
3. **Verify the cluster state** after initialization, displaying the `sealos` version and the current Kubernetes nodes.
4. **Verify GPU access** by deploying the official NVIDIA device plugin and running a small CUDA workload.
The following command is used to create the cluster (example with one master and one worker):
@ -28,11 +29,12 @@ After the cluster is running the role installs the NVIDIA device plugin and runs
## Usage
Add the role to your playbook:
Add the role to your playbook along with the `ssh-trust` role which configures passwordless access from the ops host to the cluster nodes:
```yaml
- hosts: all
roles:
- ssh-trust
- gpu-k8s
```
@ -47,6 +49,7 @@ Example playbook snippet defining the IP lists:
node_ips:
- "172.16.11.152"
roles:
- ssh-trust
- gpu-k8s
```

View File

@ -8,5 +8,6 @@
- "k8s-2"
- "k8s-3"
roles:
- roles/vhosts/ssh-trust/
- roles/vhosts/gpu-k8s/
- roles/vhosts/common/

View File

@ -18,3 +18,22 @@
args:
executable: /bin/bash
when: inventory_hostname == (ops_host | default(master_ips | first))
- name: Show sealos CLI version
command: sealos version
register: sealos_cli_version
changed_when: false
when: inventory_hostname == (ops_host | default(master_ips | first))
- name: Display Kubernetes cluster status
shell: kubectl get nodes -o wide
args:
executable: /bin/bash
register: k8s_status
changed_when: false
when: inventory_hostname == (ops_host | default(master_ips | first))
- name: Print cluster nodes
debug:
msg: "{{ k8s_status.stdout }}"
when: k8s_status is defined and inventory_hostname == (ops_host | default(master_ips | first))

View File

@ -2,7 +2,7 @@
shell: kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
args:
executable: /bin/bash
when: inventory_hostname == (master_ips | first)
when: master_ips|length > 0 and inventory_hostname == (master_ips | first)
- name: Run CUDA validation pod
shell: |
@ -10,5 +10,5 @@
kubectl delete pod gpu-test --wait
args:
executable: /bin/bash
when: inventory_hostname == (master_ips | first)
when: master_ips|length > 0 and inventory_hostname == (master_ips | first)

View File

@ -0,0 +1,20 @@
- name: Ensure SSH key pair exists on ops host
shell: |
test -f ~/.ssh/id_rsa || ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N '' -q
args:
executable: /bin/bash
delegate_to: "{{ ops_host }}"
run_once: true
- name: Fetch ops host public key
slurp:
src: "~/.ssh/id_rsa.pub"
register: ops_pub_key
delegate_to: "{{ ops_host }}"
run_once: true
- name: Authorize ops host key on other nodes
ansible.builtin.authorized_key:
user: "{{ ansible_user | default('root') }}"
key: "{{ ops_pub_key.content | b64decode }}"
when: inventory_hostname != ops_host