The runtime plays set module_defaults.apt.lock_timeout to a templated value.
When a prerequisite task uses ansible.builtin.package (which dispatches to apt
on Debian), that templated default is NOT rendered and the literal
'{{ ai_workspace_apt_lock_timeout | default(900) | int }}' reaches apt ->
'lock_timeout is of type str ... cannot be converted to an int' -> the whole
on-host bootstrap aborts at the xworkmate-bridge prereq, before litellm/qmd
ever deploy (hence they were never up).
Fix: install prereqs via ansible.builtin.apt on Debian/Ubuntu (template renders
like every other apt task); keep ansible.builtin.package for non-Debian Linux
(dispatches to yum/dnf, which doesn't inherit the apt default).
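A minimal sketch of the split described above. The task names and the prereq_packages variable are hypothetical, not taken from the repository; only the module choice per OS family reflects the fix.

```yaml
# Sketch only: prereq_packages is a hypothetical list variable.
- name: Install prerequisites on Debian/Ubuntu via apt
  ansible.builtin.apt:
    name: "{{ prereq_packages }}"
    state: present
  # Calling apt directly lets the templated module default render as usual.
  when: ansible_facts['os_family'] == 'Debian'

- name: Install prerequisites on other Linux distributions
  ansible.builtin.package:
    name: "{{ prereq_packages }}"
    state: present
  # package dispatches to yum/dnf here, so the apt default never applies.
  when: ansible_facts['os_family'] != 'Debian'
```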
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| Name |
|---|
| agent_skills |
| ai_agent_runtime |
| azure_dev_desktop_lifecycle/tasks |
| charts |
| cloud_cli_prereqs |
| cloud_vm_inventory_emit/tasks |
| cloud_vm_request_validate/tasks |
| cloudflare_dns |
| cloudflare_svc_plus_dns |
| dev_desktop_common |
| dev_desktop_debian_kde/tasks |
| dev_desktop_fedora_gnome/tasks |
| dev_desktop_windows |
| docker |
| gcp_dev_desktop_lifecycle/tasks |
| github |
| grafana-dashboard |
| harden_ssh_root_key_only |
| readonly_ssh_user |
| vhosts |
| README.md |
Playbook roles planning
This document clarifies what should live under /playbooks/roles/ for host-level automation (Ansible) versus what should be delivered through Helm charts, and ensures we cover the five tiers across data platforms: data warehouse → big data → ML → DL → large models.
Scope rules
- Ansible roles: host-coupled configuration that is not itself a cloud resource (GPU driver/runtime, OS tuning, user/SSH prep, rendering on-host config files, database bootstrapping, etc.).
- Helm charts: anything that runs as a Kubernetes workload (operators, clusters, services running in pods).
Base roles shared across tiers (Ansible)
- GPU driver and CUDA stack installation.
- Docker/containerd runtime setup.
- System parameter tuning (kernel limits, hugepages, network stack), plus user home/SSH layout.
- Database initialization tasks (e.g., bootstrap PostgreSQL/ClickHouse on hosts) and rendering templated configs such as ClickHouse/users.xml.
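As a rough illustration of what such host-level tasks could look like, the sketch below tunes one kernel parameter and renders a ClickHouse users file. The hugepage count, the template name `users.xml.j2`, and the file owner and mode are assumptions for illustration, not values from this repository; `ansible.posix.sysctl` also assumes the `ansible.posix` collection is installed.

```yaml
# Sketch only: values, template name, and ownership are illustrative assumptions.
- name: Reserve hugepages on the host
  ansible.posix.sysctl:
    name: vm.nr_hugepages
    value: "1024"          # hypothetical value; size per workload
    state: present
    reload: true

- name: Render ClickHouse users.xml from a template
  ansible.builtin.template:
    src: users.xml.j2      # hypothetical template shipped with the role
    dest: /etc/clickhouse-server/users.xml
    owner: clickhouse
    group: clickhouse
    mode: "0640"
```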
Coverage by capability tier
| Tier | Host-focused roles (Ansible) | Kubernetes services (Helm) |
|---|---|---|
| Data warehouse | ClickHouse host bootstrap & config render; PostgreSQL init where needed. | — |
| Big data | JVM/runtime, local disks, and OS tuning for data nodes. | Spark Operator; Flink Operator; Kafka/Redpanda; MinIO. |
| ML | GPU runtime base (drivers, container runtime), Python ML base image prep; user workspace/SSH. | Ray Cluster; MLflow; JupyterHub. |
| DL | Same GPU/system tuning plus inference node bootstrap (TensorRT/cuDNN as needed). | Triton Inference Server; LMDeploy (for deployment runtimes). |
| Large models | Secure SSH/user profiles and config templating for model storage/IO. | vLLM serving; model-specific Helm releases atop Ray/K8s. |
Suggested role layout under /playbooks/roles/
- common/ (new): shared tasks for system tuning, users/SSH, and package repos for GPU/runtime support.
- gpu/: install GPU drivers + CUDA toolkit.
- container_runtime/: install and configure Docker/containerd with GPU runtime integration.
- database_init/: bootstrap on-host databases (e.g., PostgreSQL, ClickHouse), render config files (users.xml, etc.).
- bigdata_node_prep/: OS/disk tuning for Spark/Flink/Kafka/Redpanda/MinIO hosts.
- ml_node_prep/: Python/conda base, SSH workspace prep for ML workloads.
- dl_inference_node/: TensorRT/cuDNN dependencies and runtime checks for Triton/LMDeploy nodes.
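One way these roles might be composed in a playbook is sketched below. The role names come from the suggested layout; the inventory group names (`ml_nodes`, `warehouse_nodes`) and play names are hypothetical.

```yaml
# Sketch only: inventory groups are hypothetical; roles follow the suggested layout.
- name: Prepare GPU hosts for ML workloads
  hosts: ml_nodes
  become: true
  roles:
    - common
    - gpu
    - container_runtime
    - ml_node_prep

- name: Bootstrap data warehouse hosts
  hosts: warehouse_nodes
  become: true
  roles:
    - common
    - database_init
```

Listing `common` first in each play keeps the shared tuning and user/SSH setup ahead of the tier-specific roles.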
Helm-delivered components should live under playbooks/roles/charts/ or the repo’s Helm release structure and include Spark/Flink Operators, Kafka/Redpanda/MinIO, Ray Cluster, Triton, vLLM/LMDeploy, MLflow, and JupyterHub.