3. Current Situation Analysis: Role Hierarchy
setup-ai-workspace-all-in-one.sh (located in the console repository) runs setup-ai-workspace-all-in-one.yml (located in the playbooks repository) after bootstrapping on the target host. Its role hierarchy in import order is as follows:
    setup-ai-workspace-all-in-one.sh   [repo: xworkspace-console/scripts]
    └─ ansible-playbook setup-ai-workspace-all-in-one.yml   [repo: playbooks]
       ├─1  setup-nodejs.yml → role roles/vhosts/nodejs   NodeJS(22.x)+yarn
       ├─2  setup-xworkspace-console.yaml   WORKSPACE PORTAL/CONSOLE (inline task, no role)
       │      apt: caddy,xfce4,python3,golang-go,google-chrome-stable,ttyd
       │      git clone console → npm build; systemd --user: console(:17000)/api(:8788)/ttyd(:7681)/status.timer
       │      Caddy public site workspace.svc.plus   ⚠ also public in standard mode
       ├─3  setup-ai-agent-skills.yml → role roles/ai_agent_runtime   AI WORKSPACE RUNTIME Core
       │      NodeJS(24.x)+Playwright; Agent CLI: opencode/gemini/codex/claude; Python/browser/docs/fonts
       │      └─ role agent_skills → inject xworkspace-core-skills market skills
       ├─4  deploy_gateway_openclaw.yml → role roles/vhosts/gateway_openclaw   OpenClaw(2026.5.28)
       ├─5  deploy_xworkmate_bridge_vhosts.yml   BRIDGE + ACP Cluster
       │      ├─ import setup-xworkspace-console.yaml (run again with bridge variables)
       │      └─ roles: acp_server_codex / acp_server_opencode / acp_server_gemini /
       │                acp_server_hermes / xworkmate_bridge(:8787 local, public Caddy)
       │         domain defaults to xworkmate-bridge.svc.plus → acp-bridge.onwalk.net
       ├─6  setup-vault.yaml → role roles/vhosts/vault   Vault(1.20.4) :8200
       ├─7  setup-postgres-standalone.yaml → role roles/vhosts/postgres (dep: common)   native apt PG17 :5432
       ├─8  setup-litellm.yaml → role roles/vhosts/litellm   pip install :4000
       ├─9  deploy_QMD.yml → role roles/vhosts/qmd   bun qmd, MCP :8181
       ├─10 deploy_agent_hermes.yml → role roles/vhosts/acp_server_hermes   ⚠ Hermes duplicate deployment (overlaps with step 5)
       └─11 setup-xfce-xrdp.yaml [Optional] → role roles/vhosts/xfce_xrdp_minimal
              → Split into xfce_desktop_minimal_runtime + remote_desktop_xrdp_server
3.1 Key Findings
- Public Surface Conflict: Steps 2 and 5 deploy the public Caddy site for workspace.svc.plus when ai_workspace_security_level != strict, so the Portal is also exposed externally, which conflicts with "Bridge as the only public service".
- Hermes Duplicate Deployment: Step 5 (within the ACP cluster) and Step 10 (standalone) each deploy Hermes once, which is redundant.
- Scattered Version Pinning: OpenClaw and Vault have fixed version variables; NodeJS has them but they are too loose (22.x/24.x); Hermes, QMD, and LiteLLM lack explicit version/source pinning.
4. Key Design Decisions
4.1 Public Surface: Bridge Only
- Bridge is the default and only public service: XWORKMATE_BRIDGE_PUBLIC_ACCESS defaults to true, and the public domain is supplied via XWORKMATE_BRIDGE_DOMAIN (target host acp-bridge.onwalk.net). To disable public access, explicitly set XWORKMATE_BRIDGE_PUBLIC_ACCESS to false.
- xworkspace_console_public_access defaults to false (only public when XWORKSPACE_CONSOLE_PUBLIC_ACCESS=true).
- GATEWAY_OPENCLAW_PUBLIC_ACCESS / VAULT_PUBLIC_ACCESS default to false; the rest (QMD / Hermes / PG / LiteLLM) keep listening locally (127.0.0.1) and do not deploy public Caddy sites.
- Implementation approach: minimal changes. Only adjust default values/switches and align env names (§2.1), without removing the existing public_access capability (the manual override stays available).
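The defaults above could be resolved in the bootstrap shell roughly as follows. This is only a sketch: the helper name resolve_public_access and its key=value output format are assumptions made for illustration; only the environment variable names and their default values come from the list above.

```shell
# Sketch: resolve the public-access switches with the defaults listed above.
# resolve_public_access is a hypothetical helper, not part of the existing scripts.
resolve_public_access() {
  # Bridge is public unless explicitly disabled.
  echo "bridge=${XWORKMATE_BRIDGE_PUBLIC_ACCESS:-true}"
  # Everything else stays private unless explicitly enabled.
  echo "console=${XWORKSPACE_CONSOLE_PUBLIC_ACCESS:-false}"
  echo "openclaw=${GATEWAY_OPENCLAW_PUBLIC_ACCESS:-false}"
  echo "vault=${VAULT_PUBLIC_ACCESS:-false}"
}
```

Printing the resolved values before the playbook runs makes it easy to confirm that only the Bridge is exposed unless an operator has overridden a switch.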
4.2 Hermes Deduplication
- Remove the standalone deploy_agent_hermes.yml import (Step 10) from setup-ai-workspace-all-in-one.yml (the ACP cluster in Step 5 already includes Hermes).
- Keep the deploy_agent_hermes.yml file itself for standalone deployment scenarios; only remove it from the all-in-one aggregation chain.
4.3 Runtime Mode Matrix (docker / k3s / systemd)
Introduce a validation variable ai_workspace_runtime_modes (list), and add an assert guard at the top of the all-in-one without rewriting the deployment logic of each component:
| Constraint | Rule |
|---|---|
| Mutually Exclusive | docker and k3s cannot be present at the same time |
| Composable | docker + systemd is allowed; systemd can be standalone |
| Default | ['docker','systemd'] (most Agent services use systemd, PostgreSQL uses docker compose) |
Component to mode mapping (reusing existing capabilities, no heavy new implementations):
| Component | systemd | docker | k3s |
|---|---|---|---|
| Console / API / ttyd / Bridge / ACP / OpenClaw / QMD / LiteLLM | ✅ Default | — | — |
| PostgreSQL | Optional | ✅ Default docker compose | Optional |
| Vault | vault_deploy_mode=systemd | — | vault_deploy_mode=kubernetes (k3s) |
Guard pseudo-code (place in the top-level play of all-in-one):
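    - name: Validate runtime mode combination
      hosts: all
      gather_facts: false
      tasks:
        - name: Assert that the selected runtime modes form a valid combination
          ansible.builtin.assert:
            that:
              # At least one mode must be selected.
              - ai_workspace_runtime_modes | length > 0
              # Only the three known modes are accepted.
              - ai_workspace_runtime_modes | difference(['docker', 'k3s', 'systemd']) | length == 0
              # docker and k3s are mutually exclusive.
              - not ('docker' in ai_workspace_runtime_modes and 'k3s' in ai_workspace_runtime_modes)
            fail_msg: "docker and k3s are mutually exclusive; please select a valid combination of docker/k3s/systemd."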
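The same rule could also be checked in the bootstrap script before Ansible is invoked, so an invalid combination fails fast. The following is a minimal sketch under that assumption; validate_runtime_modes is a hypothetical helper name, and the modes are passed as positional arguments rather than read from any existing variable.

```shell
# Sketch: shell-side mirror of the runtime mode guard.
# validate_runtime_modes is a hypothetical helper; modes are passed as arguments.
validate_runtime_modes() {
  local has_docker=0 has_k3s=0 mode
  # At least one mode must be selected.
  [ "$#" -gt 0 ] || { echo "no runtime mode selected" >&2; return 1; }
  for mode in "$@"; do
    case "$mode" in
      docker)  has_docker=1 ;;
      k3s)     has_k3s=1 ;;
      systemd) ;;
      *) echo "unknown runtime mode: $mode" >&2; return 1 ;;
    esac
  done
  # docker and k3s cannot be combined.
  if [ "$has_docker" -eq 1 ] && [ "$has_k3s" -eq 1 ]; then
    echo "docker and k3s are mutually exclusive" >&2
    return 1
  fi
  return 0
}
```

For example, `validate_runtime_modes docker systemd` succeeds, while `validate_runtime_modes docker k3s` returns non-zero with a message on stderr.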
4.4 PostgreSQL Default docker compose
- Add switch postgresql_deploy_mode, defaulting to compose.
- compose mode: Add a compose deployment path in roles/vhosts/postgres (fixed image version, reusing existing variables for ports/passwords). It coexists with the current native apt path, and exactly one of the two is selected.
- Do not remove the native apt path (fall back to it by setting postgresql_deploy_mode=native).
4.5 QMD / LiteLLM Source Repo and Version Pinning
- QMD: Installation source points to https://github.com/ai-workspace-services/qmd.git, adding qmd_source_repo / qmd_version variables for pinning.
- LiteLLM: Installation source points to https://github.com/ai-workspace-services/litellm.git, adding litellm_source_repo / litellm_version variables for pinning.
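One way the pinning variables might be consumed is to build the clone command from them and refuse to proceed when no version is set. This is a sketch, not the roles' actual implementation: pinned_clone_cmd is a hypothetical helper, it only prints the command instead of running it, and it assumes the pinned version is a tag or branch name (git clone --branch does not accept a bare commit hash).

```shell
# Sketch: build a pinned clone command from a *_source_repo / *_version pair.
# pinned_clone_cmd is hypothetical; it prints the command so it can be logged
# or reviewed before anything touches the network.
pinned_clone_cmd() {
  local repo="$1" version="$2" dest="$3"
  # An empty version would silently track the default branch, so reject it.
  [ -n "$version" ] || { echo "refusing to clone $repo without a pinned version" >&2; return 1; }
  printf 'git clone --depth 1 --branch %s %s %s\n' "$version" "$repo" "$dest"
}
```

Rejecting an empty version keeps an unset qmd_version or litellm_version from degrading into an unpinned install.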
10. Concurrency Optimization Design (Deep Analysis + Custom Strategy)
Goal: Improve single-machine deployment speed without dropping tasks, breaking existing role structures, or sacrificing stability.
Overall Strategy: Three-phase execution: Phase 1 Sequential (system-global state / lock acquisition) → Phase 2 Concurrent (mutually independent I/O) → Phase 3 Sequential (deterministic closing). Do not blindly convert whole roles to concurrent execution; make async only those tasks that are time-consuming, independent, do not write to the same file, and do not acquire the same lock, and finally close them with async_status.
10.1 Three-Phase Model (Authoritative Definition)
Phase 1 — Must be sequential (grabbing locks / modifying system global state):
apt update, apt install, dpkg related, adding apt repo / keyring, user/group creation, base directory creation, base permissions setting, Docker installation, Caddy installation, systemd base preparation, basic firewall rules, global pip / global npm(-g) installation.
Phase 2 — Can be concurrent (mutually independent, no same-file writing, no same-lock grabbing):
docker pull multiple images, downloading multiple binaries, git clone multiple repos, go build, npm/pnpm install in different directories, frontend builds in different directories, pulling plugins, pulling static assets, generating non-conflicting service configurations, initializing independent working directories for each service, independent prepare scripts for each service.
Phase 3 — Must be sequential (deterministic closing):
Rendering final configurations, systemd daemon-reload, enable service, start/restart in dependency order, health checks, outputting deployment results, cleaning temporary files.
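The ordering of the three phases can be illustrated with a small shell sketch. It is a simplified model rather than the bootstrap implementation: run_three_phases is a hypothetical helper that receives three space-separated lists of command names, runs the first and last lists strictly in order, and runs the middle list concurrently behind a barrier.

```shell
# Sketch: three-phase execution model.
# run_three_phases "<phase1 cmds>" "<phase2 cmds>" "<phase3 cmds>"
# Each argument is a space-separated list of command or function names (hypothetical).
run_three_phases() {
  local cmd pid pids="" rc=0
  # Phase 1: strictly sequential, stop at the first failure.
  for cmd in $1; do "$cmd" || return 1; done
  # Phase 2: start every independent task in the background.
  for cmd in $2; do
    "$cmd" &
    pids="$pids $!"
  done
  # Barrier: every Phase 2 job must finish before Phase 3 starts.
  for pid in $pids; do wait "$pid" || rc=1; done
  [ "$rc" -eq 0 ] || return 1
  # Phase 3: strictly sequential again.
  for cmd in $3; do "$cmd" || return 1; done
}
```

The barrier between Phase 2 and Phase 3 is the important part: Phase 3 never starts while any prefetch is still running, and a single failed prefetch stops the run.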
10.2 Key Customization Conclusions (Deep Analysis for this Playbook)
- All npm -g installs share the same prefix → must be Phase 1 sequential. roles/vhosts/nodejs sets npm_config_prefix=/usr/local/lib/npm; the Agent CLIs (opencode-ai / @google/gemini-cli / @openai/codex / @anthropic-ai/claude-code), yarn, and openclaw@ver all use npm -g with this prefix. Running them concurrently would contend for the same node_modules/.staging and npm cache lock → cannot be concurrent.
- LiteLLM has been changed to a standalone Python 3.13 venv, but its dependency installation must still run sequentially. It no longer writes to system site-packages, but pip install litellm[proxy] has a large dependency tree and a high network failure rate. The default direction should be to consume an offline wheelhouse first, with online venv installation only as a fallback.
- Truly safe Phase 2 candidates are "external I/O prefetching": git clone, binary downloads, docker pull, frontend builds in separate directories, runtime release downloads. They do not touch the dpkg / npm-prefix / pip global locks and write to their own distinct paths.
- The greatest concurrency benefit across sub-playbooks is at the shell prefetch layer: the 11 steps are imported sequentially by Ansible, which makes inter-play concurrency difficult. Lifting parallelizable I/O into the Phase 2 fork pool in bootstrap (§10.5) for prefetching, so that Ansible only consumes ready artifacts, gives the best benefit-to-risk ratio.
- Offline packages priority (addresses TODO): When offline installation packages/imported images exist, Phase 2 prefetching should short-circuit and skip, directly reusing caches.
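The offline short-circuit can be sketched as a thin wrapper around each prefetch command. This is an illustrative assumption, not existing code: prefetch_unless_cached is a hypothetical helper, and it only checks that the artifact path exists; a real implementation would likely also verify a checksum before trusting the cache.

```shell
# Sketch: skip a Phase 2 prefetch when the artifact is already present.
# prefetch_unless_cached <artifact-path> <fetch-command...>   (hypothetical helper)
prefetch_unless_cached() {
  local artifact="$1"
  shift
  if [ -e "$artifact" ]; then
    # Offline package or imported cache found: reuse it and do no network I/O.
    echo "[prefetch] reuse cached $artifact"
    return 0
  fi
  echo "[prefetch] fetching $artifact"
  # Run the fetch command; its exit status becomes the function's status.
  "$@"
}
```

Because the wrapper returns the fetch command's exit status unchanged, it can be started through the same bounded pool as any other prefetch job.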
10.3 Current Tasks → Three-Phase Mapping
| Step / Role | Phase 1 (Sequential) | Phase 2 (Concurrent prefetchable) | Phase 3 (Sequential closing) |
|---|---|---|---|
| 1 nodejs | nodesource keyring/repo, apt install nodejs, npm -g yarn | — | — |
| 2 console | apt(caddy/xfce4/python3/golang-go/chrome)+chrome repo/key, users/dirs/perms | get_url ttyd binary, git clone console, dashboard npm install && build (independent dir) | render systemd unit/env/portal-services.json, daemon-reload/enable/restart, Caddy write+reload |
| 3 ai_agent_runtime | npm -g Agent CLI, global pip(python deps), apt(browser/docs/fonts), Playwright(-g) | agent_skills pull core-skills market (independent dir) | validation/health, register output |
| 4 gateway_openclaw | npm -g openclaw@ver+plugins | (plugins can be concurrent if pulled to independent dirs) | configuration rendering, systemd, version assert, health |
| 5 bridge + ACP | sync console; global install parts of acp_server_* | xworkmate-go-core binary download/placement, acp independent working directory prepare | render configs, start in requires acp-*.service order, validation |
| 6 vault | (systemd base prep) | get_url vault zip download, extract and place | render config, systemd/init, health |
| 7 postgres | Docker install, common base | docker pull PG image, initialize independent data dir | compose render, compose up, health |
| 8 litellm | apt/Homebrew Python prep, Python 3.13 venv creation, offline wheelhouse or fallback pip install | Download litellm-runtime-<distro>-<version>-<arch>.tar.gz, SHA256 validation, prep packages/pip/metadata/runtime.env | Config render, Prisma client generate, systemd/launchd, health(:4000/health) |
| 9 qmd | (bun runtime install, global) | conditional concurrency: pull qmd/bun install (isolated to ~/.bun, does not touch dpkg) | qmd.env/index.yml render, systemd --user, health(:8181) |
| 11 xfce (opt) | apt desktop packages/xrdp/chrome, npm -g/Playwright | — | xrdp service enable/start, session config |
Note: Items marked "conditional concurrency" (such as the qmd bun install) are included in Phase 2 only when they are confirmed to write strictly to the service's own user directory and not to contend for global locks with other installations running at the same time; otherwise, they fall back to Phase 1.
10.4 Ansible Layer async Mode (Retaining all properties)
Within a single play, initiate Phase 2 tasks using poll:0 and centrally close them with async_status. register/when/notify/tags/become/failed_when are always retained:
    # Phase 2: start independent downloads without waiting (poll: 0).
    - name: Download ttyd binary (async)
      ansible.builtin.get_url: { url: "...", dest: "{{ ttyd_path }}", mode: "0755" }
      async: 1800
      poll: 0
      register: ttyd_job
    - name: Clone xworkspace-console (async)
      ansible.builtin.git: { repo: "...", dest: "{{ repo_dir }}", version: main, depth: 1 }
      become_user: "{{ xworkspace_console_user }}"
      async: 1800
      poll: 0
      register: console_clone_job
    # ...other independent Phase 2 tasks initiated with poll:0...
    # Barrier: wait for every job before Phase 3 consumes its output.
    # Polls for up to 120 × 5 s = 600 s per job; raise retries if a job may need the full 1800 s async window.
    - name: Collect async Phase-2 jobs
      ansible.builtin.async_status: { jid: "{{ item }}" }
      register: p2
      until: p2.finished
      retries: 120
      delay: 5
      loop:
        - "{{ ttyd_job.ansible_job_id }}"
        - "{{ console_clone_job.ansible_job_id }}"
- Iron rule for closing: Any Phase 2 product must be finished before being consumed by Phase 3.
- dpkg / global npm / global pip are never async; although the LiteLLM venv installation is no longer a global pip install, it should also run sequentially after the wheelhouse preparation is complete, which makes failures easier to isolate and retry (§10.2).
10.5 Shell Layer Dynamic fork Concurrency (≤ CPU Cores × 2, prefetch layer)
Bootstrap converges parallelizable external I/O into a load-adaptive bounded fork pool, used before ansible (Phase 2 prefetch) and at the summary stage. The hard limit is 2 times the online CPU cores of the target host; AI_WORKSPACE_MAX_PARALLEL_JOBS can set a lower manual limit, defaulting to auto. Before starting each sub-task, it reads the 1-minute load average, dynamically shrinking based on min(manual limit, 2 × CPU - ceil(load1)), reserving at least 1 slot:
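    # Requires bash >= 4.3 (wait -n).
    CPU_COUNT="$(getconf _NPROCESSORS_ONLN)"
    HARD_LIMIT=$((CPU_COUNT * 2))
    MANUAL_LIMIT="${AI_WORKSPACE_MAX_PARALLEL_JOBS:-auto}"
    pids=()
    rc=0
    # Recomputed before each sub-task: min(manual limit, 2 × CPU - ceil(load1)), never below 1.
    dynamic_limit() {
      local load_ceiling limit
      load_ceiling="$(awk '{ n = int($1); print (($1 > n) ? n + 1 : n) }' /proc/loadavg)"
      limit=$((HARD_LIMIT - load_ceiling))
      if [ "$MANUAL_LIMIT" != "auto" ] && [ "$MANUAL_LIMIT" -lt "$limit" ]; then limit="$MANUAL_LIMIT"; fi
      [ "$limit" -ge 1 ] || limit=1
      echo "$limit"
    }
    run_bounded() {
      # Block while the pool is full; exit codes are collected from pids below.
      while [ "$(jobs -rp | wc -l)" -ge "$(dynamic_limit)" ]; do wait -n || true; done
      "$@" &
      pids+=("$!")
    }
    # Phase 2 prefetch: pull 5 repos + download binaries + pull images (short-circuited if offline packages exist)
    for r in playbooks console core-skills qmd litellm; do run_bounded fetch_repo "$r"; done
    for b in ttyd vault xworkmate-go-core; do run_bounded fetch_binary "$b"; done
    for img in "${PG_IMAGES[@]}"; do run_bounded docker_pull "$img"; done
    # Barrier: any failed sub-task fails the whole phase.
    for p in "${pids[@]}"; do wait "$p" || rc=1; done
    [ "$rc" -eq 0 ] || { echo "[phase2] Sub-tasks failed"; exit 1; }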
- Health check fan-out (before summary): Use the same dynamic limit for the systemctl is-active + curl health endpoints of Portal/Bridge/OpenClaw/QMD/Hermes/PG/Vault/LiteLLM, summarizing them in a fixed order.
- Each child process has a log prefix ([repo:qmd] / [bin:vault]), exits non-zero on failure, and is not silenced.
- Sequential preservation: The main ansible-playbook execution (Phase 1/Phase 3 guaranteed internally) and the one-time token/summary printing stay sequential.
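The health check fan-out with a fixed-order summary might look like the sketch below. It is an illustration under stated assumptions: health_fanout is a hypothetical helper, the check function is supplied by the caller, and for brevity the probes are started directly rather than through the bounded pool. Each probe writes its result to its own file so that the summary can be printed in argument order regardless of which probe finishes first.

```shell
# Sketch: run health probes concurrently, then summarize in a fixed order.
# health_fanout <check-function> <service...>   (hypothetical helper)
# Note: the bare `wait` also waits for any other background jobs of the calling shell.
health_fanout() {
  local check="$1" dir svc result rc=0
  shift
  dir="$(mktemp -d)"
  for svc in "$@"; do
    # Each probe records its own result; nothing is printed out of order.
    ( if "$check" "$svc"; then echo ok; else echo FAIL; fi > "$dir/$svc" ) &
  done
  wait
  # Summary follows the argument order, not the completion order.
  for svc in "$@"; do
    result="$(cat "$dir/$svc")"
    printf '%-10s %s\n' "$svc" "$result"
    [ "$result" = "ok" ] || rc=1
  done
  rm -rf "$dir"
  return "$rc"
}
```

A real check function would probably combine systemctl is-active with something like curl -fsS --max-time 5 against the service's endpoint; apart from LiteLLM's :4000/health, the endpoint paths are not specified here and would need to be confirmed per service.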
10.6 Content that Must Not Be Lost (Hard Constraints)
Retain all existing tasks and properties one by one: apt/package, users/dirs/perms, env files, systemd unit rendering, Caddy/Nginx, Docker/compose, service starts, health checks, debug, failure handling, handlers, tags, become, when, notify, register. Do not delete/merge/skip any existing task for the sake of concurrency; only change "when to wait" (poll:0+async_status), not "what to do".
10.7 Safe Global Acceleration (Complementary to async, does not change task semantics)
ansible.cfg (already exists) can overlay low-risk items:
    [defaults]
    gathering = smart
    fact_caching = jsonfile
    fact_caching_connection = /tmp/ansible_facts
    [ssh_connection]
    pipelining = true
This also addresses the TODO concerns: APT/deployment locks must be waited on safely (retry rather than forcibly deleting the lock) so that a second, idempotent run succeeds. strategy: free offers limited benefit on a single machine and changes execution behavior, so it is not enabled by default.
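Safe lock waiting at the shell layer can be approximated with a bounded retry loop instead of removing lock files. The sketch below is generic: retry_with_wait is a hypothetical helper, and the attempt count and delay in the usage comment are placeholder values rather than settings taken from the existing scripts.

```shell
# Sketch: retry a command with a fixed delay instead of deleting lock files.
# retry_with_wait <max-attempts> <delay-seconds> <command...>   (hypothetical helper)
retry_with_wait() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "[retry] giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "[retry] attempt $attempt failed, waiting ${delay}s: $*" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
# Possible usage (placeholder values): retry_with_wait 30 10 apt-get install -y caddy
```

Inside Ansible, the apt module's lock_timeout parameter provides similar waiting behavior without a shell wrapper.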
10.8 Acceptance (Equivalence Regression)
- The task sets from ansible-playbook --list-tasks are identical before and after optimization (no loss/merges).
- Every async task has a corresponding async_status close, with no dangling jobs.
- Phase 1 (apt/global npm/global pip/dpkg, LiteLLM venv install) and Phase 3 (daemon-reload/enable/start/health/summary/cleanup) remain strictly sequential.
- Phase 2 tasks do not write to the same file or grab the same lock; they short-circuit and skip when offline packages exist.
- Two consecutive executions both succeed; the idempotent behavior of changed=0 remains unchanged; failed sub-tasks in the Shell fork pool exit non-zero with visible logs.
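The first acceptance item can be checked mechanically by capturing the --list-tasks output before and after the change and comparing the two as sets. This is a sketch: same_task_set is a hypothetical helper and the file names in the usage comment are placeholders. Sorting before comparing means a task that merely moved between phases does not count as a difference, while a lost or merged task does.

```shell
# Sketch: compare two saved `ansible-playbook --list-tasks` outputs as sets.
# same_task_set <before-file> <after-file>   (hypothetical helper)
same_task_set() {
  local a b rc
  a="$(mktemp)"
  b="$(mktemp)"
  # Sort so that only added or removed lines show up, not reordering.
  sort "$1" > "$a"
  sort "$2" > "$b"
  # diff prints any missing or extra task lines and returns non-zero on mismatch.
  diff "$a" "$b"
  rc=$?
  rm -f "$a" "$b"
  return "$rc"
}
# Possible usage (placeholder file names):
#   ansible-playbook setup-ai-workspace-all-in-one.yml --list-tasks > before.txt
#   ...apply the optimization, capture after.txt the same way...
#   same_task_set before.txt after.txt
```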