playbooks

Author	SHA1	Message	Date
Haitao Pan	55a05da3bf	feat: add XWorkmate install redirect (#23) Co-authored-by: Haitao Pan <manbuzhe2009@qq.com>	2026-06-29 15:47:04 +08:00
Haitao Pan	477b52c516	fix(acp_server_opencode): detect opencode CLI at deploy time (portable across Debian/Ubuntu/macOS) (#22) Stop assuming a fixed opencode path. Probe the real binary with 'command -v' using the role PATH, then feed the resolved path to both the systemd unit and the launchd plist (plist now also passes -opencode-bin). Falls back to the OS-aware default when opencode is not yet installed. Also remove the dead acp-bridge.service.j2 template: it was not deployed by any task and referenced two undefined vars (acp_opencode_bridge_disabled_binary_path, acp_opencode_bridge_opencode_binary_path) — a hardcoding landmine. Co-authored-by: Haitao Pan <manbuzhe2009@qq.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 15:31:54 +08:00
Haitao Pan	4364786465	fix(acp_server_opencode): service PATH + bin var + surface adapter crash in validate (#21) ACP readiness probe returned 000 for the full retry window on xworkmate-bridge-ubuntu-26 (nothing listening = adapter crash-loop), but the play aborted at the probe so the real cause never reached the CI log. - systemd unit: add Environment=PATH ({{ acp_opencode_path }}, parity with the launchd plist) so the lazily-spawned opencode/node CLI resolves; replace the hardcoded --opencode-bin /usr/bin/opencode with {{ acp_opencode_binary_path }} ({{ npm_global_bin }}/opencode), matching the gemini/codex roles and macOS. - validate.yml: wrap the readiness probe in block/rescue that dumps systemctl status + journalctl on failure, so the adapter crash reason is visible. - fix latent undefined var in the summary (acp_opencode_adapter_http -> acp_opencode_adapter_probe), which would have errored once the endpoint came up. Co-authored-by: Haitao Pan <manbuzhe2009@qq.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 15:25:32 +08:00
Haitao Pan	d806ba9d3d	fix: update litellm mainstream models registration and gateway defaults	2026-06-27 14:49:18 +08:00
Haitao Pan	a2ce5b9d05	fix(cloudflare): prefer DNS scoped token	2026-06-27 13:48:19 +08:00
Haitao Pan	19a3c9f72a	fix(macos): select architecture Homebrew explicitly	2026-06-27 12:45:34 +08:00
Haitao Pan	5c74feb860	fix(cloudflare_dns): prefer CLOUDFLARE_API_TOKEN over CLOUDFLARE_DNS_API_TOKEN Align the DNS role's token resolution with the rest of the stack, which exports the generic CLOUDFLARE_API_TOKEN. The dedicated *_DNS_API_TOKEN now acts as the fallback, both for play vars and the environment lookup. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 11:31:08 +08:00
Haitao Pan	abee312617	fix(xfce/nodejs): explicit nodejs_version fallback (omit sentinel leaked into repo URL) Previous default(omit) was wrong: in include_role vars, omit does not fall back to the role default — it injects the omit placeholder, which rendered as node_<<Omit>>.x in the NodeSource apt repo URL and failed apt update. Use an explicit fallback to the nodejs role's documented default (22.22.3). Avoids both the 2.19 self-reference recursion and the omit-sentinel leak. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-26 10:56:36 +08:00
Haitao Pan	cd9a783de7	fix(xworkmate_bridge): align Caddy SSE timeouts with bridge 60min max wait Caddy /acp* used read/write_timeout 30m while the bridge max gateway wait is 60min, so long tasks had their SSE killed at the edge (ACP_HTTP_CONNECTION_CLOSED) while OpenClaw kept running. /api, /artifacts/ and / also lacked flush_interval and long timeouts, making polling/streaming fragile. - T1: introduce xworkmate_bridge_acp_stream_timeout (70m = 60min cap + grace), acp_dial_timeout, acp_upstream_keepalive; drive /acp* read/write_timeout from it. - T2: apply flush_interval -1 + the same long timeouts to /api, /artifacts/, /. - Update validate.yml assertions to reference the vars instead of hardcoded 30m. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 10:49:01 +08:00
Haitao Pan	8fcff61855	fix(ai_agent_runtime): resolver must verify browser actually runs, skip disabled stub The Chromium resolver accepted any candidate that merely existed (command -v / -x), so it selected xfce's intentionally-disabled /usr/local/bin/chromium stub (exits 126 "Chromium is disabled, use google-chrome") over the working google-chrome. The later "Check chromium version" verify then failed rc=126. Latent on fresh hosts (depends on role ordering vs the stub install) and deterministic on any re-run. Now require `<candidate> --version` to succeed before accepting, so the stub is skipped and google-chrome is resolved. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-26 10:42:06 +08:00
Haitao Pan	5d00d700ca	fix(xfce/nodejs): drop self-referential nodejs_version (Ansible 2.19 recursion) include_role passed `nodejs_version: "{{ ai_agent_runtime_nodejs_version | default(nodejs_version) }}"` — a var named nodejs_version whose template references nodejs_version itself. Ansible 2.19+'s lazy templating detects the self-reference in the AST and fails the nodejs role's `nodejs_version_major` set_fact with "Recursive loop detected: maximum recursion depth exceeded". Use default(omit) so the nodejs role's own default applies when the ai_agent_runtime override is absent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-26 10:34:52 +08:00
Haitao Pan	50dba213ee	feat: implement postgresql.svc.plus docker deployment role	2026-06-26 10:00:00 +08:00
Haitao Pan	c62386f30c	fix(postgres): own PGDATA by container uid so re-runs don't break access On re-run, "Ensure compose directories exist" reset the bind-mounted data dir to root:root 0700. The official postgres image only chowns/initdb's an EMPTY PGDATA, so a non-empty data dir stayed root-owned while the backend runs as uid 999 -> "could not open file global/pg_filenode.map: Permission denied" (pg_isready still passes, masking it; ALTER USER / real queries fail). Split the dir task: compose project dir stays root:root; data dir is created owned by postgresql_container_uid/gid (default 999), idempotent across re-runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-25 22:42:11 +08:00
Haitao Pan	29e60383e3	fix(xfce_browser): allow_downgrade on Chrome install to avoid downgrade hard-fail When a host's Chrome apt repo already carries a newer build than a pinned version, apt refuses with "Packages were downgraded and -y was used without --allow-downgrades". Set allow_downgrade: true so an explicit (older-but-available) pin installs cleanly. Complements the empty-default fix (`e174e8b`): default path installs latest, pinned path now also robust. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-25 22:34:35 +08:00
Haitao Pan	e174e8bcfa	fix(xfce_browser): stop pinning Chrome build + fix broken availability regex Deploy failed on ubuntu26 with "no available installation candidate for google-chrome-stable=149.0.7827.114-1": Google's apt repo only ever carries the current stable, so any pinned build vanishes within weeks. Two fixes: - defaults: xfce_google_chrome_version "" (install latest google-chrome-stable); pin is opt-in and now safe (auto-falls back to latest when the pin is gone). - browser.yml: the madison availability guard used POSIX [[:space:]], which Python re does not support, so it never matched ' | ' separators. Replace with \s — verified: empty->latest, pinned+available->pin, pinned+gone->latest. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-25 22:14:46 +08:00
Haitao Pan	5aadb4f0dc	fix(xfce): fall back when pinned chrome apt version is unavailable	2026-06-25 20:32:47 +08:00
Haitao Pan	c9919284e0	fix(bridge): avoid embedded templates in caddy assertion	2026-06-25 20:26:38 +08:00
Haitao Pan	5984a75643	fix(litellm): provision Python 3.13 via uv when system python >=3.14 litellm's pinned fork requires Python <3.14; Ubuntu 26.04 ships 3.14 with no 3.13/3.12 in apt, so the venv pip install fails ('requires a different Python'). When the bootstrap interpreter is >=3.14, install a standalone Python 3.13 via uv, rebuild the venv with it, and proceed. Debian 13 (3.13) is unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 21:23:27 +08:00
Haitao Pan	c7bc68a6dc	fix(acp_server_opencode): robust curl-retry for ACP endpoint readiness The uri probe ran 1s after the service (re)start while the adapter still accepts TCP but doesn't yet answer (read hangs); uri's default 30s timeout + retries/until did not actually loop on a connection timeout, so it failed after one attempt. Replace with a curl retry loop (5s per attempt, up to ~30 tries) — the adapter answers acp.capabilities in ~4ms once ready. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 21:21:37 +08:00
Haitao Pan	609a88ddcf	feat(bridge): fail fast when bridge domain is empty/non-FQDN under Caddy exposure Non-empty pass-through check: xworkmate_bridge_domain feeds /etc/hostname and the caddy site name; an empty/non-FQDN/127.0.0.1 value yields an invalid Caddyfile. Assert a valid FQDN when caddy_enabled (public ingress), with a clear remediation message (set XWORKMATE_BRIDGE_DOMAIN or provide CMDB service_domains). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:50:19 +08:00
Haitao Pan	40b7975061	fix(common): install fail2ban via apt on Debian so module_defaults lock_timeout renders Same class as bridge/litellm: ansible.builtin.package dispatched to apt inherits the play's templated module_defaults.apt.lock_timeout un-rendered -> int conversion error -> on-host bootstrap aborts before litellm/qmd. Use apt on Debian, keep package for non-Debian (yum/dnf doesn't inherit the apt default). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 16:05:40 +08:00
Haitao Pan	3709074916	feat(bridge): set host FQDN + caddy site from XWORKMATE_BRIDGE_DOMAIN or CMDB service_domains - xworkmate_bridge_domain falls back to the first CMDB service_domains entry (inventory hostvar / pipeline-injected env) before ai_workspace_public_domain. - New task sets the host's /etc/hostname (and running hostname) to that FQDN on Linux when it's a valid FQDN — never 127.0.0.1/localhost. The caddy site (xworkmate-bridge-site.caddy.j2) already uses the same var. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 15:56:30 +08:00
Haitao Pan	c3a0e40566	fix(bridge,litellm): use apt on Debian so module_defaults lock_timeout renders The runtime plays set module_defaults.apt.lock_timeout to a templated value. When a prerequisite task uses ansible.builtin.package (which dispatches to apt on Debian), that templated default is NOT rendered and the literal '{{ ai_workspace_apt_lock_timeout | default(900) | int }}' reaches apt -> 'lock_timeout is of type str ... cannot be converted to an int' -> the whole on-host bootstrap aborts at the xworkmate-bridge prereq, before litellm/qmd ever deploy (hence they were never up). Fix: install prereqs via ansible.builtin.apt on Debian/Ubuntu (template renders like every other apt task); keep ansible.builtin.package for non-Debian Linux (dispatches to yum/dnf, which doesn't inherit the apt default). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 15:54:09 +08:00
Haitao Pan	c3f3b8ac8e	refactor(agent_skills): run on target host, git-clone sources, drop delegate_to localhost Make the role work identically under both execution models: - local/pull (curl|bash -> ansible-playbook -c local; localhost == host) - remote controller (ansible-playbook -i inventory over ssh; tasks run on host) Changes: - Remove ALL delegate_to: localhost (the old raw 'command: rsync' detected local-vs-remote via ansible_connection, but delegate_to localhost forced it to 'local', so the user@host push branch was dead code -> remote runs wrote to the controller's /root and failed). - Acquire xworkspace-core-skills via ansible.builtin.git clone ON THE HOST (most universal/cross-platform), instead of requiring a controller-side dir. - Merge core skills into the canonical dir with ansible.builtin.copy (remote_src, host-local) instead of raw rsync; installer adapters install directly into the canonical dir on the host. - Drop rsync-only vars/excludes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 14:57:49 +08:00
Haitao Pan	3505ff1c31	fix(ai-workspace): deploy robustness on Debian13/Ubuntu26.04 (py3.13) - setup-xworkspace-console.yaml: - xworkspace_console_user follows ansible_env.USER (was hardcoded ubuntu; mismatched home=/root on root connections -> systemd link 'src does not exist') - runtime apt task async/poll (xfce4 desktop install dropped the SSH session) - api_dir -> bin/ to match prebuilt runtime manifest (apiBinary: bin/xworkspace-api; was api/ -> 203/EXEC crash loop) - roles/ai_agent_runtime/tasks/{main,docs,fonts,browser}.yml: apt lock_timeout (texlive/pandoc raced cloud-init/unattended-upgrades for the dpkg lock) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 03:02:43 +08:00
Haitao Pan	a5e19eff60	chore: qmd version bump, macOS container runtime deps, ignore inventory pycache - roles/vhosts/common: add docker/docker-compose/colima to macOS brew deps (headless container runtime for qmd PG memory-bridge tests) - roles/vhosts/qmd: bump qmd_version - .gitignore: ignore inventory/__pycache__/ Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 21:01:57 +08:00
Haitao Pan	099a144a9e	fix(xworkmate-bridge): define missing xworkmate_bridge_caddy_base_dir xworkmate_bridge_obsolete_caddy_fragment_paths references xworkmate_bridge_caddy_base_dir, but the var was never defined, so the 'Inspect deprecated ACP Caddy fragment' task aborted with 'xworkmate_bridge_caddy_base_dir is undefined'. Define it from the global caddy_config_dir (consistent with the role's other caddy paths), which is already OS-aware (/etc/caddy on Linux, Homebrew prefix on macOS). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 14:42:20 +08:00
Haitao Pan	f5a5979439	fix(acp-gemini): create runtime dirs so service WorkingDirectory exists acp-gemini.service sets WorkingDirectory={{ acp_gemini_workdir }} (~/.gemini) but the role never created it, so systemd failed at step CHDIR (status 200/CHDIR), the adapter never bound 127.0.0.1:8791, and the CORS preflight validation failed after 30 retries. Mirror the opencode role: pre-create the home, .gemini workdir, XDG config and state dirs owned by the service user. Linux/Debian only (guarded != Darwin); macOS uses the launchd path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 14:31:38 +08:00
Haitao Pan	9e81f65a62	fix(openclaw): pull multi-session plugin runtime from deterministic runtime-latest asset The download used releases/latest/download, which GitHub resolves to the human-facing v0.1.12 tag (no runtime asset) -> HTTP 404, failing the deploy on Ubuntu 26.04 (and any platform). Point at the stable runtime-latest release published by the plugin repo's runtime-release workflow, and add a bounded retry around the download for transient network errors. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 14:03:23 +08:00
Haitao Pan	e0bfc765bf	feat(litellm): make model registration idempotent via fallback to /model/update	2026-06-23 13:43:42 +08:00
Haitao Pan	4e183d2d44	fix(litellm): resolve os.environ variables locally before registering models to DB	2026-06-23 13:27:09 +08:00
Haitao Pan	28df3b59d6	feat(openclaw): conditionally render default UI models and providers based on active API keys	2026-06-23 13:09:56 +08:00
Haitao Pan	a0d59c0af1	feat(openclaw): adopt native provider simulation pointing to litellm gateway	2026-06-23 12:42:04 +08:00
Haitao Pan	25b8204b7b	fix(openclaw): use hyphens for litellm models to prevent provider intercept	2026-06-23 12:21:45 +08:00
Haitao Pan	6e260a3425	feat(litellm): ensure deepseek-chat and deepseek-reasoner are registered	2026-06-23 12:18:24 +08:00
Haitao Pan	e7c96675ff	feat(litellm): update model registrations and gateway configurations with API key gating	2026-06-23 11:04:21 +08:00
Haitao Pan	01f1499a60	feat(ai-workspace): consume prebuilt console runtime for final deployment The macOS console API previously ran via `go run .`, which fails under launchd's minimal PATH (no `go`) and recompiles on every launch. Switch to the same prebuilt-runtime consumption model the bridge/qmd/litellm runtimes already use. The ai-workspace role now does final deployment only (never builds): - download xworkspace-console-runtime-<os>-<arch>.tar.gz (incl. darwin-arm64) from the latest-runtime release, or use an offline-staged archive via XWORKSPACE_CONSOLE_RUNTIME_ARCHIVE; - unpack to a per-user system dir (~/.local/share/xworkspace-console), idempotent via a sha256 marker; - read manifest.json to resolve the prebuilt API binary and assert it is a present, executable native binary; - on macOS, deploy a LaunchAgent that sources portal.env and execs the prebuilt binary directly — no go, no Homebrew, no PATH games. The Go API is pure-Go (no cgo), so CI cross-compiles darwin-arm64 cleanly; this role only consumes that artifact. Validated end-to-end on darwin-arm64: packaged binary serves :8788 (200 with token, 401 without) under launchd. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 17:04:55 +08:00
Haitao Pan	a5850cfcee	fix(acp_server_gemini): revert incompatible adapter command syntax and update args for antigravity-cli	2026-06-22 13:59:52 +08:00
Haitao Pan	2a85be5c9b	fix(xworkmate_bridge): remove obsolete IMAGE variable causing undefined errors	2026-06-22 13:55:14 +08:00
Haitao Pan	32e00a8617	fix(litellm,validation): refine model registration and add cross-platform service validation	2026-06-22 13:52:05 +08:00
Haitao Pan	1b2aea005a	Merge branch 'refactor/upgrade-antigravity-cli' # Conflicts: # roles/vhosts/acp_server_gemini/defaults/main.yml # roles/vhosts/acp_server_gemini/templates/gemini.plist.j2	2026-06-22 13:26:30 +08:00
Haitao Pan	93a3067ea4	Merge branch 'codex/openclaw-playbook-concurrency' # Conflicts: # roles/vhosts/gateway_openclaw/templates/openclaw.json.j2 # roles/vhosts/xworkmate_bridge/defaults/main.yml	2026-06-22 13:25:45 +08:00
Haitao Pan	9926a46f76	fix(litellm): percent-encode DB password in DATABASE_URL LiteLLM crash-looped on macOS with Prisma `P1013: invalid port number in database URL`. The shared auth token is generated by `openssl rand -base64` and can contain '/', '+' or '='; injected raw into the DATABASE_URL userinfo, a '/' truncates the authority so the port parses as invalid and proxy startup fails (port 4000 never binds). Percent-encode the password for the DATABASE_URL only, via an explicit reserved-set replace chain ('%' first to avoid double-encoding) since Jinja's urlencode leaves '/' unescaped. The DB user password stays raw in provision-database and LITELLM_DB_PASSWORD, and the URL form decodes back to the identical secret (verified round-trip), so authentication is unchanged. No effect when no DB host is configured. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 12:56:56 +08:00
Haitao Pan	6091b9dbcf	fix(qmd): pin Homebrew node@24 for build and status on macOS `qmd status` aborted with ERR_DLOPEN_FAILED — better-sqlite3 was compiled against NODE_MODULE_VERSION 137 (node@24) but the validate-status task ran under nvm's Node 20 (NODE_MODULE_VERSION 115), because the user's PATH puts nvm node ahead of Homebrew and the task pinned no PATH. Pin `/opt/homebrew/bin` (node@24) ahead of nvm on Darwin for the npm install, npm build, and validate-status tasks so the native module is built and loaded against one consistent Node ABI — the same node@24 the launchd plist already uses. Linux PATH is left unchanged via an ansible_os_family conditional. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 12:43:05 +08:00
Haitao Pan	d9033960fd	fix(qmd): drop undefined nodejs_version from macOS LaunchAgent PATH The QMD launchd plist hardcoded an NVM node path (`~/.nvm/versions/node/{{ nodejs_version }}/bin`), but `nodejs_version` is never defined in the Homebrew-based macOS deploy, so "Deploy QMD LaunchAgent" aborted with `AnsibleUndefinedVariable: 'nodejs_version' is undefined`. QMD is a bun binary and the Linux user unit already uses `.bun/bin:.local/bin:...`. Mirror that for the plist PATH and add the Homebrew prefix (`/opt/homebrew/bin`) for the brew-installed node@24, removing the nvm/nodejs_version dependency entirely (same remedy as the console plist in TC-MAC-005). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 12:32:58 +08:00
Haitao Pan	bbf5260f0d	fix(litellm): put venv bin on PATH for prisma generate on macOS `prisma generate` invokes the `prisma-client-py` generator as a `/bin/sh` subprocess, which is resolved via PATH. Even though the role calls the absolute venv `prisma` binary, the generator console script lives in the same venv bin dir that is not on the default command PATH, so generation failed with "prisma-client-py: command not found" on macOS. Add an `environment.PATH` that prepends the venv bin dir (plus Homebrew prefixes) so the generator subprocess resolves. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 12:17:12 +08:00
Haitao Pan	ce2070e779	fix(litellm): repair macOS dependency version probe one-liner The "Inspect installed LiteLLM dependency versions" probe was written as a multi-line Python program under YAML `>-` folding, which collapses every newline into a space. The resulting single logical line contained a `for ... : try: ... except:` block, which is a SyntaxError. With `failed_when: false` the failure was swallowed, leaving stdout empty, and the subsequent `set_fact` crashed in `from_json('')` with "Expecting value: line 1 column 1 (char 0)". Rewrite the probe as a genuinely single-line program (dict/list comprehensions over importlib.metadata.distributions(), joined by `;`), and harden the decision `set_fact` with `default('{}', true)` so an empty or malformed probe degrades to "install required" instead of aborting the play. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 12:16:57 +08:00
Haitao Pan	f4a30b9e01	fix(litellm): resilient online dependency install litellm[proxy] pulls large wheels (polars-runtime ~46MB) that break mid-stream over slow/mirrored links with IncompleteRead, failing the deploy. Add pip --retries/--resume-retries (resumes partial downloads) + longer timeout, tunable via litellm_pip_* vars, and upgrade pip in the venv first so --resume-retries (pip>=25.1) exists.	2026-06-22 02:42:51 +00:00
Haitao Pan	6a2f05f435	fix(litellm): skip redundant dependency installs	2026-06-21 22:34:34 +08:00
Haitao Pan	71ebe6444c	fix(litellm): isolate runtime in Python 3.13 venv	2026-06-21 21:15:21 +08:00
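
Commit 9926a46f76 above describes percent-encoding the database password before it is placed in DATABASE_URL, because a raw '/' in the userinfo truncates the authority. The following is a minimal Python sketch of that idea, not the role's actual code: the role uses a Jinja replace chain, whereas this sketch swaps in the standard library's `urllib.parse.quote`. The function name `build_database_url`, the sample secret, and the host, port, and database values are all hypothetical.

```python
from urllib.parse import quote, unquote, urlsplit


def build_database_url(user: str, password: str, host: str, port: int, db: str) -> str:
    """Build a PostgreSQL URL with the password percent-encoded.

    safe="" makes quote() escape '/' as well; the default (safe="/")
    would leave the slash raw, which is the failure the commit describes.
    """
    encoded = quote(password, safe="")
    return f"postgresql://{user}:{encoded}@{host}:{port}/{db}"


# Hypothetical secret containing characters that base64 output can produce.
secret = "ab/c+d="
url = build_database_url("litellm", secret, "127.0.0.1", 5432, "litellm")
print(url)

# The encoded URL parses with the intended port, and the password
# decodes back to the original secret (round trip).
parts = urlsplit(url)
print(parts.port)
print(unquote(parts.password) == secret)
```

Only the URL form is encoded here; the raw secret is left untouched for any place that consumes the password directly, which matches the behaviour the commit message describes.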


384 Commits