Commit Graph

474 Commits

Author SHA1 Message Date
55a05da3bf
feat: add XWorkmate install redirect (#23)
Co-authored-by: Haitao Pan <manbuzhe2009@qq.com>
2026-06-29 15:47:04 +08:00
477b52c516
fix(acp_server_opencode): detect opencode CLI at deploy time (portable across Debian/Ubuntu/macOS) (#22)
Stop assuming a fixed opencode path. Probe the real binary with 'command -v'
using the role PATH, then feed the resolved path to both the systemd unit and
the launchd plist (plist now also passes -opencode-bin). Falls back to the
OS-aware default when opencode is not yet installed.

Also remove the dead acp-bridge.service.j2 template: it was not deployed by any
task and referenced two undefined vars (acp_opencode_bridge_disabled_binary_path,
acp_opencode_bridge_opencode_binary_path) — a hardcoding landmine.

Co-authored-by: Haitao Pan <manbuzhe2009@qq.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:31:54 +08:00
4364786465
fix(acp_server_opencode): service PATH + bin var + surface adapter crash in validate (#21)
ACP readiness probe returned 000 for the full retry window on
xworkmate-bridge-ubuntu-26 (nothing listening = adapter crash-loop), but the
play aborted at the probe so the real cause never reached the CI log.

- systemd unit: add Environment=PATH ({{ acp_opencode_path }}, parity with the
  launchd plist) so the lazily-spawned opencode/node CLI resolves; replace the
  hardcoded --opencode-bin /usr/bin/opencode with {{ acp_opencode_binary_path }}
  ({{ npm_global_bin }}/opencode), matching the gemini/codex roles and macOS.
- validate.yml: wrap the readiness probe in block/rescue that dumps systemctl
  status + journalctl on failure, so the adapter crash reason is visible.
- fix latent undefined var in the summary (acp_opencode_adapter_http ->
  acp_opencode_adapter_probe), which would have errored once the endpoint came up.

Co-authored-by: Haitao Pan <manbuzhe2009@qq.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:25:32 +08:00
e953d87f07
ci: add release/* branch source validation workflow (#19)
release/* only accepts PRs from hotfix/* or PRs labeled cherry-pick/backport.
See iac_modules/docs/tldr-github-branch-model.md for details.

Co-authored-by: Haitao Pan <manbuzhe2009@qq.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 12:12:33 +08:00
Haitao Pan
d806ba9d3d fix: update litellm mainstream models registration and gateway defaults 2026-06-27 14:49:18 +08:00
a2ce5b9d05 fix(cloudflare): prefer DNS scoped token 2026-06-27 13:48:19 +08:00
19a3c9f72a fix(macos): select architecture Homebrew explicitly 2026-06-27 12:45:34 +08:00
Haitao Pan
5c74feb860 fix(cloudflare_dns): prefer CLOUDFLARE_API_TOKEN over CLOUDFLARE_DNS_API_TOKEN
Align the DNS role's token resolution with the rest of the stack, which
exports the generic CLOUDFLARE_API_TOKEN. The dedicated *_DNS_API_TOKEN now
acts as the fallback, both for play vars and the environment lookup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 11:31:08 +08:00
Haitao Pan
9b59c89d80 fix(console): expose Homebrew Go to macOS API service 2026-06-27 09:18:03 +08:00
Haitao Pan
abee312617 fix(xfce/nodejs): explicit nodejs_version fallback (omit sentinel leaked into repo URL)
Previous default(omit) was wrong: in include_role vars, omit does not fall back
to the role default — it injects the omit placeholder, which rendered as
node_<<Omit>>.x in the NodeSource apt repo URL and failed apt update. Use an
explicit fallback to the nodejs role's documented default (22.22.3). Avoids both
the 2.19 self-reference recursion and the omit-sentinel leak.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-26 10:56:36 +08:00
Haitao Pan
cd9a783de7 fix(xworkmate_bridge): align Caddy SSE timeouts with bridge 60min max wait
Caddy /acp* used read/write_timeout 30m while the bridge max gateway wait is
60min, so long tasks had their SSE killed at the edge (ACP_HTTP_CONNECTION_CLOSED)
while OpenClaw kept running. /api*, /artifacts/* and / also lacked flush_interval
and long timeouts, making polling/streaming fragile.

- T1: introduce xworkmate_bridge_acp_stream_timeout (70m = 60min cap + grace),
  acp_dial_timeout, acp_upstream_keepalive; drive /acp* read/write_timeout from it.
- T2: apply flush_interval -1 + the same long timeouts to /api*, /artifacts/*, /.
- Update validate.yml assertions to reference the vars instead of hardcoded 30m.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 10:49:01 +08:00
Haitao Pan
8fcff61855 fix(ai_agent_runtime): resolver must verify browser actually runs, skip disabled stub
The Chromium resolver accepted any candidate that merely existed (command -v /
-x), so it selected xfce's intentionally-disabled /usr/local/bin/chromium stub
(exits 126 "Chromium is disabled, use google-chrome") over the working
google-chrome. The later "Check chromium version" verify then failed rc=126.
Latent on fresh hosts (depends on role ordering vs the stub install) and
deterministic on any re-run. Now require `<candidate> --version` to succeed
before accepting, so the stub is skipped and google-chrome is resolved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-26 10:42:06 +08:00
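
The resolver change in 8fcff61855 can be sketched roughly as follows. This is a minimal Python illustration of the idea (accept a candidate only if `<candidate> --version` exits 0), not the role's actual Ansible tasks. The function names and the `probe` parameter are hypothetical, and the candidate paths in the usage note are only examples.

```python
import subprocess
from typing import Callable, Iterable, Optional


def runs_ok(candidate: str) -> bool:
    """Return True only if `<candidate> --version` exits with status 0."""
    try:
        result = subprocess.run(
            [candidate, "--version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, not executable, or hung: treat as unusable.
        return False
    return result.returncode == 0


def resolve_browser(
    candidates: Iterable[str],
    probe: Callable[[str], bool] = runs_ok,
) -> Optional[str]:
    """Pick the first candidate that actually runs, not one that merely exists."""
    for candidate in candidates:
        if probe(candidate):
            return candidate
    return None
```

With a stub that exits 126 listed first, `resolve_browser(["/usr/local/bin/chromium", "google-chrome"])` would skip the stub and return the second entry, provided that binary really answers `--version`.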
Haitao Pan
5d00d700ca fix(xfce/nodejs): drop self-referential nodejs_version (Ansible 2.19 recursion)
include_role passed `nodejs_version: "{{ ai_agent_runtime_nodejs_version |
default(nodejs_version) }}"` — a var named nodejs_version whose template
references nodejs_version itself. Ansible 2.19+'s lazy templating detects the
self-reference in the AST and fails the nodejs role's `nodejs_version_major`
set_fact with "Recursive loop detected: maximum recursion depth exceeded".
Use default(omit) so the nodejs role's own default applies when the
ai_agent_runtime override is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-26 10:34:52 +08:00
Haitao Pan
50dba213ee feat: implement postgresql.svc.plus docker deployment role 2026-06-26 10:00:00 +08:00
Haitao Pan
c62386f30c fix(postgres): own PGDATA by container uid so re-runs don't break access
On re-run, "Ensure compose directories exist" reset the bind-mounted data dir
to root:root 0700. The official postgres image only chowns/initdb's an EMPTY
PGDATA, so a non-empty data dir stayed root-owned while the backend runs as uid
999 -> "could not open file global/pg_filenode.map: Permission denied" (pg_isready
still passes, masking it; ALTER USER / real queries fail).

Split the dir task: compose project dir stays root:root; data dir is created
owned by postgresql_container_uid/gid (default 999), idempotent across re-runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-25 22:42:11 +08:00
Haitao Pan
29e60383e3 fix(xfce_browser): allow_downgrade on Chrome install to avoid downgrade hard-fail
When a host's Chrome apt repo already carries a newer build than a pinned version,
apt refuses with "Packages were downgraded and -y was used without
--allow-downgrades". Set allow_downgrade: true so an explicit (older-but-available)
pin installs cleanly. Complements the empty-default fix (e174e8b): default path
installs latest, pinned path now also robust.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-25 22:34:35 +08:00
Haitao Pan
e174e8bcfa fix(xfce_browser): stop pinning Chrome build + fix broken availability regex
Deploy failed on ubuntu26 with "no available installation candidate for
google-chrome-stable=149.0.7827.114-1": Google's apt repo only ever carries the
current stable, so any pinned build vanishes within weeks.

Two fixes:
- defaults: xfce_google_chrome_version "" (install latest google-chrome-stable);
  pin is opt-in and now safe (auto-falls back to latest when the pin is gone).
- browser.yml: the madison availability guard used POSIX [[:space:]], which
  Python re does not support, so it never matched ' | ' separators. Replace with
  \s — verified: empty->latest, pinned+available->pin, pinned+gone->latest.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-25 22:14:46 +08:00
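
The regex problem described in e174e8bcfa is easy to reproduce in plain Python. The sketch below assumes a sample line in the `apt-cache madison` column layout (package | version | source). Both patterns are illustrative and are not the exact expressions used in browser.yml.

```python
import re
import warnings

# Sample line in the madison layout; the source column is a placeholder.
line = "google-chrome-stable | 149.0.7827.114-1 | repo stable/main amd64 Packages"

# POSIX bracket class: Python's re reads "[[:space:]" as an ordinary set of the
# characters "[", ":", "s", "p", "a", "c", "e", followed by a literal "]".
posix_style = r"google-chrome-stable[[:space:]]*\|[[:space:]]*149\.0\.7827\.114-1"

# Python spelling of "any whitespace".
python_style = r"google-chrome-stable\s*\|\s*149\.0\.7827\.114-1"


def matches(pattern: str, text: str) -> bool:
    with warnings.catch_warnings():
        # Python warns about "[[" as a possible nested set; silence it here.
        warnings.simplefilter("ignore")
        return re.search(pattern, text) is not None


print(matches(posix_style, line))   # the POSIX form never matches the separators
print(matches(python_style, line))  # the \s form does
```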
Haitao Pan
5aadb4f0dc fix(xfce): fall back when pinned chrome apt version is unavailable 2026-06-25 20:32:47 +08:00
Haitao Pan
c9919284e0 fix(bridge): avoid embedded templates in caddy assertion 2026-06-25 20:26:38 +08:00
Haitao Pan
5984a75643 fix(litellm): provision Python 3.13 via uv when system python >=3.14
litellm's pinned fork requires Python <3.14; Ubuntu 26.04 ships 3.14 with no
3.13/3.12 in apt, so the venv pip install fails ('requires a different Python').
When the bootstrap interpreter is >=3.14, install a standalone Python 3.13 via
uv, rebuild the venv with it, and proceed. Debian 13 (3.13) is unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 21:23:27 +08:00
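
The version gate in 5984a75643 amounts to a small decision rule. A minimal sketch, assuming the constraint is exactly "Python < 3.14" as the message states; the function and constant names are hypothetical, and the actual provisioning step (installing 3.13 via uv and rebuilding the venv) is not shown.

```python
from typing import Tuple

# Assumption taken from the commit message: the pinned fork needs Python < 3.14.
UNSUPPORTED_FROM: Tuple[int, int] = (3, 14)
FALLBACK_VERSION = "3.13"


def pick_interpreter_version(system_version: Tuple[int, int]) -> str:
    """Return the Python version the venv should be built with."""
    if system_version >= UNSUPPORTED_FROM:
        # System Python is too new: fall back to a standalone 3.13.
        return FALLBACK_VERSION
    return f"{system_version[0]}.{system_version[1]}"
```

Under this rule a host reporting `(3, 14)` gets `"3.13"`, while a host already on `(3, 13)` keeps its system interpreter.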
Haitao Pan
c7bc68a6dc fix(acp_server_opencode): robust curl-retry for ACP endpoint readiness
The uri probe ran 1s after the service (re)start while the adapter still accepts
TCP but doesn't yet answer (read hangs); uri's default 30s timeout + retries/until
did not actually loop on a connection timeout, so it failed after one attempt.
Replace with a curl retry loop (5s per attempt, up to ~30 tries) — the adapter
answers acp.capabilities in ~4ms once ready.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 21:21:37 +08:00
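
The retry behaviour described in c7bc68a6dc (a short per-attempt timeout, with a timeout counted as one failed attempt rather than a fatal error) might look like this in Python. It is a sketch of the pattern only: the role itself uses curl, and `wait_until_ready`, `http_probe` and their parameters are hypothetical names.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """True when the endpoint answers with a 2xx status within `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return 200 <= response.status < 300


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or the attempts run out.

    A probe that raises (for example on a read timeout) counts as a failed
    attempt instead of aborting the loop.
    """
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
        except Exception:
            pass  # timeout or connection error: try again
        if attempt < attempts:
            sleep(delay_seconds)
    return False
```

A caller would pass something like `lambda: http_probe(endpoint_url)` as the probe, where `endpoint_url` is whatever address the adapter listens on.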
Haitao Pan
609a88ddcf feat(bridge): fail fast when bridge domain is empty/non-FQDN under Caddy exposure
Non-empty pass-through check: xworkmate_bridge_domain feeds /etc/hostname and the
caddy site name; an empty/non-FQDN/127.0.0.1 value yields an invalid Caddyfile.
Assert a valid FQDN when caddy_enabled (public ingress), with a clear remediation
message (set XWORKMATE_BRIDGE_DOMAIN or provide CMDB service_domains).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:50:19 +08:00
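
The FQDN assertion in 609a88ddcf rejects empty values, bare hostnames and loopback addresses. One possible shape for such a check, written in Python as an assumption about the rules rather than a copy of the role's assert:

```python
import ipaddress
import re

# One DNS label: 1-63 letters, digits or hyphens, no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_valid_fqdn(value: str) -> bool:
    """Rough FQDN check: at least two DNS labels, not an IP, not localhost."""
    name = value.strip().rstrip(".")
    if not name or len(name) > 253 or name.lower() == "localhost":
        return False
    try:
        ipaddress.ip_address(name)
        return False  # literal IPs such as 127.0.0.1 are rejected
    except ValueError:
        pass
    labels = name.split(".")
    return len(labels) >= 2 and all(_LABEL.match(label) for label in labels)
```

Here `is_valid_fqdn("bridge.example.com")` passes, while `""`, `"localhost"` and `"127.0.0.1"` are all rejected.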
Haitao Pan
40b7975061 fix(common): install fail2ban via apt on Debian so module_defaults lock_timeout renders
Same class as bridge/litellm: ansible.builtin.package dispatched to apt inherits
the play's templated module_defaults.apt.lock_timeout un-rendered -> int conversion
error -> on-host bootstrap aborts before litellm/qmd. Use apt on Debian, keep
package for non-Debian (yum/dnf doesn't inherit the apt default).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 16:05:40 +08:00
Haitao Pan
3709074916 feat(bridge): set host FQDN + caddy site from XWORKMATE_BRIDGE_DOMAIN or CMDB service_domains
- xworkmate_bridge_domain falls back to the first CMDB service_domains entry
  (inventory hostvar / pipeline-injected env) before ai_workspace_public_domain.
- New task sets the host's /etc/hostname (and running hostname) to that FQDN on
  Linux when it's a valid FQDN — never 127.0.0.1/localhost. The caddy site
  (xworkmate-bridge-site.caddy.j2) already uses the same var.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 15:56:30 +08:00
Haitao Pan
c3a0e40566 fix(bridge,litellm): use apt on Debian so module_defaults lock_timeout renders
The runtime plays set module_defaults.apt.lock_timeout to a templated value.
When a prerequisite task uses ansible.builtin.package (which dispatches to apt
on Debian), that templated default is NOT rendered and the literal
'{{ ai_workspace_apt_lock_timeout | default(900) | int }}' reaches apt ->
'lock_timeout is of type str ... cannot be converted to an int' -> the whole
on-host bootstrap aborts at the xworkmate-bridge prereq, before litellm/qmd
ever deploy (hence they were never up).

Fix: install prereqs via ansible.builtin.apt on Debian/Ubuntu (template renders
like every other apt task); keep ansible.builtin.package for non-Debian Linux
(dispatches to yum/dnf, which doesn't inherit the apt default).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 15:54:09 +08:00
Haitao Pan
c3f3b8ac8e refactor(agent_skills): run on target host, git-clone sources, drop delegate_to localhost
Make the role work identically under both execution models:
- local/pull (curl|bash -> ansible-playbook -c local; localhost == host)
- remote controller (ansible-playbook -i inventory over ssh; tasks run on host)

Changes:
- Remove ALL delegate_to: localhost (the old raw 'command: rsync' detected
  local-vs-remote via ansible_connection, but delegate_to localhost forced it
  to 'local', so the user@host push branch was dead code -> remote runs wrote
  to the controller's /root and failed).
- Acquire xworkspace-core-skills via ansible.builtin.git clone ON THE HOST
  (most universal/cross-platform), instead of requiring a controller-side dir.
- Merge core skills into the canonical dir with ansible.builtin.copy
  (remote_src, host-local) instead of raw rsync; installer adapters install
  directly into the canonical dir on the host.
- Drop rsync-only vars/excludes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 14:57:49 +08:00
Haitao Pan
2ef144d572 fix(console): serve dashboard/dist via local python http.server (not npm/caddy)
Prebuilt runtime ships only dashboard/dist (no package.json) so npm run
preview ENOENT-crash-loops (254). console is a local-only static backend on
127.0.0.1:17000 (dashboard is a routerless SPA); serve it with python3
-m http.server on both Linux (console.service) and macOS (console.plist) —
no second caddy (avoids clashing with the system caddy on :80; console is
local-only and not proxied by default). Gate the apt caddy install on
caddy_enabled (true on public-IP Linux VPS for the bridge ingress; macOS
installs no caddy).

Verified: debian13 + ubuntu26.04 console.service active serving 17000=200;
macOS python3 serves the same dist locally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 09:44:01 +08:00
Haitao Pan
3505ff1c31 fix(ai-workspace): deploy robustness on Debian13/Ubuntu26.04 (py3.13)
- setup-xworkspace-console.yaml:
  - xworkspace_console_user follows ansible_env.USER (was hardcoded ubuntu;
    mismatched home=/root on root connections -> systemd link 'src does not exist')
  - runtime apt task async/poll (xfce4 desktop install dropped the SSH session)
  - api_dir -> bin/ to match prebuilt runtime manifest (apiBinary: bin/xworkspace-api;
    was api/ -> 203/EXEC crash loop)
- roles/ai_agent_runtime/tasks/{main,docs,fonts,browser}.yml: apt lock_timeout
  (texlive/pandoc raced cloud-init/unattended-upgrades for the dpkg lock)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 03:02:43 +08:00
Haitao Pan
a5e19eff60 chore: qmd version bump, macOS container runtime deps, ignore inventory pycache
- roles/vhosts/common: add docker/docker-compose/colima to macOS brew deps
  (headless container runtime for qmd PG memory-bridge tests)
- roles/vhosts/qmd: bump qmd_version
- .gitignore: ignore inventory/__pycache__/

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:01:57 +08:00
Haitao Pan
df48cb4f5a feat(inventory): add Terraform CMDB dynamic inventory for ai-workspace
Reads cmdb.json produced by iac_modules vultr-vps/envs/ai-workspace
generate.py and exposes hosts/groups/hostvars to Ansible, linking IaC
provisioning to playbook deploys (terraform_cmdb.py).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 20:57:58 +08:00
Haitao Pan
099a144a9e fix(xworkmate-bridge): define missing xworkmate_bridge_caddy_base_dir
xworkmate_bridge_obsolete_caddy_fragment_paths references
xworkmate_bridge_caddy_base_dir, but the var was never defined, so the
'Inspect deprecated ACP Caddy fragment' task aborted with
'xworkmate_bridge_caddy_base_dir is undefined'. Define it from the global
caddy_config_dir (consistent with the role's other caddy paths), which is
already OS-aware (/etc/caddy on Linux, Homebrew prefix on macOS).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 14:42:20 +08:00
Haitao Pan
f5a5979439 fix(acp-gemini): create runtime dirs so service WorkingDirectory exists
acp-gemini.service sets WorkingDirectory={{ acp_gemini_workdir }} (~/.gemini)
but the role never created it, so systemd failed at step CHDIR (status
200/CHDIR), the adapter never bound 127.0.0.1:8791, and the CORS preflight
validation failed after 30 retries. Mirror the opencode role: pre-create the
home, .gemini workdir, XDG config and state dirs owned by the service user.
Linux/Debian only (guarded != Darwin); macOS uses the launchd path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 14:31:38 +08:00
Haitao Pan
e5fc29fa8a fix(console): download runtime from deterministic latest-runtime tag
The online runtime download used releases/latest/download, which GitHub
resolves to whichever release holds the 'Latest' flag. The console repo also
publishes offline-ai-workspace-* build releases that take that flag and carry
no console runtime asset -> HTTP 404 on the online/Debian path. Point at the
stable latest-runtime release (published by the console-runtime workflow) and
add a bounded download retry. The env-provided archive path still wins via the
existing when-guard, so offline/bundled installs are unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 14:19:03 +08:00
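
The difference between the two download URL forms in e5fc29fa8a can be shown with two small builders. The GitHub URL layouts are the standard ones; the owner, repository and asset values in the usage note are placeholders, not the project's real names.

```python
def latest_flag_url(owner: str, repo: str, asset: str) -> str:
    """URL that follows whichever release currently holds the 'Latest' flag."""
    return f"https://github.com/{owner}/{repo}/releases/latest/download/{asset}"


def tagged_url(owner: str, repo: str, tag: str, asset: str) -> str:
    """URL pinned to one release tag, independent of the 'Latest' flag."""
    return f"https://github.com/{owner}/{repo}/releases/download/{tag}/{asset}"
```

For example, `tagged_url("example-org", "console", "latest-runtime", "runtime.tar.gz")` always points at the `latest-runtime` release, so an unrelated release taking the 'Latest' flag cannot turn the download into a 404.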
Haitao Pan
9e81f65a62 fix(openclaw): pull multi-session plugin runtime from deterministic runtime-latest asset
The download used releases/latest/download, which GitHub resolves to the
human-facing v0.1.12 tag (no runtime asset) -> HTTP 404, failing the deploy
on Ubuntu 26.04 (and any platform). Point at the stable runtime-latest
release published by the plugin repo's runtime-release workflow, and add a
bounded retry around the download for transient network errors.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 14:03:23 +08:00
e0bfc765bf feat(litellm): make model registration idempotent via fallback to /model/update 2026-06-23 13:43:42 +08:00
4e183d2d44 fix(litellm): resolve os.environ variables locally before registering models to DB 2026-06-23 13:27:09 +08:00
28df3b59d6 feat(openclaw): conditionally render default UI models and providers based on active API keys 2026-06-23 13:09:56 +08:00
a0d59c0af1 feat(openclaw): adopt native provider simulation pointing to litellm gateway 2026-06-23 12:42:04 +08:00
25b8204b7b fix(openclaw): use hyphens for litellm models to prevent provider intercept 2026-06-23 12:21:45 +08:00
6e260a3425 feat(litellm): ensure deepseek-chat and deepseek-reasoner are registered 2026-06-23 12:18:24 +08:00
e7c96675ff feat(litellm): update model registrations and gateway configurations with API key gating 2026-06-23 11:04:21 +08:00
Haitao Pan
01f1499a60 feat(ai-workspace): consume prebuilt console runtime for final deployment
The macOS console API previously ran via `go run .`, which fails under
launchd's minimal PATH (no `go`) and recompiles on every launch. Switch to
the same prebuilt-runtime consumption model the bridge/qmd/litellm runtimes
already use.

The ai-workspace role now does final deployment only (never builds):
- download xworkspace-console-runtime-<os>-<arch>.tar.gz (incl. darwin-arm64)
  from the latest-runtime release, or use an offline-staged archive via
  XWORKSPACE_CONSOLE_RUNTIME_ARCHIVE;
- unpack to a per-user system dir (~/.local/share/xworkspace-console),
  idempotent via a sha256 marker;
- read manifest.json to resolve the prebuilt API binary and assert it is a
  present, executable native binary;
- on macOS, deploy a LaunchAgent that sources portal.env and execs the
  prebuilt binary directly — no go, no Homebrew, no PATH games.

The Go API is pure-Go (no cgo), so CI cross-compiles darwin-arm64 cleanly;
this role only consumes that artifact. Validated end-to-end on darwin-arm64:
packaged binary serves :8788 (200 with token, 401 without) under launchd.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 17:04:55 +08:00
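
The "idempotent via a sha256 marker" step in 01f1499a60 could be approximated as below. This is a sketch under assumptions: the marker file name `.runtime.sha256` and the function names are hypothetical, and the archive is assumed to come from a trusted release.

```python
import hashlib
import tarfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the file in 1 MiB chunks so large archives are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unpack_if_changed(archive: Path, dest: Path,
                      marker_name: str = ".runtime.sha256") -> bool:
    """Unpack `archive` into `dest` unless the marker already records its hash.

    Returns True when an unpack happened, False when it was skipped.
    """
    checksum = sha256_of(archive)
    marker = dest / marker_name
    if marker.is_file() and marker.read_text().strip() == checksum:
        return False  # same archive as last time: nothing to do
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as bundle:
        bundle.extractall(dest)  # trusted archive assumed
    marker.write_text(checksum + "\n")
    return True
```

Running it twice against the same archive unpacks once and then reports no change, which is the property a re-run of the role relies on.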
Haitao Pan
a5850cfcee fix(acp_server_gemini): revert incompatible adapter command syntax and update args for antigravity-cli 2026-06-22 13:59:52 +08:00
Haitao Pan
2a85be5c9b fix(xworkmate_bridge): remove obsolete IMAGE variable causing undefined errors 2026-06-22 13:55:14 +08:00
Haitao Pan
32e00a8617 fix(litellm,validation): refine model registration and add cross-platform service validation 2026-06-22 13:52:05 +08:00
Haitao Pan
0ac424f00e Merge branch 'xworkspace-portal-dashboard-17000'
# Conflicts:
#	setup-xworkspace-console.yaml
2026-06-22 13:27:37 +08:00
Haitao Pan
1b2aea005a Merge branch 'refactor/upgrade-antigravity-cli'
# Conflicts:
#	roles/vhosts/acp_server_gemini/defaults/main.yml
#	roles/vhosts/acp_server_gemini/templates/gemini.plist.j2
2026-06-22 13:26:30 +08:00
Haitao Pan
93a3067ea4 Merge branch 'codex/openclaw-playbook-concurrency'
# Conflicts:
#	roles/vhosts/gateway_openclaw/templates/openclaw.json.j2
#	roles/vhosts/xworkmate_bridge/defaults/main.yml
2026-06-22 13:25:45 +08:00
Haitao Pan
9926a46f76 fix(litellm): percent-encode DB password in DATABASE_URL
LiteLLM crash-looped on macOS with Prisma `P1013: invalid port number in
database URL`. The shared auth token is generated by `openssl rand -base64`
and can contain '/', '+' or '='; injected raw into the DATABASE_URL
userinfo, a '/' truncates the authority so the port parses as invalid and
proxy startup fails (port 4000 never binds).

Percent-encode the password for the DATABASE_URL only, via an explicit
reserved-set replace chain ('%' first to avoid double-encoding) since
Jinja's urlencode leaves '/' unescaped. The DB user password stays raw in
provision-database and LITELLM_DB_PASSWORD, and the URL form decodes back
to the identical secret (verified round-trip), so authentication is
unchanged. No effect when no DB host is configured.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 12:56:56 +08:00
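
The encoding issue in 9926a46f76 has a direct analogue in Python, where `urllib.parse.quote` also leaves '/' unescaped by default and needs `safe=""` to encode it. The sketch below shows the round trip; the user, host, port and database values are placeholders, and the role itself uses a Jinja replace chain rather than this function.

```python
from urllib.parse import quote, unquote, urlsplit


def build_database_url(user: str, password: str, host: str,
                       port: int, database: str) -> str:
    """Build a PostgreSQL URL with the password percent-encoded for userinfo."""
    encoded = quote(password, safe="")  # safe="" so that "/" is encoded too
    return f"postgresql://{user}:{encoded}@{host}:{port}/{database}"


# A base64-style secret containing all three problem characters.
secret = "ab/cd+ef=="
url = build_database_url("litellm", secret, "127.0.0.1", 5432, "litellm")
parts = urlsplit(url)

print(url)
print(parts.port)                        # port parses cleanly once "/" is encoded
print(unquote(parts.password) == secret)  # decodes back to the original secret
```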
Haitao Pan
ef67c61cf7 fix(xfce): skip Linux XFCE/XRDP desktop stack on macOS
The all-in-one flow reached "Update apt cache" in the
xfce_desktop_minimal_runtime role on macOS and failed with
`[Errno 2] No such file or directory: b'update'` (no apt on Darwin).

XFCE + XRDP is a Linux remote-desktop stack and is meaningless on macOS,
which already has a native GUI. Guard both role includes in
setup-xfce-xrdp.yaml with `ansible_os_family != 'Darwin'` so the apt/systemd
tasks never run there. Linux behavior is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 12:46:31 +08:00