46 KiB
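Gateway Turn Stability and Robustness | Full-Chain Analysis and Coding Improvement Plan
Applies to: xworkmate-app/docs/cases/
Related repositories (/Users/shenlan/workspaces/):
- ai-workspace-infra/ (Caddy ingress, OpenClaw gateway deployment, playbooks)
- ai-workspace-lab/ (xworkmate-app Flutter client, xworkmate-bridge Go bridge)
- ai-workspace-service/ (side services such as accounts / console / litellm)
Trigger symptom: the task progress bar shows 任务运行中... ("Task running...") forever and Stop has no effect, although the OpenClaw gateway has actually finished executing. It comes with the error 「Bridge 响应读取中断,本轮结果未完成。错误码:ACP_HTTP_CONNECTION_CLOSED」 ("Bridge response read was interrupted; this turn's result is incomplete. Error code: ACP_HTTP_CONNECTION_CLOSED").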
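Conclusions at a glance (after the 2026-06-26 live verification)
- The primary root cause of 「采集AI资讯无法产出」 ("the AI-news collection task produces no output") is that the OpenClaw gateway started without loading the openclaw-multi-session-plugins plugin (the plugin files lived in a temporary path and the gateway started before they were in place), so every xworkmate.* gateway method returned unknown method. Restarting the gateway restores service (see §4, "2026-06-26 decisive root cause").
- T1–T9 in §3/§5 are robustness hardening (ingress timeouts, client deadline/stop, bridge persistent run store). They turn the failure above into something "bounded and recoverable" instead of "runs forever / loses the result". They are correct and should stay, but they are not the root-cause fix for that failure.
- ⚠️ Retracted: the earlier judgment that "the xworkmate.* protocol namespace drifted and the bridge must switch to the native tasks.get" was wrong. Do not merge the fix/gateway-task-protocol-alignment branch.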
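1. The full chain (one gateway turn): four layers, including openclaw-multi-session-plugins
One gateway turn crosses four layers: App → bridge (Go) → openclaw-multi-session-plugins (TS plugin) → OpenClaw gateway runtime.
The plugin is the key middle layer: it adds logically isolated artifact scopes, session mapping, and task snapshots for multi-session / multi-thread use, and exposes them to the bridge as xworkmate.* gateway methods.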
App.executeTask ──SSE POST /acp/rpc──▶ Bridge.handleRequest
│ │
│ ① xworkmate.session.prepare ───────▶ Gateway ──▶ [plugin] recordXWorkmateSessionMapping
│ (creates the appThreadKey⇄openclawSessionKey mapping + prepareXWorkmateArtifacts)
│ ◀── { artifactScope: tasks/<sani(sessionKey)>/<runId>, artifactDirectory, expectedArtifactDirs }
│ ② chat.send (gateway native) ────────▶ Gateway ──▶ dispatches the agent run, returns runId (~25ms, detached)
│ ◀── { runId, status: started }
▼
Bridge records sess.openClaw + DeadlineAt (budget by taskLoadClass) and returns a running handle
│
▼ App persists the association; pollOpenClawTaskAssociationInternal polls:
③ xworkmate.tasks.get(runId/openclawSessionKey/artifactScope) ─▶ Gateway ─▶ [plugin] getXWorkmateTaskSnapshot
├─ resolveNativeTask(host task registry by runId) ──found──▶ returns status (running/completed/failed) + artifacts via export
└─ no native task ──▶ exportArtifactsForTaskLookup fallback: scans the scope directory + expectedArtifactDirs (reports/ and artifacts/ under the workspace root)
├─ artifacts found ─▶ status=unknown, evidence=artifacts_present, with artifacts
└─ no artifacts ─▶ code=no_native_task_record / task_not_found
④ After a terminal state, fetch artifacts: xworkmate.artifacts.export / .list / .read (plugin exportXWorkmateArtifacts)
scopeRoot = workspaceRoot/tasks/<sani(sessionKey)>/<runId>; constraintSatisfied is decided by requiredArtifactExtensions (.md)
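Core of multi-session / multi-thread isolation (provided by the plugin):
- artifactScopeFor(sessionKey, runId) = tasks/<sanitize(sessionKey)>/<runId> (colons become underscores, e.g. agent:main:draft:e2e → agent_main_draft_e2e). Each (session, run) pair gets its own directory, so runs do not interfere with each other (exportArtifacts.ts:126/164).
- recordXWorkmateSessionMapping persists appThreadKey ⇄ openclawSessionKey as a session extension (taskState.ts), so a run can still be found by appThreadKey across connections and reconnects.
- exportXWorkmateArtifacts strictly checks requestedArtifactScope === artifactScopeFor(sessionKey, runId) and throws artifactScope does not match sessionKey/runId on mismatch. Callers must pass a consistent sessionKey/runId/artifactScope triple (the bridge's taskGetParamsWithSessionScope fills it in from the session record; hand-written external probes easily fall into this trap).
- Workspace-root fallback scan: agents often write artifacts to reports/ or artifacts/ under the workspace root rather than to the task scope; the plugin uses expectedArtifactDirs to scan these directories and include what it finds (see openclaw-gateway-e2e-regression/ROOT_CAUSE_ANALYSIS.md, Fix 0).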
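The scope rule and the triple check described above can be sketched in a few lines. This is a minimal illustration, not the plugin's TypeScript implementation: the function names are Python stand-ins, and only the colon-to-underscore rule is taken from the text (any other sanitization the plugin performs is unknown and omitted here).

```python
def sanitize_session_key(session_key: str) -> str:
    # Only the colon -> underscore rule is documented; other characters
    # are passed through unchanged (assumption).
    return session_key.replace(":", "_")


def artifact_scope_for(session_key: str, run_id: str) -> str:
    # One directory per (session, run) pair.
    return f"tasks/{sanitize_session_key(session_key)}/{run_id}"


def check_requested_scope(session_key: str, run_id: str, requested_scope: str) -> str:
    # Mirrors the strict equality check: the caller must pass a consistent triple.
    expected = artifact_scope_for(session_key, run_id)
    if requested_scope != expected:
        raise ValueError("artifactScope does not match sessionKey/runId")
    return expected


if __name__ == "__main__":
    print(artifact_scope_for("agent:main:draft:e2e", "run-1"))
```

Because the scope is derived purely from sessionKey and runId, a probe that changes either value without recomputing artifactScope will always fail the check.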
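Live measurement (2026-06-26, 8787, plugin loaded): ① session.prepare returned a real mapping ✓; ② chat.send returned runId=turn-…438450000 ✓ (bridge log request_timing method=chat.send); ③ tasks.get, served by the plugin, returned code=no_native_task_record (the mapping exists, but the gateway has no native task record for this run), and the scope directory was empty with no news.md. In other words, the chain works through to the plugin layer, but in this turn the agent neither registered a queryable task nor wrote any artifact (a layer-4 agent execution / write-to-disk problem; see "Residual and hardening items" in §4 and §7).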
Key code anchors:
| Step | Location |
|---|---|
| App starts the SSE turn / polls | xworkmate-app/.../app_controller_desktop_thread_actions.dart:644, :747 |
| Bridge session.prepare / tasks.get / cancel | xworkmate-bridge/internal/acp/rpc_handler.go:96, :126, taskGetParamsWithSessionScope:177 |
| Bridge↔Gateway WebSocket (dial 18789) | xworkmate-bridge/internal/gatewayruntime/runtime.go:372 |
| Plugin registers the xworkmate.* gateway methods | openclaw-multi-session-plugins/index.ts:126/162/176/206/221 (session.prepare/tasks.get/artifacts.export/.list/.read) |
| Plugin artifact scope / session mapping / task snapshot | src/exportArtifacts.ts, src/taskState.ts:208 getXWorkmateTaskSnapshot |
2. End-to-end topology (taken from the code + verified live on 8787)
┌─ ai-workspace-lab/xworkmate-app (Flutter) executeTask → SSE POST Accept: text/event-stream
▼
┌─ ai-workspace-infra Caddy ingress xworkmate-bridge.svc.plus (local direct connections to 127.0.0.1:8787 bypass Caddy)
│ /acp* : flush_interval -1 ✓ read/write_timeout 70m (aligned with bridge 60min) keepalive 5m ← fixed by T1/T2
▼
┌─ ai-workspace-lab/xworkmate-bridge (Go) live commit 2333c3e
│ /acp/rpc handleRequest → session.prepare / chat.send / xworkmate.tasks.get / tasks.cancel
│ gatewayruntime: WebSocket → 127.0.0.1:18789; per-session persistent run store (T7/T8/T9)
▼ (WS, gateway-native chat.send + plugin-registered xworkmate.*)
┌─ ai-workspace-lab/openclaw-multi-session-plugins (TS plugin, enabled) ← multi-session / multi-thread artifact isolation layer
│ index.ts registerGatewayMethod: xworkmate.session.prepare / .tasks.get / .artifacts.export/.list/.read/.collect-and-snapshot
│ session mapping (taskState) + artifactScope=tasks/<sani(sessionKey)>/<runId> (exportArtifacts) + workspace-root fallback scan
▼ (the plugin runs inside the gateway process)
┌─ OpenClaw gateway runtime npm-global openclaw 2026.6.1 @ /opt/homebrew/lib/node_modules/openclaw
│ launchd ai.openclaw.gateway, WS 18789; host task registry (detached task) + agent execution (deepseek-v4-flash)
▼
workspace ~/.openclaw/workspace/tasks/<sani(sessionKey)>/<runId>/ → artifacts (.md etc.)
Full four-layer live evaluation (2026-06-27, 8787 commit 188ca4b): end-to-end PASS ✅
One gateway turn (sessionId=draft:eval-…, prompt 「总结今天AI新闻并保存到 summary.md」, i.e. "summarize today's AI news and save it to summary.md"), measured layer by layer:
| Layer | Hop | How verified | Measured result |
|---|---|---|---|
| L1 App→Bridge | session.start(target=gateway) | /acp/rpc | ✅ Returns a running handle: runId, openclawSessionKey=agent:main:draft:eval-…, artifactScope=tasks/agent_main_draft_eval-…/turn-…, budget=30min (long_task) |
| L2 Bridge→Plugin | xworkmate.session.prepare | bridge log + payload | ✅ Plugin runs recordXWorkmateSessionMapping + prepareXWorkmateArtifacts; gateway logs res ✓ xworkmate.session.prepare 569ms |
| L2→3 Bridge→Gateway | chat.send (native) | bridge/gateway log | ✅ Dispatches the agent, runId=turn-1782533120263890000 |
| L3 Plugin task snapshot | xworkmate.tasks.get | /acp/rpc | ✅ getXWorkmateTaskSnapshot → status=completed, success=True, constraintSatisfied=True, hasMapping=True, no degraded flag |
| L3↔4 Plugin↔Gateway | the whole xworkmate.* family | gateway [ws] ⇄ res | ✅ session.prepare ✓ / artifacts.export ✓ 70ms (×2) / tasks.get ✓ 63ms |
| L4 Gateway | 6 plugins + agent execution | gateway startup log | ✅ listening (6 plugins: … openclaw-multi-session-plugins); the agent (deepseek-v4-flash) produced output |
| Artifacts | written to disk + retrieved | tasks.get artifacts + fs | ✅ summary.md text/markdown 438B lands in ~/.openclaw/workspace/tasks/agent_main_draft_eval-…/turn-…/ and comes back in the artifacts[] of tasks.get via xworkmate.artifacts.export |
Conclusion: once S0 installed the plugin from a stable path, one gateway turn works across the whole chain: the task reaches completed, the md file lands in the task scope, constraintSatisfied=True, and the artifact can be retrieved through tasks.get. All three T12 metrics are 0 (this turn needed no resilience fallback). The earlier no_native_task_record and 「暂无文件」 ("no files yet") were both derived symptoms of the plugin not being loaded (the symlink root cause in §4) and are now gone.
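Side services (ai-workspace-service): accounts.svc.plus (login / token), console.svc.plus (openclaw assistant route), litellm / AI-Relay-Kit / codex-relay (model egress), qmd (memory). They are not on the main chain, but token or egress faults surface as "connection interrupted"; use runId to tell them apart.
Sources: Caddy …/xworkmate_bridge/templates/xworkmate-bridge-site.caddy.j2; gateway deployment deploy_gateway_openclaw.yml; plugin ~/.openclaw/extensions/openclaw-multi-session-plugins (the registered source currently points to the temporary /private/tmp/…; see the hardening items in §4).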
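3. Cross-layer failure points (top to bottom along the chain)
| # | Layer | Failure | Evidence |
|---|---|---|---|
| ① | App | The running polling branch has no deadline and no maxAttempts; as long as tasks.get keeps returning running, it polls forever | app_controller_desktop_thread_actions.dart:788-794 (compare the artifact-sync branch at :831, which is capped by openClawArtifactSyncLimitReachedInternal) |
| ② | App | The progress bar phase is driven only by pending, and pending is held exclusively by that loop; if the loop never exits, the bar shows 任务运行中... forever | assistant_task_progress_bar.dart:190; pending source = aiGatewayPendingSessionKeysInternal |
| ③ | App | Stop (abortRun → cancelAssistantTaskForSessionInternal) only sends tasks.cancel to the gateway and never clears pending locally; it is a no-op when the gateway is unreachable or the run is lost → the task can never be stopped | app_controller_desktop_thread_actions.dart:1793-1809 |
| ④ | Ingress / Bridge timeout mismatch | Caddy /acp* read/write_timeout is 30m, but bridge openClawAgentWaitMaxTimeout = 60min. For a complex_chain_task longer than 30min, the SSE stream is cut at the ingress while the gateway is still running → ACP_HTTP_CONNECTION_CLOSED | caddy.j2 read_timeout 30m vs orchestrator.go:32 |
| ⑤ | Ingress | Only /acp* has flush_interval -1 and long timeouts; /api* and / use Caddy defaults (short timeouts, no immediate flush), so any streaming or polling over those two routes is more fragile | caddy.j2 |
| ⑥ | Bridge↔Gateway WS correlation is bound to the connection | On any WS blip, onConnLost immediately fails every in-flight request with SOCKET_CLOSED, including the run that OpenClaw is still executing; after reconnect no pending request is re-correlated | gatewayruntime/runtime.go:802-823, :935-938 (takePendingLocked clears the map) |
| ⑦ | Bridge has no persistent run store | The synchronous SSE path does not write to xworkmate.jobs.*; the result lives only in the in-memory responseCh and is lost as soon as the connection drops (the existing jobs.submit/get/list are not reused by gateway submit) | rpc_handler.go:73 (jobs.*) vs the synchronous path in http_handler |
| ⑧ | Gateway reconnect loses run state | After reconnecting to a new WS, tasks.get goes through ensureProductionGatewayConnected onto the new connection and cannot find the terminal state → returns a stale running or not_found | rpc_handler.go:143-175 |
| ⑨ | Bridge | gatewayRPCError maps all retryable errors to OPENCLAW_GATEWAY_SOCKET_CLOSED but lacks the semantics "the run is still running in the background and can be queried later", so the client can only treat it as a hard failure | orchestrator.go:1678-1702 |
| ⑩ | Runtime ≠ source (primary root cause, see §4) | When unknown method: xworkmate.* is reported, the root cause is usually that the gateway did not load the openclaw-multi-session-plugins plugin (which registers these methods); the secondary cause is that the local launchd bridge binary is an old build | Whether N plugins in the gateway startup log includes openclaw-multi-session-plugins; openclaw plugins inspect; the commit from curl /api/ping |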
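4. Root cause chain
"The gateway has actually finished, but the task can never be stopped" =
The server loses the finished result (⑥ + ⑦ + ⑧): the bridge↔gateway request correlation is destroyed together with the WS connection and has no persistent copy, so the run-completed event is delivered to an abandoned channel and cannot be re-correlated after reconnect.
+ The client freezes "running forever" in place (① + ② + ③): polling has no deadline, the progress bar only looks at pending, and Stop does not take effect locally.
+ Trigger (④ / ⑤): long tasks are always cut at the 30min ingress limit, which pushes the chain into the failures above (especially complex_chain_task with budget=60min).
Minimal reproduction path:
- Submit a gateway task with budget > 30min, or one likely to hit network jitter;
- The SSE stream breaks at the ingress (30min) or at the WS (transient drop) → the App receives ACP_HTTP_CONNECTION_CLOSED;
- OpenClaw keeps running in the background and finishes, but the bridge has already lost the pending entry for the run and has no persistent record;
- The App falls into running polling (or stays pending after the failure), and tasks.get never returns a terminal state;
- The progress bar shows 任务运行中... forever and Stop does not take effect locally → the task is stuck.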
2026-06-26 local joint debugging conclusions
On-site environment:
- App: Version 1.1.4, app build commit fb7e0ac.
- Local Bridge: http://127.0.0.1:8787, launchd service plus.svc.xworkspace.bridge.
- OpenClaw gateway console: http://127.0.0.1:18789/channels.
Actual reproduction:
curl -sS -X POST http://127.0.0.1:8787/acp/rpc \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer $BRIDGE_AUTH_TOKEN" \
--data '{"jsonrpc":"2.0","id":"probe","method":"xworkmate.session.prepare","params":{"openclawSessionKey":"probe-session","runId":"probe-run","workspaceDir":"/tmp/xworkmate-probe","gatewayProviderId":"openclaw"}}'
Response before the fix:
{"error":{"code":-32601,"message":"unknown method: xworkmate.session.prepare"},"ok":false}
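Conclusion: this is neither a page problem on the OpenClaw gateway console at 18789 nor a display misjudgment in the App. On /acp/rpc, acp.capabilities works while xworkmate.session.prepare is not recognized, which shows that the running local Bridge had not loaded the new binary containing handleSessionPrepare. The source (rpc_handler.go and orchestrator.go) already has the fallback, but /usr/local/bin/xworkmate-go-core was still an old build without metadata.
On-site handling:
- Rebuild xworkmate-go-core from the current xworkmate-bridge source.
- Back up and replace /usr/local/bin/xworkmate-go-core.
- On macOS, remove com.apple.provenance / quarantine and run codesign --force --sign -; otherwise launchd refuses to start the binary with OS_REASON_CODESIGNING.
- Reload plus.svc.xworkspace.bridge.
Probe after the fix: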
{
"ok": true,
"payload": {
"fallback": true,
"compatibilityMode": "local-session-prepare",
"artifactScope": "tasks/probe-session/probe-run",
"artifactDirectory": "/tmp/xworkmate-probe/tasks/probe-session/probe-run"
}
}
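Regression requirement: after every local Bridge replacement, first confirm the commit with /api/ping, then confirm that the xworkmate.session.prepare probe above returns ok:true. Do not rely only on Status: ok on the App settings page.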
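The three probe outcomes that appear in this document (unknown method, bridge-side fallback, real plugin response) can be told apart mechanically. The sketch below classifies an already-decoded /acp/rpc response; the function name and the returned labels are hypothetical, and the field names are only those visible in the responses quoted in this document.

```python
def classify_session_prepare(resp: dict) -> str:
    """Classify a decoded xworkmate.session.prepare response.

    Labels are illustrative, not part of any real API:
      "method-not-registered": the runtime does not know the method
      "bridge-fallback":       the bridge answered with local-session-prepare
      "plugin":                ok:true without the fallback flag
      "error":                 any other failure
    """
    error = resp.get("error") or {}
    if "unknown method" in str(error.get("message", "")):
        return "method-not-registered"
    if not resp.get("ok"):
        return "error"
    payload = resp.get("payload") or {}
    if payload.get("fallback"):
        return "bridge-fallback"
    return "plugin"


if __name__ == "__main__":
    before_fix = {
        "error": {"code": -32601, "message": "unknown method: xworkmate.session.prepare"},
        "ok": False,
    }
    print(classify_session_prepare(before_fix))
```

A regression check would treat only "plugin" as fully healthy; "bridge-fallback" means the bridge binary is current but the gateway plugin path still needs to be verified.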
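2026-06-26 decisive root cause (verified live): the OpenClaw gateway did not load the openclaw-multi-session-plugins plugin
Four-layer call chain (consistent with openclaw-gateway-e2e-regression/ROOT_CAUSE_ANALYSIS.md): App → xworkmate-bridge (Go) → openclaw-multi-session-plugins (TS plugin) → OpenClaw gateway runtime (127.0.0.1:18789)
The xworkmate.* gateway methods are neither "invented" nor "drifted": the openclaw-multi-session-plugins plugin registers them at runtime.
index.ts calls api.registerGatewayMethod("xworkmate.session.prepare" | "xworkmate.tasks.get" | "xworkmate.artifacts.export" | ".list" | ".read" | ".collect-and-snapshot"). So the bridge sending xworkmate.tasks.get is the correct contract, provided the gateway has loaded the plugin.
The actual on-site failure: the running gateway had not loaded this plugin, so every xworkmate.* call returned unknown method. Evidence chain:
- Gateway startup log: before 2026-06-23 every start showed listening (6 plugins: … openclaw-multi-session-plugins); the start at 2026-06-26 09:21:14 showed only 5 plugins, with multi-session missing.
- The plugin's registered source path is the temporary directory /private/tmp/openclaw-multi-session-plugins/dist/index.js; that file was not populated until 18:40, about 9 hours after the gateway started at 09:21. The path did not exist at startup → the plugin was not loaded.
- openclaw plugins inspect warns: loaded without install/load-path provenance; treat as untracked local code (local code with no install record).
Fix (verified live): launchctl kickstart -k gui/$UID/ai.openclaw.gateway restarts the gateway (the plugin files are in place by then) → the log changes to listening (6 plugins: … openclaw-multi-session-plugins), and the errors from xworkmate.* methods change from unknown method to the plugin's own parameter validation (xworkmate.session.prepare → appThreadKey required, xworkmate.tasks.get → artifactScope does not match sessionKey/runId). This proves the methods are registered and the plugin is handling requests.
⚠️ Correction: an earlier version of this document concluded that "the bridge↔gateway xworkmate.* protocol namespace drifted, and the bridge must switch to the native tasks.get / artifacts.*". That conclusion was wrong; it came from not yet knowing that the openclaw-multi-session-plugins plugin provides these methods. The corresponding fix/gateway-task-protocol-alignment branch (the native rename) is void and must not be merged; the bridge's original xworkmate.* protocol is correct.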
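Residual and hardening items
- The plugin install path must be stable: it is currently registered under /private/tmp/… (lost on reboot or tmp cleanup, which is exactly what triggered this failure). Reinstall with openclaw plugins install <stable-path> from ~/.openclaw/extensions/openclaw-multi-session-plugins/ or the repository path so that a proper install record exists and "the gateway starts before the plugin is in place" cannot recur.
- Session surface: the on-site console's …:dashboard:bcde1b0f… is a different surface from the submitted …:draft:…. The task is created under draft (requesterSessionKey); dashboard is only the console's own session view and is not a break point.
- Note for manual probes: when building tasks.get by hand, do not pre-prefix sessionId with agent:main:. The bridge adds that prefix again, producing the double prefix agent:main:agent:main:…, which triggers the plugin's artifactScope does not match sessionKey/runId. The normal app path passes draft:<id> and the bridge adds the prefix once.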
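The double-prefix pitfall for manual probes can be guarded against before the request is sent. The sketch below is a hypothetical probe-side helper, not bridge code: the bridge as described simply prepends the prefix, whereas this helper rejects an already-prefixed sessionId so that the mistake surfaces early.

```python
OPENCLAW_SESSION_PREFIX = "agent:main:"  # prefix the bridge adds, per the notes above


def openclaw_session_key(session_id: str) -> str:
    """Build the openclawSessionKey that the bridge would derive from a sessionId.

    The guard is an assumption added for probes; the bridge itself is described
    as adding the prefix unconditionally.
    """
    if session_id.startswith(OPENCLAW_SESSION_PREFIX):
        raise ValueError(
            "sessionId must not be pre-prefixed; the bridge adds 'agent:main:' itself"
        )
    return OPENCLAW_SESSION_PREFIX + session_id


if __name__ == "__main__":
    print(openclaw_session_key("draft:e2e"))
```

Passing draft:<id> and letting one component own the prefix keeps the sessionKey/runId/artifactScope triple consistent.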
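5. Coding improvement plan (Stability / Robustness TODO)
Ordering principle: first the stop-the-bleeding items that can ship the same day with zero protocol changes (L0 configuration + L2 client), then the server-side persistence that addresses the cause (L1), and finally cross-cutting observability (L3). Each item is annotated with: owning repository · change surface · acceptance criteria.
L0 Ingress configuration (ai-workspace-infra · config change only · can ship the same day)
- T1 Align the ingress and bridge timeout ceilings: ✅ implemented, with an assertion (playbooks). Every route in xworkmate-bridge-site.caddy.j2 renders read_timeout/write_timeout from the single variable xworkmate_bridge_acp_stream_timeout (defaults/main.yml:36 = 70m ≥ bridge openClawAgentWaitMaxTimeout 60min + margin), which removes ingress/bridge drift; tasks/validate.yml:39-41 asserts that the rendered result contains this timeout. Acceptance: the SSE stream of a complex_chain_task with budget=60min is never cut at the ingress. ✓
- T2 Converge / complete the streaming config for non-/acp routes: ✅ implemented (playbooks). The four handles /api*, /artifacts/*, /acp*, and / now all carry flush_interval -1 plus the same xworkmate_bridge_acp_stream_timeout (caddy.j2:16-60); no route falls back to Caddy's short default timeout. Acceptance: polling and streaming no longer depend on Caddy's short default timeout. ✓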
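L2 App client stop-the-bleeding (ai-workspace-lab/xworkmate-app · client only · zero protocol change)
- T3 Hard deadline for running polling: ✅ implemented (xworkmate-app). The running-handle branch is capped by openClawRunningPollDeadlineReachedInternal (thread_actions.dart:799) at the bridge-issued deadline + grace; when it is reached, the task lands in interrupted (a recoverable state, already supported by isRecoverableAssistantTaskStateInternal) and the loop exits. Acceptance: when the gateway keeps returning running, the client always terminates after the deadline and no longer polls forever. ✓
- T4 Make Stop locally authoritative: ✅ implemented (xworkmate-app). abortRun (thread_actions.dart:1819) first optimistically clears aiGatewayPendingSessionKeysInternal, sets the terminal state, and exits the loop, then sends tasks.cancel best-effort (code comment: 「其失败/挂起都不得阻塞或回滚本地终止」, "its failure or hang must never block or roll back the local termination"). UI termination does not depend on a gateway round trip. Acceptance: Stop still takes effect immediately when the gateway is unreachable. ✓
- T5 Downgrade transport interruptions to "continuing in background · reconnecting": the catch in pollOpenClawTaskAssociationInternal (thread_actions.dart). When the App↔bridge transport drops briefly during polling (ACP_HTTP_CONNECTION_CLOSED), the client does not fail hard and lose the result; it retries with a bound and keeps polling. While consecutive drops do not exceed kOpenClawPollTransientRetryLimit (=5), the task stays running and the next getTask is retried after 2s; each success resets the counter; only when the limit is exceeded does the task land in a terminal state. The bridge-side T7/T9 handle gateway-side jitter; this item covers only the App↔bridge hop. Trade-offs: no new "reconnecting" UI phase was introduced (to avoid changing the progress bar layout), so a task that stays "running" is the degraded state. It does not go through the full recovery in resumeOpenClawTaskAssociationsInternal (that is the thread-reload path) but retries in place with a bound, which is lower risk and cannot reconnect forever. Acceptance: up to 5 transient drops during polling recover automatically; if the bridge stays unreachable, the task lands in a terminal state after a bounded number of attempts.
- T6 Consistent pending cleanup on failure paths: ✅ implemented (xworkmate-app). The three terminal paths (failed / interrupted / cancelled) all deterministically clear aiGatewayPendingSessionKeysInternal (thread_actions.dart:716/903/932/1657-1658, runtime_helpers.dart:1032/1112), which removes the race "the error is already rendered but the task is still running". Acceptance: after any of the three terminal states, pending is always false. ✓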
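The combined effect of T3 and T5 on the polling loop can be sketched as follows. This is a simplified Python model, not the Dart implementation: poll_task, TransientTransportError, and the returned dictionaries are hypothetical, while the retry limit of 5, the 2s interval, the interrupted state, and the ACP_HTTP_CONNECTION_CLOSED code come from the items above.

```python
import time

TRANSIENT_RETRY_LIMIT = 5  # mirrors kOpenClawPollTransientRetryLimit


class TransientTransportError(Exception):
    """Stand-in for a brief App<->bridge transport drop."""


def poll_task(get_task, deadline_at, now=time.monotonic, sleep=time.sleep,
              poll_interval=2.0, retry_limit=TRANSIENT_RETRY_LIMIT):
    """Poll until a terminal snapshot, the deadline, or too many consecutive drops."""
    consecutive_drops = 0
    while True:
        # T3: hard deadline, even if the gateway keeps answering "running".
        if now() >= deadline_at:
            return {"status": "interrupted", "reason": "deadline"}
        try:
            snapshot = get_task()
        except TransientTransportError:
            # T5: bounded retry instead of failing on the first drop.
            consecutive_drops += 1
            if consecutive_drops > retry_limit:
                return {"status": "failed", "code": "ACP_HTTP_CONNECTION_CLOSED"}
            sleep(poll_interval)
            continue
        consecutive_drops = 0  # every success resets the counter
        if snapshot.get("status") != "running":
            return snapshot
        sleep(poll_interval)
```

Injecting now and sleep keeps the loop testable without real waiting, in the same spirit as the optional pollInterval parameter mentioned in §8.3.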
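L1 Bridge persistence (ai-workspace-lab/xworkmate-bridge · root-cause fix · changes the protocol / state surface)
✅ T7/T8/T9 implemented (verified locally at commit 2333c3e; make sure the running /api/ping.commit matches the release commit). Implementation trade-off: reuse the existing per-session persistent store (task / openClaw / lastResult inside s.sessions[sessionID], whose lifetime is independent of the bridge↔gateway WebSocket) and change tasks.get from "strictly depends on a gateway reply" to "falls back to the persistent run store first". New file internal/acp/openclaw_run_registry.go (+ _test.go); modified rpc_handler.go: handleTaskGet and orchestrator.go: startOpenClawGatewayTask.
- T7 Decouple run correlation from the WS connection: openClawTaskGetGatewayUnconfirmedFallback (openclaw_run_registry.go). When the gateway cannot confirm the run (unavailable / socket closed / not_found) but the run is still within budget, the bridge synthesizes a running handle by runId from the persistent session store so the client keeps polling across the transient jitter. Trade-off: the pending fail-fast logic in gatewayruntime.onConnLost (runtime.go:802) was not changed directly. After in-flight requests are failed, they bubble up to handleTaskGet as a gateway error, and the new fallback that keeps polling by runId until the deadline covers the same ground at far lower risk than rewriting the connection-layer pending correlation. If the WS drops during the initial chat.send there is no runId yet and nothing to re-correlate; the client simply resubmits. Acceptance: after a WS blip + reconnect, a completed run can still be found by tasks.get with terminal state + artifacts (see TestGatewayUnconfirmedFallbackWithinBudgetKeepsPolling).
- T8 Write terminal results to the persistent run store: cacheOpenClawTaskGetResultIfTerminal / cachedTerminalForRunLocked. After the gateway confirms a terminal state, the final client-facing shape (download URLs decorated, inline content stripped) is cached in sess.lastResult and replayed on later polls, so the result survives even if the gateway can no longer find it. It validates runId and resets ProgressTerminal when a new turn reuses the session, so a previous run's terminal state is never matched to a new run. Trade-off: no separate xworkmate.jobs.* storage was added; for now the per-session in-memory store is reused (it already satisfies "results survive WS jitter"). Results are still lost when the bridge process restarts; if persistence across restarts is needed, start T8b to connect jobs.* / disk. Acceptance: results and artifacts remain retrievable after connection jitter (see TestTerminalResultCachedAndServedAfterGatewayLoss, TestCachedTerminalNotServedForDifferentRunId).
- T9 Server-side DeadlineAt fallback terminal state: markOpenClawRunDeadlineInterruptedLocked. When the run has expired (sess.task.DeadlineAt) and the gateway cannot confirm it, the bridge proactively returns the terminal state interrupted (OPENCLAW_RUN_DEADLINE_EXCEEDED), forming a double safeguard with the client deadline in T3. Trade-off: the terminal state is forced by deadline only when the gateway cannot confirm the run; when the gateway explicitly returns running, the run is not killed, to avoid hurting legitimate long tasks (that side is covered by client T3). Acceptance: after the gateway has been unreachable for longer than the budget, tasks.get returns a definite terminal state (see TestGatewayUnconfirmedFallbackPastDeadlineInterrupts).
- T10 Finer error semantics: gatewayRPCError (orchestrator.go). For OPENCLAW_GATEWAY_SOCKET_CLOSED, Data carries retryable=true and poll=true, expressing "the connection broke but the run may still be running in the background and can keep being polled", so client T5 can continue polling instead of failing hard. Acceptance: socket-closed errors carry the retryable/poll markers.
- T13 Runtime sync check (bridge binary + gateway plugin). "The source is fixed, but that is not what is running" is a recurring trap, so both sides must be confirmed:
  - Bridge: /api/ping.commit is non-empty and equals the target commit (the local launchd service may still run an old /usr/local/bin/xworkmate-go-core).
  - Gateway plugin: the gateway startup log contains … openclaw-multi-session-plugins, and /acp/rpc xworkmate.session.prepare does not return unknown method (with the plugin loaded it returns a real mapping; only when it is not loaded does the bridge take the compatibilityMode=local-session-prepare downgrade).
  Acceptance: the App no longer shows unknown method: xworkmate.*; the gateway's N plugins list includes multi-session.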
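The interplay of T7, T8, and T9 in tasks.get can be modelled compactly. The sketch below is a hypothetical Python model of the behaviour described above, not the Go code in openclaw_run_registry.go: the class, the set of terminal statuses, the "degraded" marker, and the choice not to cache the deadline result are all assumptions made for illustration.

```python
TERMINAL_STATUSES = {"completed", "failed", "cancelled", "interrupted"}  # assumed set


class RunRegistry:
    """Per-session run store whose lifetime is independent of the gateway socket."""

    def __init__(self):
        self._sessions = {}

    def start(self, session_id, run_id, deadline_at):
        # A new turn on the same session resets any cached terminal result.
        self._sessions[session_id] = {
            "run_id": run_id, "deadline_at": deadline_at, "terminal": None,
        }

    def task_get(self, session_id, run_id, gateway_result, now):
        """gateway_result is None when the gateway could not confirm the run."""
        sess = self._sessions.get(session_id)
        if sess is None or sess["run_id"] != run_id:
            # Never serve another run's cached terminal state.
            return gateway_result or {"status": "not_found"}
        if sess["terminal"] is not None:
            return sess["terminal"]            # T8: replay the cached terminal result
        if gateway_result is not None:
            if gateway_result.get("status") in TERMINAL_STATUSES:
                sess["terminal"] = gateway_result
            return gateway_result              # an explicit gateway answer wins
        if now >= sess["deadline_at"]:
            # T9: unconfirmed and past the budget -> definite terminal state.
            return {"status": "interrupted", "code": "OPENCLAW_RUN_DEADLINE_EXCEEDED"}
        # T7: unconfirmed but within budget -> keep the client polling.
        return {"status": "running", "runId": run_id, "degraded": True}
```

The order of the checks carries the trade-offs stated above: a cached terminal result is served first, an explicit running from the gateway is never overridden by the deadline, and only an unconfirmed run past its budget is forced to interrupted.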
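L3 Observability (cross-cutting · infra/service/lab)
- T11 runId end to end: openclaw_run_registry.go adds warn logs tagged with runId / openclawSessionKey at two places, tasks_get_unconfirmed_fallback and run_deadline_interrupt, so events can be joined by runId across App → bridge → plugin → gateway (the existing component=acp_sse logs already carry requestId). Acceptance: socket jitter and deadline terminal states can be located by runId in the bridge log.
- T12 Key metrics: internal/acp/metrics.go exposes in-process counters through /api/ping.metrics: gatewaySocketClosed, taskGetUnconfirmedFallback, runDeadlineInterrupt. Live verification confirms that /api/ping already returns the metrics field (commit 0a50621). Acceptance: the three kinds of instability events can be monitored without relying on user screenshots. (Alert wiring is left to operations.)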
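6. Suggested rollout order
- ✅ Primary root-cause fix (verified live): make the OpenClaw gateway load openclaw-multi-session-plugins reliably. Reinstall from a stable path with openclaw plugins install, restart the gateway, and confirm that the startup log shows 6 plugins … openclaw-multi-session-plugins and that xworkmate.* no longer returns unknown method. This is the precondition for 「采集AI资讯能产出」 ("the AI-news collection task produces output"); see §4.
- ✅ Same-day stop-the-bleeding (merged to main): T1 + T2 (ingress config) + T3 + T4 + T6 (client) + the session.prepare numeric-code downgrade, eliminating "always cut at 30min / routes missing config / running forever / cannot stop". Note: the session.prepare numeric-code downgrade is still valuable. When the plugin is not loaded it lets the bridge fall back gracefully instead of failing hard; once the plugin is loaded, the real plugin path is used.
- ✅ Robustness hardening (commit 2333c3e): T7 + T8 + T9 (bridge persistent run store decoupled from the WS), which reduce brief gateway unavailability or jitter to "bounded continued polling → deadline terminal state" instead of running forever or losing the result.
- ✅ Disconnect semantics + observability (commit 0a50621): T10 (socket-closed carries retryable/poll) + T5 (bounded continued polling on transient App polling drops) + T11 (runId logs) + T12 (/api/ping.metrics counters).
- Remaining:
  - S1 (reverted, to be redone): default expectedArtifactDirs make a run that "expects artifacts but produced none" hang at "waiting for export" (breaking the E2E tests). The root cause is that openClawTaskGetRequiresArtifactExport treats "has expectedArtifactDirs" as "must export / block". The correct approach is to decouple the "scan hint" from the "blocking export requirement": default directories should only drive the plugin's fallback scan and must not trigger the bridge's wait for export. This needs a separate round, validated against the full E2E suite.
  - T8b (persistence across process restarts): write the per-session run store to disk / connect it to xworkmate.jobs.* so terminal states can still be replayed after the bridge process restarts. The current in-memory store already covers "WS jitter / brief gateway drops" (within one process); surviving restarts is a smaller marginal gain with larger complexity (serialization / load at startup / expiry cleanup / concurrency) and is best done as an independent round with tests.
Regression reference: in this directory's 00-review-env-and-matrix.md, section 2 "general acceptance criteria", the items "status stream / cancel / retry stay stable during long task execution" and "the same task run 3 times in a row does not hang" are the regression exit for this plan. For independent defects and fixes in the artifact delivery chain (artifact scope / workspace path), see openclaw-gateway-e2e-regression/ROOT_CAUSE_ANALYSIS.md.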
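7. Full-chain stability improvements (based on the 2026-06-26 four-layer live verification)
Priority is ordered by "directly decides whether a task can produce output". S1/S2 are new findings from this round's live run.
- S0 Stable plugin installation (highest): ✅ done and verified (2026-06-27). All xworkmate.* gateway methods depend on openclaw-multi-session-plugins. Precise root cause: ~/.openclaw/extensions/openclaw-multi-session-plugins was a symlink → /tmp/openclaw-multi-session-plugins (a temporary location that disappears on reboot or tmp cleanup); openclaw plugins inspect showed Source: /private/tmp/… with the warning "loaded without install/load-path provenance (untracked local code)". The gateway started at 09:21, before that tmp path was in place → only 5 plugins were loaded that time and every xworkmate.* call returned unknown method. Actions taken: ① removed the symlink and copied the content into a real directory ~/.openclaw/extensions/openclaw-multi-session-plugins; ② ran openclaw plugins install <that path> --force to create a proper path install record; ③ restarted with launchctl kickstart -k gui/$UID/ai.openclaw.gateway. Verification: the startup log shows http server listening (6 plugins: … openclaw-multi-session-plugins); the Source in inspect changed to ~/.openclaw/extensions/…/dist/index.js and the provenance warning is gone; xworkmate.session.prepare through the bridge returns the real plugin response (fallback=null, with mapping, artifactScope=tasks/draft_s0verify/s0-run) and no longer takes the bridge's local-session-prepare downgrade. Follow-up: ~/.openclaw/extensions/ is now a real directory (not a /tmp symlink), so the plugin is no longer lost after a reboot or tmp cleanup. It should be brought into deployment (deploy_gateway_openclaw), installing from the openclaw-multi-session-plugins repository, so it cannot be symlinked to a temporary location again.
- S1 Empty expectedArtifactDirs disables the workspace-root fallback: ⚠️ one version was merged and then reverted (commit 0280893 → reverted in 81f65e3). Root cause: the live session mapping had expectedArtifactDirs:[], and the plugin's fallback scan for "the agent writes artifacts to reports/ or artifacts/ under the workspace root rather than the task scope" depends on expectedArtifactDirs. With an empty list the fallback does nothing, so artifacts are not collected even when the agent produces them, and the UI shows 「暂无文件」 ("no files yet"). Reason for the revert: that implementation added default directories and set requiresExport=true for every task with inferred requiredExts, so a gateway run that succeeded without artifacts hung at "waiting for artifact export" (TestHTTPHandlerGatewayOpenClawHandlesFiveConcurrentE2ECases and others turned red). The blocking comes from openClawTaskGetRequiresArtifactExport treating "has expectedArtifactDirs" as "must export". Correct approach (to be redone): decouple the "scan hint" from the "blocking export": default directories only drive the plugin's fallback scan and do not trigger the bridge's wait for export, or blocking is enabled only when the client explicitly declares requiredArtifactExtensions. This needs a separate round, validated against the full E2E suite before release.
- S2 Ambiguous no_native_task_record state: the truth of xworkmate.tasks.get comes from either "the gateway host task registry has a detached task for the run" or "artifacts already exist". In the live run chat.send succeeded but the gateway had no native task record (the agent may have run as inline chat without registering a queryable task) and there were no artifacts → the plugin returned no_native_task_record, and the bridge could only keep polling until the deadline through the T7 fallback, unable to tell "still running" from "finished without artifacts". Improvements: ① confirm whether chat.send on the gateway side should produce a detached task (agent config / tasks.* registration); ② when no_native_task_record persists beyond a minimum execution time, have the plugin/bridge return the more explicit running(no-record) vs completed(no-artifact) semantics, closed out by the §5 T9 deadline. Acceptance: when the agent runs normally, tasks.get returns a real running→completed; on anomalies it gives a definite terminal state instead of staying degraded indefinitely.
- S3 Triple consistency (known constraint): the plugin strictly checks that sessionKey/runId/artifactScope are consistent (exportArtifacts.ts:126), and the bridge's openclawSessionKey is agent:main: + appThreadKey. Callers and probes must not pre-prefix agent:main: (a double prefix → artifactScope does not match). The bridge's taskGetParamsWithSessionScope already fills in the triple; keep it as the single source of truth, with the App and probes passing only sessionId=draft:<id> + runId.
- S4 Runtime observability: following §5 T11/T12, include the bridge /api/ping.commit, the gateway's N plugins list, and openclaw plugins inspect in health checks; runId runs through the App → bridge → plugin → gateway logs, which makes it easy to locate which of the four layers a break falls in.
- S5 Leaner workspace context (implemented 2026-06-27), so that several conflicting paths no longer block a conversation from continuing ✅. Symptom: the prompt prefix TaskThread workspace context of every gateway turn carried three nearly identical absolute paths at once: currentTaskWorkspace + localWorkspace + remoteWorkspaceHint. Of these, localWorkspace is the App's local thread directory (~/.xworkmate/threads/…), which the gateway agent's file system cannot reach at all, and remoteWorkspaceHint duplicates currentTaskWorkspace. With several conflicting paths the agent does not know where to work → the conversation task may fail to continue. Improvement (app_controller_desktop_thread_actions.dart taskWorkspaceContextPromptInternal): gateway tasks get only one working directory, currentTaskWorkspace (the artifact scope is managed by the plugin); localWorkspace is kept only for non-gateway targets (where a local agent actually runs there); remoteWorkspaceHint is dropped when it equals currentTaskWorkspace. sessionKey is kept (short, not a path, no conflict). The UI is unaffected: chat bubbles show the original user message, and the prompt debug view special-cases only the Execution context / Preferred skills / Attached files blocks, rendering the workspace block as body text. Acceptance: assistant_execution_target_test.dart is fully green (the gateway prompt asserts isNot(localWorkspace) / isNot(remoteWorkspaceHint) and keeps currentTaskWorkspace / sessionKey).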
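The S5 de-duplication rule can be illustrated with a small function. This is a hypothetical Python model of the selection logic only, not taskWorkspaceContextPromptInternal: the function name, the dict shape, and the handling of a gateway remoteWorkspaceHint that differs from currentTaskWorkspace (dropped here) are assumptions.

```python
def workspace_context(is_gateway, current_task_workspace, local_workspace,
                      remote_workspace_hint, session_key):
    """Pick which workspace fields go into the prompt prefix."""
    context = {
        "currentTaskWorkspace": current_task_workspace,
        "sessionKey": session_key,  # short, not a path, so always kept
    }
    if is_gateway:
        # Gateway agents cannot reach the App's local thread directory,
        # so they receive exactly one working directory.
        return context
    context["localWorkspace"] = local_workspace
    if remote_workspace_hint and remote_workspace_hint != current_task_workspace:
        context["remoteWorkspaceHint"] = remote_workspace_hint
    return context


if __name__ == "__main__":
    print(workspace_context(True, "/ws/tasks/a/r1", "/home/u/threads/t1",
                            "/ws/tasks/a/r1", "agent:main:draft:a"))
```

Keeping a single path for gateway turns matches the test assertions listed above, which require the gateway prompt to omit both localWorkspace and remoteWorkspaceHint.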
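8. 2026-06-27 Cases 00–05 full acceptance execution log (in progress)
Execution plan: docs/plans/2026-06-27-cases-00-05-gateway-turn-acceptance.md. This section records only sanitized run evidence; API keys, bridge tokens, and account passwords are not written to the repository. Traceability references: .xcodeinsight/context/repo-summary.md, .xcodeinsight/index/risk-index.md, .xcodeinsight/index/callchain-index.md, used to align the call chains and risk boundaries of xworkmate-app / xworkmate-bridge / openclaw-multi-session-plugins / playbooks.
8.1 Current goals and status
| Stage | Status | Current evidence / next step |
|---|---|---|
| Repository and runtime baseline | ✅ Done | App baseline ca9cba6 is included in the regression; the final commit of this round's fixes is 66fd0e4 |
| Local all-in-one deployment | ✅ Done | The idempotent migration fix for the stable plugin directory is committed to xworkspace-console main (50c2d85 + 5093e21); the locally fixed version passed verification |
| Gateway Turn targeted regression | 🟢 Passed | The two new T5 targeted tests pass; the full assistant_execution_target_test.dart passes with 74 tests |
| Cases 00–05 real tasks | ✅ Done | After the tasks finished, the results were written up as this section's log; new regressions will be appended later |
| Commit / push / CI | ✅ Done | xworkmate-app is committed and pushed, main -> 66fd0e4; the related supporting repositories are pushed as well |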
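8.2 08:47 CST baseline snapshot
- Bridge: 127.0.0.1:8787 is listening and the launchd service plus.svc.xworkspace.bridge is running; an anonymous /api/ping returns 401, as expected with authentication enabled. Commit and metrics are checked later with the local token, sanitized.
- Gateway: 127.0.0.1:18789 is listening and the launchd service ai.openclaw.gateway is running.
- Plugin: openclaw-multi-session-plugins is loaded; Source and Install path both point to the stable directory ~/.openclaw/extensions/openclaw-multi-session-plugins, Recorded version 2026.6.1; the temporary-directory problem from S0 has not recurred.
- Repositories: xworkmate-app, xworkmate-bridge, and xworkspace-console are all on main; the local main of openclaw-multi-session-plugins is ahead of the remote by 1, and the acceptance run must not accidentally carry that repository's existing commit.
- Security boundary: the three kinds of model API keys supplied by the user are passed only as environment variables of the installer subprocess; they are not written to documents or tracked in Git. The first round exposed a defect where the remote script printed provider keys; it is recorded in §8.3 and fixed in the local source.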
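8.3 08:54 CST first-round findings and fixes
- T5 test gap closed: the old test still asserted that OpenClaw tasks.get fails immediately on the first ACP_HTTP_CONNECTION_CLOSED, which conflicts with the new "bounded continued polling" contract. It is now split into: ① after one transient drop the second snapshot succeeds, pending is cleared, and lifecycle=ready/success; ② after kOpenClawPollTransientRetryLimit + 1 (currently 6) consecutive drops the task deterministically lands in ACP_HTTP_CONNECTION_CLOSED and clears pending/association. Both targeted tests report All tests passed!.
- Test speed is controllable: pollOpenClawTaskAssociationInternal gains an optional pollInterval parameter that still defaults to 2 seconds; only tests inject Duration.zero, so the production retry cadence is unchanged.
- Installer log leak: the managed bootstrap passed provider API keys through the ordinary append_var, so they were printed in plain text, while the unified auth token was already masked. This was not the cause of the model call failures, but it violates the installer's security boundary. In the local xworkspace-console, all six kinds of provider keys now go through append_secret_var, and a bootstrap regression test was added; bash tests/setup-ai-workspace-all-in-one-test.sh passes in full. The script running at that moment came from the remote version before the fix; the final document records no key values.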
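8.4 08:59 CST idempotent deployment fix and full App targeted regression
- The first all-in-one run failed at Link openclaw-multi-session-plugins to extensions (macOS): S0 had already turned the target into a stable real directory, but the old patch still forced state: link pointing at the source directory under /tmp/, and Ansible correctly refused the directory→symlink change. An automatic rerun cannot fix a structural error, so the second round was aborted.
- xworkspace-console fix: the macOS patch now detects and removes the old temporary symlink, ensures that ~/.openclaw/extensions/openclaw-multi-session-plugins is a real directory, copies only the build output / manifest, and runs openclaw plugins install <stable-path> --force to record provenance; it no longer regresses the S0 fix back into a temporary link.
- For local runs the bootstrap prefers patch-macos-playbooks.py from the same checkout, and the remote fallback adds cache-busting so that a commit just pushed to main is not followed by a download of the 5-minute-old CDN version.
- The installer fixes above were committed and pushed to xworkspace-console/main in two commits: 50c2d85 and 5093e21; the bootstrap tests, bash -n, and Python compile all pass.
- Full App targeted regression: flutter test test/runtime/assistant_execution_target_test.dart → 74 tests / All tests passed. It covers the T3 running deadline, T4 local stop, T5 disconnect recovery / exhaustion, T6 pending cleanup, and five representative kinds of E2E admission/isolation tests.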
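8.5 09:xx CST wrap-up
- xworkmate-app has its final commit pushed: 66fd0e4 fix(gateway): harden OpenClaw polling and acceptance notes.
- The related fix commits in xworkspace-console and playbooks have also been pushed to main; playbooks still keeps the user's existing local change to roles/cloudflare_dns/tasks/main.yml, which was left untouched.
- The repo-summary / risk / callchain files in .xcodeinsight were used to align call chains, risk boundaries, and the wrap-up verification; similar problems can be traced through this index in the future.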
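9. 2026-06-27 full-chain status overview and remaining stability backlog (merged on main across 4 repositories)
This section closes out the re-examination: it maps the three threads (§3 failure points → §5 coding TODOs → §7 live improvements) onto the actual code state of each repository's main and gives precise acceptance criteria for the remaining items. Some checkboxes in §5 lagged behind §2/§6; this section is the authoritative status.
9.1 main baseline of the four repositories (all committed and pushed)
| Repository | main HEAD | Key content this round |
|---|---|---|
| xworkmate-app | 82d25de docs(case06): close out acceptance log (code baseline 66fd0e4) | T3/T4/T5/T6 client stop-the-bleeding; this case document |
| xworkmate-bridge | 0a50621 fix(acp): remove orphaned S1 test — keep main compiling | T7/T8/T9 persistent run store, T10 error semantics, T11/T12 observability; S1 reverted |
| openclaw-multi-session-plugins | 1fe544c ci(runtime-release): publish stable runtime-latest tag | Source of the plugin's xworkmate.* methods; CI switched to an npm build + the stable runtime-latest release URL |
| ai-workspace-infra/playbooks | 5c74feb fix(cloudflare_dns): prefer CLOUDFLARE_API_TOKEN | T1/T2 ingress timeout + flush aligned on all routes (70m, asserted in validate); cloudflare token priority |
Side fixes live separately in xworkspace-console/main (50c2d85 + 5093e21): idempotent migration of the stable plugin directory on macOS, and provider keys going through append_secret_var.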
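9.2 Done (all stop-the-bleeding and hardening that decide whether a task is "bounded, recoverable, and able to produce output")
The stability loop across the four layers (App → bridge → plugin → gateway) is in place:
- Ingress layer (T1/T2, playbooks): every route has flush_interval -1 + a single-source 70m timeout ≥ bridge 60min, so long tasks are no longer always cut at 30min.
- Client layer (T3/T4/T5/T6, app): running polling is capped by a deadline, Stop is locally authoritative, transient transport drops are retried with a bound (≤5), and the three terminal states deterministically clear pending. "Running forever / cannot stop / result lost on a transient drop" is eliminated.
- Bridge layer (T7/T8/T9/T10, bridge): the per-session persistent run store is decoupled from the WS, brief gateway drops or jitter are reduced to "bounded continued polling → deadline terminal state", and terminal results are replayed without loss; socket-closed carries retryable/poll semantics.
- Observability (T11/T12/T13): runId runs through the logs of all four layers; /api/ping.metrics exposes three instability counters; the runtime is checked on both sides (bridge commit + gateway N plugins).
- Primary root cause (S0): the plugin is installed in a real directory under ~/.openclaw/extensions/ (not a /tmp symlink) with plugins install --force provenance, so the plugin is no longer lost on restart and xworkmate.* no longer returns unknown method.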
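9.3 Remaining backlog (ordered by "directly decides output")
| Item | Repository | Status | Problem | Correct approach / acceptance |
|---|---|---|---|---|
| S1 Default expectedArtifactDirs fallback scan | bridge (+ plugin) | ⚠️ Reverted (0280893→81f65e3), to be redone | The old implementation treated "has expectedArtifactDirs" as "must block on export", so successful runs without artifacts hung at "waiting for export" and E2E turned red | Decouple the "scan hint" from the "blocking export": default directories only drive the plugin's fallback scan and do not trigger the bridge's wait for export, or blocking is enabled only when the client explicitly declares requiredArtifactExtensions. Acceptance: the full E2E suite is green + artifacts written by the agent to the workspace root are collected |
| S2 Ambiguous no_native_task_record state | plugin + bridge | 🔬 To be improved | When chat.send succeeds but the gateway has no detached task and no artifacts, "still running" cannot be told from "finished without artifacts"; the only option is degraded polling until the deadline | ① Check whether chat.send should register a detached task (agent / tasks.* config); ② beyond a minimum execution time, return an explicit running(no-record) vs completed(no-artifact), closed out by the T9 deadline. Acceptance: in normal execution tasks.get goes running→completed; anomalies get a definite terminal state |
| T8b Persistence across process restarts | bridge | ⏸️ Known marginal item | The per-session run store is in memory; it covers "WS jitter / brief gateway drops", but terminal states are still lost when the bridge process restarts | Write the run store to disk / connect it to xworkmate.jobs.* (serialization + load at startup + expiry cleanup + concurrency). An independent round with tests. Acceptance: after a bridge restart, tasks.get can still replay the terminal state |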
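9.4 Stability recommendations (regression prevention / operations)
- Make the "runtime ≠ source" check routine: turn the two-sided check from §5 T13 into a health probe / CI gate: bridge /api/ping.commit equals the release commit, the gateway startup log's N plugins includes openclaw-multi-session-plugins, and openclaw plugins inspect shows no provenance warning. Alert if any of the three fails, to avoid another "fixed, but that is not what is running".
- Bring the plugin installation into deployment: fold the stable-directory installation of openclaw-multi-session-plugins into deploy_gateway_openclaw (install from the repository's build output, not a symlink to /tmp) so the S0 fix does not depend on manual steps.
- Timeouts must share one source and not drift: the ingress xworkmate_bridge_acp_stream_timeout and the bridge openClawAgentWaitMaxTimeout are defined on two separate sides; add a cross assertion "ingress ≥ bridge + margin" to validate/CI so that a future one-sided change cannot reintroduce drift.
- Add tests before redoing S1: first write two contrasting E2E cases, "has expectedArtifactDirs but the run produced no artifacts" and "the agent writes artifacts to the workspace root", and only then change the implementation, to avoid repeating the 0280893 revert.
- Wire /api/ping.metrics to alerting: connect the three counters gatewaySocketClosed / taskGetUnconfirmedFallback / runDeadlineInterrupt to monitoring so that "instability" is observable instead of depending on user screenshots.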
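The first and third recommendations can be combined into one gate. The sketch below is a hypothetical check that operates on values collected elsewhere (the ping commit, a startup log line, the inspect output, and the two timeout strings); the function names, the accepted duration suffixes, and the 5-minute margin are assumptions, since the document only states "≥ bridge + margin".

```python
import re


def parse_minutes(value: str) -> int:
    """Parse simple durations such as '70m', '60min' or '1h' into minutes."""
    match = re.fullmatch(r"(\d+)\s*(m|min|h)", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * 60 if unit == "h" else amount


def health_gate(ping_commit, release_commit, startup_log, inspect_output,
                entry_timeout, bridge_timeout, margin_minutes=5):
    """Return a list of problems; an empty list means the gate passes."""
    problems = []
    if not ping_commit or ping_commit != release_commit:
        problems.append("bridge /api/ping.commit does not match the release commit")
    if "openclaw-multi-session-plugins" not in startup_log:
        problems.append("gateway startup log does not list openclaw-multi-session-plugins")
    if "without install/load-path provenance" in inspect_output:
        problems.append("plugin loaded without install provenance")
    if parse_minutes(entry_timeout) < parse_minutes(bridge_timeout) + margin_minutes:
        problems.append("ingress timeout is below bridge timeout + margin")
    return problems
```

Returning every failed check, rather than stopping at the first, tells an operator in one pass whether the bridge binary, the plugin, or the ingress config is the side that drifted.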
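9.5 Acceptance summary
The conclusion of this case can be compressed into three sentences:
- This is not a LiteLLM headroom problem; within the gateway-turn contract chain, plugin loading, runtime snapshots, and result return got out of order.
- Once openclaw-multi-session-plugins loads reliably, xworkmate.tasks.get returns to terminal-state semantics that support sustained polling, and the error 「GoTaskService 没有返回可显示的输出。」 ("GoTaskService returned no displayable output.") is replaced by displayable results.
- The current acceptance criterion is: the task completes, produces a .md file, tasks.get returns completed + durable output + artifacts, and the App no longer mistakes an undecorated running snapshot for an empty terminal state.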