From 49f619dbccfde693817b733249e075fd2fa56927 Mon Sep 17 00:00:00 2001 From: Haitao Pan Date: Sat, 27 Jun 2026 11:27:14 +0800 Subject: [PATCH] fix(gateway): harden OpenClaw polling and acceptance notes --- ...6-gateway-turn-stability-and-robustness.md | 77 +++++++-- ...-27-cases-00-05-gateway-turn-acceptance.md | 73 +++++++++ ...pp_controller_desktop_runtime_helpers.dart | 5 + ...app_controller_desktop_thread_actions.dart | 20 ++- .../assistant_execution_target_test.dart | 155 +++++++++++++----- 5 files changed, 271 insertions(+), 59 deletions(-) create mode 100644 docs/plans/2026-06-27-cases-00-05-gateway-turn-acceptance.md diff --git a/docs/cases/06-gateway-turn-stability-and-robustness.md b/docs/cases/06-gateway-turn-stability-and-robustness.md index fc29c01a..986c3d4b 100644 --- a/docs/cases/06-gateway-turn-stability-and-robustness.md +++ b/docs/cases/06-gateway-turn-stability-and-robustness.md @@ -231,9 +231,10 @@ curl -sS -X POST http://127.0.0.1:8787/acp/rpc \ `abortRun` / `cancelAssistantTaskForSessionInternal`(`:1793`)改为:**先**乐观清 `aiGatewayPendingSessionKeysInternal`、置 lifecycle=`aborted`、退出 loop,**再** best-effort 发 `tasks.cancel`。UI 终止不得依赖 gateway 往返。 验收:gateway 不可达时点 `停止` 仍能立刻停下。 -- [ ] **T5 传输中断降级为"后台续跑·重连中"** - 收到实时 `ACP_HTTP_CONNECTION_CLOSED` 时,不直接当硬失败把任务留在 running,而是降级为"已转后台 / 重连中",触发 `resumeOpenClawTaskAssociationsInternal`(`:906`,目前仅在 thread 加载时触发)续轮询。 - 验收:SSE 瞬断后,任务能自动恢复轮询并最终拿到终态或明确终止。 +- [x] **T5 传输中断降级为"后台续跑·重连中"** — `pollOpenClawTaskAssociationInternal` catch(`thread_actions.dart`) + 轮询期间 App↔bridge 传输瞬断(`ACP_HTTP_CONNECTION_CLOSED`)时,不硬失败丢结果,而是**有界重试续轮询**:连续瞬断 `< kOpenClawPollTransientRetryLimit(=5)` 则保持 running、2s 后重试下一次 `getTask`;每次成功重置计数;超限才落终态。bridge 侧 T7/T9 负责网关侧抖动,这里只兜 App↔bridge 这一跳。 + **取舍**:未引入新的「重连中」UI 相位(避免改进度条布局);任务保持「运行中」即降级态。未走 `resumeOpenClawTaskAssociationsInternal` 全量恢复(那是 thread 重载路径),而是就地有界重试,风险更低、不会无限重连。 + 验收:轮询瞬断 ≤5 次能自动续轮询;持续不可达则在有界次数后落终态。 - [ ] **T6 失败路径与 pending 清理一致性** 审计 `applyGatewayChatFailureInternal`(`:1613`,置 `ready` 但不清 pending)与调用方 `finally`(`:715` 仅 `!handedOffToBridgeTask` 才清 pending)之间的竞态,确保任一终态路径都能确定性地清 pending,杜绝"错误已渲染但仍 running"。 @@ -260,9 +261,9 @@ curl -sS -X POST http://127.0.0.1:8787/acp/rpc \ **取舍**:仅在 gateway **无法确认**时按 deadline 强制终态;gateway 明确回 `running` 时**不**强杀,避免误伤合法长任务(那一侧由客户端 T3 兜底)。 验收:gateway 失联超过 budget 后,`tasks.get` 返回确定 terminal(见 `TestGatewayUnconfirmedFallbackPastDeadlineInterrupts`)。 -- [ ] **T10 错误语义细化** - `gatewayRPCError`(`orchestrator.go:1678`)区分"连接断但 run 仍在后台可查" vs "run 确实失败",前者携带 `runId` + `retryable/poll` 提示,供客户端走 T5 续轮询而非硬失败。 - 验收:客户端能据错误语义区分"重连续跑"与"真失败"。 +- [x] **T10 错误语义细化** — `gatewayRPCError`(`orchestrator.go`) + 对 `OPENCLAW_GATEWAY_SOCKET_CLOSED` 在 Data 中带 `retryable=true`、`poll=true`,表达「连接断但 run 可能仍在后台、可续轮询」语义,供客户端 T5 据此续轮询而非硬失败。 + 验收:socket-closed 错误带 retryable/poll 标记。 - [x] **T13 运行态同步校验(bridge 二进制 + 网关插件)** 「源码已修但跑的不是它」是反复踩的坑,需双侧确认: @@ -272,13 +273,13 @@ curl -sS -X POST http://127.0.0.1:8787/acp/rpc \ ### L3 可观测性(横切 · infra/service/lab) -- [ ] **T11 端到端贯穿 runId** - App 日志 → Caddy access log → bridge SSE 日志(已有 `component=acp_sse`,`http_handler.go:221`)→ gateway run,全链路带同一 `runId`,便于定位"入口断"还是"WS 断"。 - 验收:任一 `runId` 可在四层日志串联。 +- [x] **T11 端到端贯穿 runId** — `openclaw_run_registry.go` + 在 `tasks_get_unconfirmed_fallback`、`run_deadline_interrupt` 两处加 `runId`/`openclawSessionKey` 标记的 warn 日志,可与 App→bridge→插件→gateway 按 `runId` 串联(既有 `component=acp_sse` 已带 requestId)。 + 验收:socket 抖动 / deadline 终态在 bridge 日志可按 runId 定位。 -- [ ] **T12 关键指标 + 告警** - bridge 暴露:`SOCKET_CLOSED 在途任务数`、gateway WS 重连计数、running 轮询超 deadline 计数。 - 验收:⑥ 类事件发生即在监控可见,无需靠用户截图。 +- [x] **T12 关键指标** — `internal/acp/metrics.go`,经 `/api/ping.metrics` 暴露 + 进程内计数:`gatewaySocketClosed`、`taskGetUnconfirmedFallback`、`runDeadlineInterrupt`。live 验证 `/api/ping` 已返回 `metrics` 字段(commit `0a50621`)。 + 验收:三类不稳定事件可监控,无需靠用户截图。(告警接入留运维侧) --- @@ -287,8 +288,11 @@ curl -sS -X POST http://127.0.0.1:8787/acp/rpc \ 0. ✅ **主根因修复**(live 验证):让 OpenClaw 网关稳定加载 `openclaw-multi-session-plugins`——`openclaw plugins install` 从稳定路径重装 + 重启网关,确认启动日志 `6 plugins … openclaw-multi-session-plugins`、`xworkmate.*` 不再 `unknown method`。这是「采集AI资讯能产出」的前提(详见 §4)。 1. ✅ **当天止血**(已合并 main):T1 + T2(入口配置)+ T3 + T4 + T6(客户端)+ session.prepare 数字 code 降级,消除"30min 必断 / 路由漏配 / 无限 running / 停不掉"。 说明:session.prepare 数字 code 降级仍有价值——当插件**未**加载时,让 bridge 优雅 fallback 而非硬失败;插件加载后走真实 plugin 路径。 -2. ✅ **健壮性加固**(本地验证 commit `2333c3e`):T7 + T8 + T9(bridge 持久 run 仓与 WS 解耦),把网关短暂不可达 / 抖动收敛为「有界续轮询 → deadline 终态」,而非无限运行/丢结果。 -3. **跟进(待办)**:T5 + T10(断连续跑语义)、T11 + T12(可观测性)、T8b(跨进程重启持久化,接 `xworkmate.jobs.*` / 磁盘);运行态校验:每次替换 bridge 二进制 / 网关重启后,核对 `/api/ping.commit` 与网关 `N plugins` 列表。 +2. ✅ **健壮性加固**(commit `2333c3e`):T7 + T8 + T9(bridge 持久 run 仓与 WS 解耦),把网关短暂不可达 / 抖动收敛为「有界续轮询 → deadline 终态」,而非无限运行/丢结果。 +3. ✅ **断连语义 + 可观测**(commit `0a50621`):T10(socket-closed 带 retryable/poll)+ T5(App 轮询瞬断有界续轮询)+ T11(runId 日志)+ T12(`/api/ping.metrics` 计数)。 +4. **剩余**: + - **S1(已回退,待重做)**:缺省 `expectedArtifactDirs` 会让「期望产物但实际无产物」的 run 卡在「等待导出」(破坏 E2E 测试)。根因是 `openClawTaskGetRequiresArtifactExport` 把「有 expectedArtifactDirs」等同「必须导出/阻塞」。**正确做法**:解耦「扫描提示」与「阻塞式导出要求」——让缺省目录只驱动插件的兜底扫描、不触发 bridge 的等待导出。需单独一轮、对全 E2E 套件验证。 + - **T8b(跨进程重启持久化)**:把 per-session run 仓落磁盘 / 接 `xworkmate.jobs.*`,让 bridge **进程重启**后仍能回放终态。当前内存仓已覆盖「WS 抖动 / 网关瞬断」(同进程内),跨重启是较小边际收益、较大复杂度(序列化 / 启动加载 / 过期清理 / 并发),建议作为独立一轮带测试做。 > 回归对照:本目录 `00-review-env-and-matrix.md` 第 2 节"通用验收标准"中"长任务执行期间状态流 / 取消 / 重试稳定""同一任务重复执行 3 次不卡死",即本规划的回归出口。 > 产物交付链(artifact scope / workspace 路径)的独立缺陷与修复,见 `openclaw-gateway-e2e-regression/ROOT_CAUSE_ANALYSIS.md`。 @@ -305,10 +309,10 @@ curl -sS -X POST http://127.0.0.1:8787/acp/rpc \ **验证**:启动日志 `http server listening (6 plugins: … openclaw-multi-session-plugins)`;`inspect` 的 `Source` 变为 `~/.openclaw/extensions/…/dist/index.js`、provenance 警告消失;`xworkmate.session.prepare` 经 bridge 返回**真实插件响应**(`fallback=null`、带 `mapping`、`artifactScope=tasks/draft_s0verify/s0-run`),不再走 bridge 的 `local-session-prepare` 降级。 收尾:`~/.openclaw/extensions/` 现为真实目录(非 /tmp 软链),重启/重启后不再丢插件;建议把它纳入部署(`deploy_gateway_openclaw`)从仓库 `openclaw-multi-session-plugins` 安装,避免再被软链到临时盘。 -- **S1 `expectedArtifactDirs` 为空导致根目录兜底失效 — ✅ 已修复并 live 验证(commit `0280893`)** +- **S1 `expectedArtifactDirs` 为空导致根目录兜底失效 — ⚠️ 一版本已合并后回退(commit `0280893` → 回退于 `81f65e3`)** 根因:live 的 session mapping 为 `expectedArtifactDirs:[]`,而插件对「agent 把产物写到 workspace 根 `reports/`/`artifacts/` 而非 task scope」的兜底扫描**依赖 `expectedArtifactDirs`**;为空 → 兜底形同虚设 → 即便 agent 产出也收不到,表现「暂无文件」。 - 修复:`orchestrator.go openClawArtifactContractForParams` 在「任务期望产物(`requiresExport` 或推断出 `requiredExts`)但未声明目录」时补缺省 `["reports/","artifacts/","exports/"]` 并置 `requiresExport=true`;纯聊天不受影响(`defaultOpenClawExpectedArtifactDirs`,含单测 `orchestrator_s1_artifact_dirs_test.go`)。 - 验证:提交「采集AI资讯保存md」→ `requiresArtifactExport=true`、`expectedArtifactDirs=['reports/','artifacts/','exports/']`(修复前为 `[]`)。 + **回退原因**:当时的实现给所有「推断出 requiredExts」的任务补缺省目录并置 `requiresExport=true`,导致 gateway run 成功但**实际无产物**时卡在「等待 artifact 导出」(`TestHTTPHandlerGatewayOpenClawHandlesFiveConcurrentE2ECases` 等转红)。阻塞来自 `openClawTaskGetRequiresArtifactExport` 把「有 expectedArtifactDirs」等同「必须导出」。 + **正确做法(待重做)**:解耦「扫描提示」与「阻塞式导出」——缺省目录只驱动插件兜底扫描、不触发 bridge 等待导出;或仅在客户端**显式**声明 `requiredArtifactExtensions` 时启用。需单独一轮、对全 E2E 套件验证后再上。 - **S2 `no_native_task_record` 状态歧义** — `xworkmate.tasks.get` 的真值来自「gateway host task registry 有该 run 的 detached task」**或**「artifact 已存在」。live 中 chat.send 成功但 gateway 无 native task record(agent 可能以 inline chat 执行、未注册可查 task),且无产物 → 插件回 `no_native_task_record`,bridge 只能靠 T7 兜底续轮询到 deadline,**无法区分「还在跑」与「跑完没产物」**。 改进:①确认 gateway 侧 chat.send 是否应产出 detached task(agent 配置/ `tasks.*` 注册);②插件/bridge 在 `no_native_task_record` 且超过最小执行时长时,下发更明确的 `running(no-record)` vs `completed(no-artifact)` 语义,配合 §5 T9 deadline 收口。 @@ -317,3 +321,42 @@ curl -sS -X POST http://127.0.0.1:8787/acp/rpc \ - **S3 三元组一致性(已知约束)** — 插件严校 `sessionKey/runId/artifactScope` 三者一致(`exportArtifacts.ts:126`),且 bridge 的 openclawSessionKey 由 `agent:main:` + appThreadKey 组成。**调用方/探针不要预带 `agent:main:` 前缀**(否则双前缀 → `artifactScope does not match`)。bridge `taskGetParamsWithSessionScope` 已负责补齐;保持其为唯一可信来源,App/探针只传 `sessionId=draft:` + `runId`。 - **S4 运行态可观测** — 沿用 §5 T11/T12:bridge `/api/ping.commit`、网关 `N plugins` 列表、`openclaw plugins inspect` 三处纳入健康检查;`runId` 贯穿 App→bridge→插件→gateway 日志,便于定位断点落在四层中的哪一层。 + +--- + +## 8. 2026-06-27 Cases 00–05 全面验收执行日志(进行中) + +> 执行计划:`docs/plans/2026-06-27-cases-00-05-gateway-turn-acceptance.md`。本节只记录脱敏后的运行证据;API Key、Bridge Token、账号密码不写入仓库。 +> 追溯参考:`.xcodeinsight/context/repo-summary.md`、`.xcodeinsight/index/risk-index.md`、`.xcodeinsight/index/callchain-index.md`,用于对齐 `xworkmate-app` / `xworkmate-bridge` / `openclaw-multi-session-plugins` / `playbooks` 的调用链与风险边界。 + +### 8.1 当前目标与状态 + +| 阶段 | 状态 | 当前证据 / 下一步 | +|---|---|---| +| 仓库与运行态基线 | 🟡 进行中 | App `main=ca9cba6`;存在本轮未提交的 T5 + 文档改动,保留并纳入测试 | +| 本地 all-in-one 部署 | 🟡 进行中 | 首轮在稳定插件目录幂等迁移处失败;修复已提交 `xworkspace-console` main(`50c2d85` + `5093e21`),本地修复版正在重跑 | +| Gateway Turn 定向回归 | 🟢 已通过 | T5 两条新增定向测试通过;完整 `assistant_execution_target_test.dart` 74 条通过 | +| Cases 00–05 真实任务 | ⏳ 待执行 | 每项记录 runId、终态、耗时、结构/Artifact、重复与失败收口 | +| 提交 / push / CI | ⏳ 待执行 | 完成全量回归后提交;网络瞬态失败自动有界重试 | + +### 8.2 08:47 CST 基线快照 + +- Bridge:`127.0.0.1:8787` 正在监听,launchd `plus.svc.xworkspace.bridge` 为 running;匿名 `/api/ping` 返回 `401`,符合鉴权启用预期,后续用本机 token 脱敏核验 commit/metrics。 +- Gateway:`127.0.0.1:18789` 正在监听,launchd `ai.openclaw.gateway` 为 running。 +- 插件:`openclaw-multi-session-plugins` 为 `loaded`,Source/Install path 均为稳定目录 `~/.openclaw/extensions/openclaw-multi-session-plugins`,Recorded version `2026.6.1`;S0 的临时目录问题当前未复发。 +- 仓库:`xworkmate-app`、`xworkmate-bridge`、`xworkspace-console` 均在 `main`;`openclaw-multi-session-plugins` 本地 `main` 比远端 ahead 1,验收过程不得误带该仓库已有提交。 +- 安全边界:用户提供的三类模型 API Key 仅作为安装子进程环境变量传入,不落文档、不纳入 Git;首轮暴露出远端脚本会打印 provider key 的缺陷,已在 §8.3 记录并修复本地源码。 + +### 8.3 08:54 CST 首轮发现与修复 + +- **T5 测试缺口已补**:旧测试仍断言 OpenClaw `tasks.get` 第一次 `ACP_HTTP_CONNECTION_CLOSED` 就立即失败,与「有界续轮询」新契约冲突。现拆为:①一次瞬断后第二次快照成功,pending 清理且 lifecycle=`ready/success`;②连续 `kOpenClawPollTransientRetryLimit + 1`(当前 6)次瞬断后,确定性落 `ACP_HTTP_CONNECTION_CLOSED`、清 pending/association。两条定向测试均 `All tests passed!`。 +- **测试速度可控**:`pollOpenClawTaskAssociationInternal` 新增默认仍为 2 秒的 `pollInterval` 可选参数,仅测试注入 `Duration.zero`,生产重试节奏不变。 +- **安装日志泄密缺口**:托管 bootstrap 把 provider API Key 走普通 `append_var`,因此会打印明文;统一 auth token 则已脱敏。这不是模型调用失败原因,但违反安装安全边界。已在 `xworkspace-console` 本地改为六类 provider key 全走 `append_secret_var`,并新增 bootstrap 回归;`bash tests/setup-ai-workspace-all-in-one-test.sh` 全部通过。当前正在运行的脚本来自修复前远端,最终文档不记录任何 key 值。 + +### 8.4 08:59 CST 部署幂等修复与 App 完整定向回归 + +- 首轮 all-in-one 在 `Link openclaw-multi-session-plugins to extensions (macOS)` 失败:S0 已把目标改成稳定真实目录,而旧 patch 仍强制 `state: link` 指向 `/tmp`/源码目录;Ansible 正确拒绝 directory→symlink。自动重跑无法修复结构性错误,因此中止第二轮。 +- `xworkspace-console` 修复:macOS patch 现在会识别并移除旧临时 symlink、确保 `~/.openclaw/extensions/openclaw-multi-session-plugins` 为真实目录、只复制构建产物/manifest,并执行 `openclaw plugins install --force` 记录 provenance;不再把 S0 修复倒退成临时链接。 +- bootstrap 本地执行优先采用同 checkout 的 `patch-macos-playbooks.py`,远端 fallback 增加 cache-busting,避免 main 刚提交后又下载到 5 分钟 CDN 旧版本。 +- 上述 installer 修复已分两次提交并 push 到 `xworkspace-console/main`:`50c2d85`、`5093e21`;bootstrap tests、`bash -n`、Python compile 均通过。 +- App 完整定向回归:`flutter test test/runtime/assistant_execution_target_test.dart` → **74 tests / All tests passed**。覆盖 T3 running deadline、T4 本地停止、T5 断线恢复/耗尽、T6 pending 清理以及五类代表性 E2E admission/isolation 测试。 diff --git a/docs/plans/2026-06-27-cases-00-05-gateway-turn-acceptance.md b/docs/plans/2026-06-27-cases-00-05-gateway-turn-acceptance.md new file mode 100644 index 00000000..0d86b098 --- /dev/null +++ b/docs/plans/2026-06-27-cases-00-05-gateway-turn-acceptance.md @@ -0,0 +1,73 @@ +# Cases 00–05 Gateway Turn Stability Acceptance Implementation Plan + +> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. + +**Goal:** 在本机 all-in-one 运行态中完成 `docs/cases/00–05` 的真实任务验收,修复发现的 Gateway Turn 稳定性缺口,并把全过程证据回写到 Case 06。 + +**Architecture:** 以 Case 06 的 App → Bridge → multi-session plugin → OpenClaw gateway 四层链路为主线。先验证安装与运行态,再以定向测试锁住取消、超时、断线与 pending 清理语义,最后通过真实 Bridge 请求执行五类任务并核对状态、Artifact、重复执行与失败收口。 + +**Tech Stack:** Flutter/Dart、Go、OpenClaw gateway、JSON-RPC/SSE、macOS launchd、GitHub Actions。 + +--- + +### Task 1: Baseline and progress ledger + +**Files:** +- Modify: `docs/cases/06-gateway-turn-stability-and-robustness.md` +- Create: `docs/plans/2026-06-27-cases-00-05-gateway-turn-acceptance.md` + +**Steps:** +1. Capture branch, commit, dirty files, installed services, `/api/ping`, gateway plugin provenance, and Case 00 prerequisites. +2. Add a timestamped execution ledger to Case 06 without storing API keys or auth tokens. +3. Record each later command by outcome and evidence, not by secrets-bearing command line. + +### Task 2: Local all-in-one deployment + +**Files:** +- Modify only when a reproducible installer/runtime defect is found; keep each fix in its owning repository. +- Modify: `docs/cases/06-gateway-turn-stability-and-robustness.md` + +**Steps:** +1. Inspect the hosted bootstrap before execution and verify its final origin/redirect. +2. Run the installer with `DEEPSEEK_API_KEY`, `NVIDIA_API_KEY`, and `OLLAMA_API_KEY` supplied only through the child-process environment. +3. Retry transient download/network failures with bounded backoff. +4. Verify Bridge `/api/ping.commit`, gateway port `18789`, plugin stable path/provenance, and `xworkmate.session.prepare` behavior. + +### Task 3: Gateway Turn regression implementation + +**Files:** +- Modify: `lib/app/app_controller_desktop_runtime_helpers.dart` +- Modify: `lib/app/app_controller_desktop_thread_actions.dart` +- Test: `test/runtime/assistant_execution_target_test.dart` +- Test other runtime files only when the failure belongs there. + +**Steps:** +1. Add failing tests for bounded `ACP_HTTP_CONNECTION_CLOSED` polling recovery and retry exhaustion. +2. Run the focused test and confirm the new assertion fails before implementation where practicable. +3. Implement the smallest change that preserves pending during bounded retry and deterministically reaches terminal state after exhaustion. +4. Run `flutter test test/runtime/assistant_execution_target_test.dart` and related gateway runtime tests. +5. Run `scripts/ci/run_layered_tests.sh` to match the repository CI path. + +### Task 4: Cases 00–05 live acceptance + +**Files:** +- Modify: `docs/cases/06-gateway-turn-stability-and-robustness.md` + +**Steps:** +1. Verify Case 00 connectivity and error behavior against the local Bridge/Gateway runtime. +2. Execute Case 01 three times; validate Markdown structure, terminal status, non-empty Artifact, and isolation. +3. Execute Cases 02–05 with the documented prompts and a deterministic local test image for Case 02. +4. Exercise cancellation, invalid auth/unreachable endpoint, repeated runs, exact item/page counts, and Artifact retrieval where supported. +5. Record run IDs, durations, terminal status, artifact paths/counts, and any scoped deviations. + +### Task 5: Full verification and delivery + +**Files:** +- Modify: `docs/cases/06-gateway-turn-stability-and-robustness.md` + +**Steps:** +1. Run formatting/analyzer and the full relevant Flutter suite; rerun failed tests individually to distinguish deterministic regressions from flakes. +2. Review the complete diff and confirm no credential values or generated secrets are tracked. +3. Commit cohesive changes on `main` with explicit messages. +4. Push to `origin/main`; retry transient push failures with bounded exponential backoff. +5. Verify the newest GitHub Actions run(s) and append final acceptance status/blockers to Case 06. diff --git a/lib/app/app_controller_desktop_runtime_helpers.dart b/lib/app/app_controller_desktop_runtime_helpers.dart index 481e448b..fb66cb11 100644 --- a/lib/app/app_controller_desktop_runtime_helpers.dart +++ b/lib/app/app_controller_desktop_runtime_helpers.dart @@ -65,6 +65,11 @@ const Map kOpenClawRunningPollBudgets = { }; const String kOpenClawRunningPollTimeoutCode = 'OPENCLAW_RUN_POLL_TIMEOUT'; +// T5(docs/cases/06 §5):轮询期间 App↔bridge 传输瞬断(ACP_HTTP_CONNECTION_CLOSED)时, +// 不直接硬失败,而是有界重试续轮询(降级为「后台续跑·重连中」)。连续瞬断超过该上限才落终态, +// 避免桥/网关真正不可达时无限重连。每次成功 getTask 会重置计数,仅累计「连续」瞬断。 +const int kOpenClawPollTransientRetryLimit = 5; + bool openClawArtifactPathHasRequiredExtension(String path, String extension) { final normalizedPath = path.trim().toLowerCase(); final normalizedExtension = extension.trim().toLowerCase().replaceFirst( diff --git a/lib/app/app_controller_desktop_thread_actions.dart b/lib/app/app_controller_desktop_thread_actions.dart index ff817386..82642279 100644 --- a/lib/app/app_controller_desktop_thread_actions.dart +++ b/lib/app/app_controller_desktop_thread_actions.dart @@ -748,6 +748,7 @@ extension AppControllerDesktopThreadActions on AppController { required String sessionKey, required AssistantExecutionTarget target, required OpenClawTaskAssociation association, + Duration pollInterval = const Duration(seconds: 2), }) async { var current = association; var firstAttempt = true; @@ -755,6 +756,8 @@ extension AppControllerDesktopThreadActions on AppController { double? artifactSyncStartedAtMs; // T3: running 轮询兜底截止的起算锚点(首次进入 running 轮询的时间)。 double? runningPollFirstAtMs; + // T5: 连续传输瞬断计数(每次成功 getTask 重置)。 + var transientRetries = 0; final existingThread = taskThreadForSessionInternal(sessionKey); if (association.status.trim().toLowerCase() == 'syncing-artifacts') { artifactSyncStartedAtMs = existingThread?.lastArtifactSyncAtMs; @@ -767,7 +770,7 @@ extension AppControllerDesktopThreadActions on AppController { return; } if (!firstAttempt) { - await Future.delayed(const Duration(seconds: 2)); + await Future.delayed(pollInterval); } firstAttempt = false; try { @@ -779,6 +782,8 @@ extension AppControllerDesktopThreadActions on AppController { if (disposedInternal) { return; } + // T5: 成功一次即重置连续瞬断计数。 + transientRetries = 0; final nextAssociation = result.openClawTaskAssociation ?? current.copyWith( @@ -904,6 +909,19 @@ extension AppControllerDesktopThreadActions on AppController { if (disposedInternal) { return; } + // T5: 轮询期间 App↔bridge 传输瞬断(ACP_HTTP_CONNECTION_CLOSED)时,有界重试续轮询, + // 降级为「后台续跑·重连中」而非硬失败丢结果。bridge 侧 T7/T9 会在网关侧抖动时保持 run + // 可查/到点收口;这里只兜 App↔bridge 这一跳的瞬断。连续超过上限才落终态。 + if (aiGatewayPendingSessionKeysInternal.contains(sessionKey) && + interruptedAcpHttpTransportCodeInternal(error) == + 'ACP_HTTP_CONNECTION_CLOSED' && + transientRetries < kOpenClawPollTransientRetryLimit) { + transientRetries += 1; + // 不清 pending、不落终态:任务保持「运行中」,循环顶部 2s 延迟后重试下一次 getTask。 + recomputeTasksInternal(); + notifyIfActiveInternal(); + continue; + } if (aiGatewayPendingSessionKeysInternal.contains(sessionKey)) { await applyGatewayChatFailureInternal( sessionKey: sessionKey, diff --git a/test/runtime/assistant_execution_target_test.dart b/test/runtime/assistant_execution_target_test.dart index c96dc6e4..bf4be715 100644 --- a/test/runtime/assistant_execution_target_test.dart +++ b/test/runtime/assistant_execution_target_test.dart @@ -4087,56 +4087,131 @@ void main() { }, ); - test('OpenClaw task snapshot failure records a terminal result', () async { + test('OpenClaw task snapshot transient failure retries and recovers', () async { final fakeGoTaskService = _RecordingGoTaskServiceClient() - ..outcomes.add( - const GoTaskServiceResult( - success: true, - message: '', - turnId: 'turn-openclaw-poll-failed', - raw: { - 'success': true, - 'status': 'running', - 'sessionId': 'openclaw-poll-failed-task', - 'threadId': 'openclaw-poll-failed-task', - 'appThreadKey': 'openclaw-poll-failed-task', - 'openclawSessionKey': 'agent:main:openclaw-poll-failed-task', - 'turnId': 'turn-openclaw-poll-failed', - 'runId': 'run-openclaw-poll-failed', - 'artifactScope': - 'tasks/agent:main:openclaw-poll-failed-task/run-openclaw-poll-failed', - 'artifactDirectory': - '/tmp/tasks/agent:main:openclaw-poll-failed-task/run-openclaw-poll-failed', - 'gatewayProviderId': 'openclaw', - 'runtimeBudgetMinutes': 1, - }, - errorMessage: '', - resolvedModel: '', - route: GoTaskServiceRoute.externalAcpSingle, - ), - ) ..taskOutcomes.add( const GatewayAcpException( 'ACP HTTP connection closed before the OpenClaw task snapshot returned', code: 'ACP_HTTP_CONNECTION_CLOSED', ), + ) + ..taskOutcomes.add( + const GoTaskServiceResult( + success: true, + message: 'recovered task result', + turnId: 'turn-openclaw-poll-recovered', + raw: { + 'success': true, + 'status': 'completed', + 'turnId': 'turn-openclaw-poll-recovered', + 'runId': 'run-openclaw-poll-recovered', + 'output': 'recovered task result', + }, + errorMessage: '', + resolvedModel: '', + route: GoTaskServiceRoute.externalAcpSingle, + ), ); final controller = _connectedGatewayController(fakeGoTaskService); addTearDown(controller.dispose); + const association = OpenClawTaskAssociation( + sessionId: 'openclaw-poll-recovered', + threadId: 'openclaw-poll-recovered', + turnId: 'turn-openclaw-poll-recovered', + runId: 'run-openclaw-poll-recovered', + artifactScope: + 'tasks/agent:main:openclaw-poll-recovered/run-openclaw-poll-recovered', + artifactDirectory: + '/tmp/tasks/agent:main:openclaw-poll-recovered/run-openclaw-poll-recovered', + gatewayProviderId: 'openclaw', + startedAtMs: 0, + status: 'running', + appThreadKey: 'openclaw-poll-recovered', + openclawSessionKey: 'agent:main:openclaw-poll-recovered', + ); + controller.upsertTaskThreadInternal( + association.sessionId, + executionTarget: AssistantExecutionTarget.gateway, + selectedProvider: SingleAgentProvider.openclaw, + lifecycleStatus: 'running', + lastResultCode: 'running', + openClawTaskAssociation: association, + ); + controller.aiGatewayPendingSessionKeysInternal.add(association.sessionId); - await _selectGatewaySession(controller, 'openclaw-poll-failed-task'); - - await expectLater( - controller - .sendChatMessage('输出 PDF') - .timeout(const Duration(seconds: 2)), - completes, + await controller.pollOpenClawTaskAssociationInternal( + sessionKey: association.sessionId, + target: AssistantExecutionTarget.gateway, + association: association, + pollInterval: Duration.zero, ); - await Future.delayed(const Duration(milliseconds: 100)); + final recoveredThread = controller.requireTaskThreadForSessionInternal( + association.sessionId, + ); + expect(fakeGoTaskService.getTaskCount, 2); + expect(recoveredThread.lifecycleState.status, 'ready'); + expect(recoveredThread.lifecycleState.lastResultCode, 'success'); + expect( + controller.assistantSessionHasPendingRun(association.sessionId), + isFalse, + ); + }); + + test('OpenClaw task snapshot retry exhaustion is terminal', () async { + final fakeGoTaskService = _RecordingGoTaskServiceClient(); + for ( + var attempt = 0; + attempt <= kOpenClawPollTransientRetryLimit; + attempt += 1 + ) { + fakeGoTaskService.taskOutcomes.add( + const GatewayAcpException( + 'ACP HTTP connection closed before the OpenClaw task snapshot returned', + code: 'ACP_HTTP_CONNECTION_CLOSED', + ), + ); + } + final controller = _connectedGatewayController(fakeGoTaskService); + addTearDown(controller.dispose); + const association = OpenClawTaskAssociation( + sessionId: 'openclaw-poll-exhausted', + threadId: 'openclaw-poll-exhausted', + turnId: 'turn-openclaw-poll-exhausted', + runId: 'run-openclaw-poll-exhausted', + artifactScope: + 'tasks/agent:main:openclaw-poll-exhausted/run-openclaw-poll-exhausted', + artifactDirectory: + '/tmp/tasks/agent:main:openclaw-poll-exhausted/run-openclaw-poll-exhausted', + gatewayProviderId: 'openclaw', + startedAtMs: 0, + status: 'running', + appThreadKey: 'openclaw-poll-exhausted', + openclawSessionKey: 'agent:main:openclaw-poll-exhausted', + ); + controller.upsertTaskThreadInternal( + association.sessionId, + executionTarget: AssistantExecutionTarget.gateway, + selectedProvider: SingleAgentProvider.openclaw, + lifecycleStatus: 'running', + lastResultCode: 'running', + openClawTaskAssociation: association, + ); + controller.aiGatewayPendingSessionKeysInternal.add(association.sessionId); + + await controller.pollOpenClawTaskAssociationInternal( + sessionKey: association.sessionId, + target: AssistantExecutionTarget.gateway, + association: association, + pollInterval: Duration.zero, + ); final failedThread = controller.requireTaskThreadForSessionInternal( - 'openclaw-poll-failed-task', + association.sessionId, + ); + expect( + fakeGoTaskService.getTaskCount, + kOpenClawPollTransientRetryLimit + 1, ); expect(failedThread.lifecycleState.status, 'ready'); expect( @@ -4146,13 +4221,9 @@ void main() { expect(failedThread.lastArtifactSyncStatus, 'failed'); expect(failedThread.openClawTaskAssociation, isNull); expect( - controller.assistantSessionHasPendingRun('openclaw-poll-failed-task'), + controller.assistantSessionHasPendingRun(association.sessionId), isFalse, ); - expect( - controller.chatMessages.map((message) => message.text).join('\n'), - contains('ACP_HTTP_CONNECTION_CLOSED'), - ); }); test('OpenClaw running poll times out and clears pending state', () async { @@ -4999,6 +5070,7 @@ Future _waitForThreadLastResultCode( class _RecordingGoTaskServiceClient implements GoTaskServiceClient { int executeCount = 0; + int getTaskCount = 0; final List requests = []; final List updatesBeforeNextOutcome = []; @@ -5054,6 +5126,7 @@ class _RecordingGoTaskServiceClient implements GoTaskServiceClient { required OpenClawTaskAssociation association, required GoTaskServiceRoute route, }) async { + getTaskCount += 1; if (taskOutcomes.isNotEmpty) { final outcome = taskOutcomes.removeAt(0); if (outcome is GoTaskServiceResult) {