fix(gateway): harden OpenClaw polling and acceptance notes

Haitao Pan 2026-06-27 11:27:14 +08:00
parent 3422aae9bf
commit 49f619dbcc
5 changed files with 271 additions and 59 deletions


@ -231,9 +231,10 @@ curl -sS -X POST http://127.0.0.1:8787/acp/rpc \
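   `abortRun` / `cancelAssistantTaskForSessionInternal` (`:1793`) now **first** optimistically clears `aiGatewayPendingSessionKeysInternal`, sets lifecycle=`aborted`, and exits the loop, and only **then** sends `tasks.cancel` on a best-effort basis. Stopping from the UI must not depend on a gateway round trip.
   Acceptance: with the gateway unreachable, clicking `停止` (Stop) still halts the task immediately.
-- [ ] **T5 Downgrade transport interruptions to "continuing in background · reconnecting"**
+- [x] **T5 Downgrade transport interruptions to "continuing in background · reconnecting"** (`pollOpenClawTaskAssociationInternal` catch block, `thread_actions.dart`)
-  On a live `ACP_HTTP_CONNECTION_CLOSED`, do not treat it as a hard failure that leaves the task stuck in running; downgrade to "moved to background / reconnecting" and trigger `resumeOpenClawTaskAssociationsInternal` (`:906`, currently triggered only on thread load) to resume polling.
+  When the App↔bridge transport drops transiently during polling (`ACP_HTTP_CONNECTION_CLOSED`), do not hard-fail and lose the result; **resume polling with bounded retries** instead: while consecutive drops are `< kOpenClawPollTransientRetryLimit(=5)`, stay in running and retry the next `getTask` after 2s; every success resets the counter; only when the limit is exceeded does the task reach a terminal state. T7/T9 on the bridge side handle gateway-side jitter; this covers only the App↔bridge hop.
-  Acceptance: after a transient SSE drop, the task automatically resumes polling and eventually reaches a terminal state or an explicit termination.
+  **Trade-off**: no new "reconnecting" UI phase is introduced (to avoid changing the progress-bar layout); a task that stays "running" is the downgraded state. Recovery does not go through the full `resumeOpenClawTaskAssociationsInternal` path (that is the thread-reload path); it retries in place with a bound, which is lower risk and cannot reconnect forever.
+  Acceptance: up to 5 transient drops during polling resume automatically; if the endpoint stays unreachable, the task reaches a terminal state after the bounded number of attempts.
 - [ ] **T6 Consistency between failure paths and pending cleanup**
   Audit the race between `applyGatewayChatFailureInternal` (`:1613`, sets `ready` but does not clear pending) and the caller's `finally` (`:715`, clears pending only when `!handedOffToBridgeTask`), so that every terminal path deterministically clears pending and "error rendered but still running" cannot happen.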
@ -260,9 +261,9 @@ curl -sS -X POST http://127.0.0.1:8787/acp/rpc \
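   **Trade-off**: force a terminal state by deadline only when the gateway **cannot confirm** the run; when the gateway explicitly reports `running`, do **not** force-kill, so legitimate long tasks are not hit (that side is backstopped by client-side T3).
   Acceptance: once the gateway has been unreachable for longer than the budget, `tasks.get` returns a definite terminal state (`TestGatewayUnconfirmedFallbackPastDeadlineInterrupts`).
-- [ ] **T10 Refine error semantics**
+- [x] **T10 Refine error semantics** (`gatewayRPCError`, `orchestrator.go`)
-  `gatewayRPCError` (`orchestrator.go:1678`) distinguishes "connection dropped but the run is still queryable in the background" from "the run really failed"; the former carries `runId` plus a `retryable/poll` hint, so the client takes the T5 resume-polling path instead of hard-failing.
+  `OPENCLAW_GATEWAY_SOCKET_CLOSED` carries `retryable=true` and `poll=true` in its Data, meaning "the connection dropped but the run may still be alive in the background and can be polled again", so client-side T5 can resume polling rather than hard-fail.
-  Acceptance: the client can tell "reconnect and continue" from "real failure" by the error semantics.
+  Acceptance: socket-closed errors carry the retryable/poll markers.
 - [x] **T13 Runtime sync check (bridge binary + gateway plugin)**
   "The source is fixed but that is not what is running" is a recurring pitfall; both sides must be confirmed: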
@ -272,13 +273,13 @@ curl -sS -X POST http://127.0.0.1:8787/acp/rpc \
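 ### L3 Observability (cross-cutting · infra/service/lab)
-- [ ] **T11 End-to-end runId propagation**
+- [x] **T11 End-to-end runId propagation** (`openclaw_run_registry.go`)
-  App log → Caddy access log → bridge SSE log (already has `component=acp_sse`, `http_handler.go:221`) → gateway run: the whole chain carries the same `runId`, making it easy to tell an "ingress drop" from a "WS drop".
+  Warn logs tagged with `runId`/`openclawSessionKey` are added at `tasks_get_unconfirmed_fallback` and `run_deadline_interrupt`, so they can be correlated by `runId` across App→bridge→plugin→gateway (the existing `component=acp_sse` log already carries requestId).
-  Acceptance: any `runId` can be traced across all four log layers.
+  Acceptance: socket jitter and deadline terminal states can be located in the bridge log by runId.
-- [ ] **T12 Key metrics + alerting**
+- [x] **T12 Key metrics** (`internal/acp/metrics.go`, exposed via `/api/ping.metrics`)
-  The bridge exposes: `SOCKET_CLOSED in-flight task count`, the gateway WS reconnect count, and the count of running polls that exceeded the deadline.
+  In-process counters: `gatewaySocketClosed`, `taskGetUnconfirmedFallback`, `runDeadlineInterrupt`. Verified live that `/api/ping` now returns a `metrics` field (commit `0a50621`).
-  Acceptance: category ⑥ events are visible in monitoring as soon as they occur, without relying on user screenshots.
+  Acceptance: the three kinds of instability event can be monitored without relying on user screenshots. (Alerting integration is left to operations.)
 ---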
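A minimal Dart sketch of how a client might act on the T10 markers, assuming the error's Data arrives as a decoded JSON map. `GatewayFailureKind` and `classifyGatewayError` are hypothetical names used only for illustration; the error code and the `retryable`/`poll` keys are the only parts taken from the notes above, and the real client may decide differently.

```dart
// Hypothetical classification of a gateway error for the polling client.
enum GatewayFailureKind { resumePolling, hardFailure }

GatewayFailureKind classifyGatewayError(
  String code,
  Map<String, Object?> data,
) {
  // Both markers must be explicitly true; a missing key counts as false.
  final retryable = data['retryable'] == true;
  final poll = data['poll'] == true;
  if (code == 'OPENCLAW_GATEWAY_SOCKET_CLOSED' && retryable && poll) {
    return GatewayFailureKind.resumePolling;
  }
  return GatewayFailureKind.hardFailure;
}

void main() {
  final softDrop = classifyGatewayError(
    'OPENCLAW_GATEWAY_SOCKET_CLOSED',
    <String, Object?>{'retryable': true, 'poll': true, 'runId': 'run-1'},
  );
  final unmarked = classifyGatewayError(
    'OPENCLAW_GATEWAY_SOCKET_CLOSED',
    <String, Object?>{},
  );
  print(softDrop);
  print(unmarked);
}
```

Treating a missing marker as a hard failure keeps the default conservative: the client resumes polling only when the bridge has said explicitly that doing so is safe.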
@ -287,8 +288,11 @@ curl -sS -X POST http://127.0.0.1:8787/acp/rpc \
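 0. ✅ **Primary root-cause fix** (verified live): make the OpenClaw gateway load `openclaw-multi-session-plugins` reliably: reinstall from a stable path with `openclaw plugins install`, restart the gateway, and confirm that the startup log shows `6 plugins … openclaw-multi-session-plugins` and that `xworkmate.*` no longer returns `unknown method`. This is the precondition for the "collect AI news" (`采集AI资讯`) task producing output; see §4.
 1. ✅ **Same-day stopgap** (merged to main): T1 + T2 (ingress configuration) + T3 + T4 + T6 (client) + the session.prepare numeric-code downgrade, eliminating "always drops at 30min / missing route / infinite running / cannot stop".
    Note: the session.prepare numeric-code downgrade is still useful: when the plugin is **not** loaded, the bridge falls back gracefully instead of hard-failing; once the plugin is loaded, the real plugin path is used.
-2. ✅ **Robustness hardening** (verified locally, commit `2333c3e`): T7 + T8 + T9 (bridge persistent run store decoupled from the WS), reducing brief gateway unreachability / jitter to "bounded resume polling → deadline terminal state" instead of running forever or losing results.
+2. ✅ **Robustness hardening** (commit `2333c3e`): T7 + T8 + T9 (bridge persistent run store decoupled from the WS), reducing brief gateway unreachability / jitter to "bounded resume polling → deadline terminal state" instead of running forever or losing results.
-3. **Follow-up (to do)**: T5 + T10 (disconnect-and-continue semantics), T11 + T12 (observability), T8b (persistence across process restarts, `xworkmate.jobs.*` / disk); runtime check: after every bridge binary replacement or gateway restart, verify `/api/ping.commit` and the gateway's `N plugins` list.
+3. ✅ **Disconnect semantics + observability** (commit `0a50621`): T10 (socket-closed carries retryable/poll) + T5 (bounded resume polling for transient drops in App polling) + T11 (runId logs) + T12 (`/api/ping.metrics` counters).
+4. **Remaining**:
+   - **S1 (reverted, to be redone)**: default `expectedArtifactDirs` leave a run that "expects artifacts but produces none" stuck in "waiting for export" (breaking the E2E tests). The root cause is that `openClawTaskGetRequiresArtifactExport` treats "has expectedArtifactDirs" as "export required / blocking". **Correct approach**: decouple the "scan hint" from the "blocking export requirement", so that default directories only drive the plugin's fallback scan and do not trigger the bridge's wait-for-export. This needs its own round, validated against the full E2E suite.
+   - **T8b (persistence across process restarts)**: persist the per-session run store to disk or wire it to `xworkmate.jobs.*`, so the bridge can still replay terminal states after a **process restart**. The current in-memory store already covers "WS jitter / transient gateway drops" (within one process); surviving restarts offers a small marginal gain for considerable complexity (serialization / load on startup / expiry cleanup / concurrency), so it is best done as a separate round with tests.
 > Regression baseline: in `00-review-env-and-matrix.md` in this directory, section 2 "General acceptance criteria", the items "status stream / cancel / retry are stable during long-running tasks" and "running the same task 3 times does not hang" are the regression exit criteria for this plan.
 > For the separate defects and fixes in the artifact delivery chain (artifact scope / workspace paths), see `openclaw-gateway-e2e-regression/ROOT_CAUSE_ANALYSIS.md`.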
@ -305,10 +309,10 @@ curl -sS -X POST http://127.0.0.1:8787/acp/rpc \
   **Verification**: the startup log shows `http server listening (6 plugins: … openclaw-multi-session-plugins)`; the `Source` reported by `inspect` becomes `~/.openclaw/extensions/…/dist/index.js` and the provenance warning disappears; `xworkmate.session.prepare` through the bridge returns the **real plugin response** (`fallback=null`, with `mapping`, `artifactScope=tasks/draft_s0verify/s0-run`) and no longer takes the bridge's `local-session-prepare` downgrade.
   Wrap-up: `~/.openclaw/extensions/` is now a real directory (not a /tmp symlink), so the plugin is no longer lost after a restart; it is recommended to make this part of deployment (`deploy_gateway_openclaw`), installing from the `openclaw-multi-session-plugins` repository, so it is never symlinked to a temporary disk again.
-- **S1 Empty `expectedArtifactDirs` disables the root-directory fallback: ✅ fixed and verified live (commit `0280893`)**
+- **S1 Empty `expectedArtifactDirs` disables the root-directory fallback: ⚠️ a first version was merged and then reverted (commit `0280893` → reverted in `81f65e3`)**
   Root cause: the live session mapping has `expectedArtifactDirs:[]`, while the plugin's fallback scan for "the agent wrote artifacts to `reports/`/`artifacts/` at the workspace root instead of the task scope" **depends on `expectedArtifactDirs`**; empty → the fallback is effectively disabled → artifacts are not collected even when the agent produces them, which surfaces as "no files yet".
-  Fix: `orchestrator.go openClawArtifactContractForParams` fills in the default `["reports/","artifacts/","exports/"]` and sets `requiresExport=true` when "the task expects artifacts (`requiresExport`, or inferred `requiredExts`) but declares no directories"; pure chat is unaffected (`defaultOpenClawExpectedArtifactDirs`, with unit test `orchestrator_s1_artifact_dirs_test.go`).
+  **Reason for the revert**: that implementation filled in default directories and set `requiresExport=true` for every task with inferred requiredExts, so a gateway run that succeeded but **produced no artifacts** got stuck in "waiting for artifact export" (`TestHTTPHandlerGatewayOpenClawHandlesFiveConcurrentE2ECases` and others turned red). The blocking comes from `openClawTaskGetRequiresArtifactExport` treating "has expectedArtifactDirs" as "export required".
-  Verification: submitting "collect AI news and save as md" (`采集AI资讯保存md`) → `requiresArtifactExport=true`, `expectedArtifactDirs=['reports/','artifacts/','exports/']` (`[]` before the fix).
+  **Correct approach (to be redone)**: decouple the "scan hint" from "blocking export": default directories only drive the plugin's fallback scan and do not trigger the bridge's wait-for-export; alternatively, enable them only when the client **explicitly** declares `requiredArtifactExtensions`. This needs its own round and must be validated against the full E2E suite before shipping.
 - **S2 Ambiguous `no_native_task_record` status**: the source of truth for `xworkmate.tasks.get` is either "the gateway host task registry has a detached task for the run" **or** "an artifact already exists". In live runs, chat.send succeeds but the gateway has no native task record (the agent may run as inline chat without registering a queryable task) and there is no artifact → the plugin returns `no_native_task_record`, and the bridge can only fall back on T7 and keep polling until the deadline; it **cannot tell "still running" from "finished without artifacts"**.
   Improvements: ① confirm whether chat.send on the gateway side should produce a detached task (agent configuration / `tasks.*` registration); ② when the status is `no_native_task_record` and the minimum execution time has elapsed, have the plugin/bridge emit the more explicit `running(no-record)` vs `completed(no-artifact)` semantics, closing out with the §5 T9 deadline.
@ -317,3 +321,42 @@ curl -sS -X POST http://127.0.0.1:8787/acp/rpc \
 - **S3 Triple consistency (known constraint)**: the plugin strictly validates that `sessionKey/runId/artifactScope` agree (`exportArtifacts.ts:126`), and the bridge's openclawSessionKey is composed of `agent:main:` + appThreadKey. **Callers/probes must not pre-attach the `agent:main:` prefix** (otherwise: double prefix → `artifactScope does not match`). The bridge's `taskGetParamsWithSessionScope` already fills it in; keep it as the single source of truth. The App/probes pass only `sessionId=draft:<id>` + `runId`.
 - **S4 Runtime observability**: reuse §5 T11/T12: include the bridge's `/api/ping.commit`, the gateway's `N plugins` list, and `openclaw plugins inspect` in health checks; `runId` runs through the App→bridge→plugin→gateway logs, making it easy to locate which of the four layers a break falls in.
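The S3 double-prefix failure can be illustrated with a small Dart sketch. It assumes the scope layout `tasks/<sessionKey>/<runId>` seen in the test fixtures of this commit; the plugin's real check in `exportArtifacts.ts` is not shown in this diff, so `bridgeSessionKey`, `artifactScopeFor`, and `tripleIsConsistent` are hypothetical stand-ins rather than the actual implementation.

```dart
// Prefix named in the S3 note; the bridge is meant to be the only place that adds it.
const String kAgentPrefix = 'agent:main:';

// Hypothetical helper: how the bridge derives the OpenClaw session key.
String bridgeSessionKey(String appThreadKey) => kAgentPrefix + appThreadKey;

// Assumed scope layout, modelled on the test fixtures: tasks/<sessionKey>/<runId>.
String artifactScopeFor(String sessionKey, String runId) =>
    'tasks/$sessionKey/$runId';

// The consistency check reduced to its essence: the scope must match the pair.
bool tripleIsConsistent(String sessionKey, String runId, String scope) =>
    scope == artifactScopeFor(sessionKey, runId);

void main() {
  const runId = 'run-1';
  // Scope recorded for the run, derived from the bare thread key.
  final scope = artifactScopeFor(bridgeSessionKey('draft-abc'), runId);

  // Caller passes the bare key; the bridge adds the prefix exactly once.
  final good = bridgeSessionKey('draft-abc');
  // Caller pre-attached the prefix; the bridge adds it a second time.
  final doubled = bridgeSessionKey('agent:main:draft-abc');

  print(tripleIsConsistent(good, runId, scope)); // true
  print(tripleIsConsistent(doubled, runId, scope)); // false
}
```

Keeping prefixing in one place means a caller never has to know the key format, which is the reason the note asks probes to send only the bare `sessionId` and `runId`.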
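+---
+## 8. 2026-06-27 Cases 00–05 full acceptance execution log (in progress)
+> Execution plan: `docs/plans/2026-06-27-cases-00-05-gateway-turn-acceptance.md`. This section records only redacted runtime evidence; API keys, Bridge tokens, and account passwords are not written to the repository.
+> Traceability references: `.xcodeinsight/context/repo-summary.md`, `.xcodeinsight/index/risk-index.md`, `.xcodeinsight/index/callchain-index.md`, used to align the call chains and risk boundaries of `xworkmate-app` / `xworkmate-bridge` / `openclaw-multi-session-plugins` / `playbooks`.
+### 8.1 Current goals and status
+| Stage | Status | Current evidence / next step |
+|---|---|---|
+| Repository and runtime baseline | 🟡 In progress | App `main=ca9cba6`; uncommitted T5 + documentation changes from this round exist and are kept and included in testing |
+| Local all-in-one deployment | 🟡 In progress | The first run failed at the idempotent migration of the stable plugin directory; the fix has been committed to `xworkspace-console` main (`50c2d85` + `5093e21`), and the locally fixed version is being rerun |
+| Gateway Turn targeted regression | 🟢 Passed | Both new T5 targeted tests pass; the full `assistant_execution_target_test.dart` passes with 74 tests |
+| Cases 00–05 real tasks | ⏳ Pending | For each case, record runId, terminal state, duration, structure/Artifact, repeat runs, and failure close-out |
+| Commit / push / CI | ⏳ Pending | Commit after the full regression completes; transient network failures are retried automatically with a bound |
+### 8.2 08:47 CST baseline snapshot
+- Bridge: `127.0.0.1:8787` is listening; launchd `plus.svc.xworkspace.bridge` is running; an anonymous `/api/ping` returns `401`, as expected with authentication enabled; commit/metrics will be verified later with the local token, redacted.
+- Gateway: `127.0.0.1:18789` is listening; launchd `ai.openclaw.gateway` is running.
+- Plugin: `openclaw-multi-session-plugins` is `loaded`; Source and Install path both point to the stable directory `~/.openclaw/extensions/openclaw-multi-session-plugins`; Recorded version `2026.6.1`; the S0 temporary-directory problem has not recurred.
+- Repositories: `xworkmate-app`, `xworkmate-bridge`, and `xworkspace-console` are all on `main`; the local `main` of `openclaw-multi-session-plugins` is ahead of the remote by 1, and the acceptance work must not accidentally carry that repository's existing commit along.
+- Security boundary: the three model API keys supplied by the user are passed only as environment variables of the installer child process, and are neither written to documents nor tracked in Git; the first run exposed a defect where the remote script prints provider keys, which is recorded in §8.3 and fixed in the local source.
+### 8.3 08:54 CST first-round findings and fixes
+- **T5 test gap closed**: the old test still asserted that the first `ACP_HTTP_CONNECTION_CLOSED` from OpenClaw `tasks.get` fails immediately, which conflicts with the new "bounded resume polling" contract. It is now split into: ① after one transient drop the second snapshot succeeds, pending is cleared, and lifecycle=`ready/success`; ② after `kOpenClawPollTransientRetryLimit + 1` (currently 6) consecutive transient drops, the task deterministically ends in `ACP_HTTP_CONNECTION_CLOSED` and pending/association are cleared. Both targeted tests report `All tests passed!`.
+- **Test speed kept under control**: `pollOpenClawTaskAssociationInternal` gains an optional `pollInterval` parameter that still defaults to 2 seconds; only tests inject `Duration.zero`, so the production retry cadence is unchanged.
+- **Installer log secret leak**: the hosted bootstrap passed provider API keys through the plain `append_var`, which printed them in clear text; the unified auth token was already redacted. This is not why model calls failed, but it violates the installer security boundary. `xworkspace-console` has been changed locally so that all six kinds of provider key go through `append_secret_var`, with a new bootstrap regression test; `bash tests/setup-ai-workspace-all-in-one-test.sh` passes in full. The script currently running comes from the pre-fix remote; the final document records no key values.
+### 8.4 08:59 CST deployment idempotency fix and full App targeted regression
+- The first all-in-one run failed at `Link openclaw-multi-session-plugins to extensions (macOS)`: S0 had turned the target into a stable real directory, while the old patch still forced `state: link` pointing at `/tmp`/the source directory, and Ansible correctly refused directory→symlink. An automatic rerun cannot fix a structural error, so the second run was aborted.
+- `xworkspace-console` fix: the macOS patch now detects and removes the old temporary symlink, ensures `~/.openclaw/extensions/openclaw-multi-session-plugins` is a real directory, copies only build output/manifest, and runs `openclaw plugins install <stable-path> --force` to record provenance; it no longer regresses the S0 fix back into a temporary link.
+- Local bootstrap runs prefer the `patch-macos-playbooks.py` from the same checkout, and the remote fallback adds cache-busting, so that a commit just pushed to main is not shadowed by a CDN copy up to 5 minutes stale.
+- The installer fixes above were committed in two commits and pushed to `xworkspace-console/main`: `50c2d85` and `5093e21`; bootstrap tests, `bash -n`, and Python compile all pass.
+- Full App targeted regression: `flutter test test/runtime/assistant_execution_target_test.dart` → **74 tests / All tests passed**. It covers T3 running deadline, T4 local stop, T5 disconnect recovery/exhaustion, T6 pending cleanup, and the five representative E2E admission/isolation tests.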


@ -0,0 +1,73 @@
# Cases 00–05 Gateway Turn Stability Acceptance Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
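**Goal:** Complete real-task acceptance of `docs/cases/00–05` in the local all-in-one runtime, fix any Gateway Turn stability gaps found, and write the evidence from the whole process back to Case 06.
**Architecture:** Follow Case 06's four-layer chain, App → Bridge → multi-session plugin → OpenClaw gateway. First verify the installation and runtime, then lock in the cancel, timeout, disconnect, and pending-cleanup semantics with targeted tests, and finally run the five kinds of task through real Bridge requests and check status, Artifact, repeat runs, and failure close-out.
**Tech Stack:** Flutter/Dart, Go, OpenClaw gateway, JSON-RPC/SSE, macOS launchd, GitHub Actions.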
---
### Task 1: Baseline and progress ledger
**Files:**
- Modify: `docs/cases/06-gateway-turn-stability-and-robustness.md`
- Create: `docs/plans/2026-06-27-cases-00-05-gateway-turn-acceptance.md`
**Steps:**
1. Capture branch, commit, dirty files, installed services, `/api/ping`, gateway plugin provenance, and Case 00 prerequisites.
2. Add a timestamped execution ledger to Case 06 without storing API keys or auth tokens.
3. Record each later command by outcome and evidence, not by secrets-bearing command line.
### Task 2: Local all-in-one deployment
**Files:**
- Modify only when a reproducible installer/runtime defect is found; keep each fix in its owning repository.
- Modify: `docs/cases/06-gateway-turn-stability-and-robustness.md`
**Steps:**
1. Inspect the hosted bootstrap before execution and verify its final origin/redirect.
2. Run the installer with `DEEPSEEK_API_KEY`, `NVIDIA_API_KEY`, and `OLLAMA_API_KEY` supplied only through the child-process environment.
3. Retry transient download/network failures with bounded backoff.
4. Verify Bridge `/api/ping.commit`, gateway port `18789`, plugin stable path/provenance, and `xworkmate.session.prepare` behavior.
### Task 3: Gateway Turn regression implementation
**Files:**
- Modify: `lib/app/app_controller_desktop_runtime_helpers.dart`
- Modify: `lib/app/app_controller_desktop_thread_actions.dart`
- Test: `test/runtime/assistant_execution_target_test.dart`
- Test other runtime files only when the failure belongs there.
**Steps:**
1. Add failing tests for bounded `ACP_HTTP_CONNECTION_CLOSED` polling recovery and retry exhaustion.
2. Run the focused test and confirm the new assertion fails before implementation where practicable.
3. Implement the smallest change that preserves pending during bounded retry and deterministically reaches terminal state after exhaustion.
4. Run `flutter test test/runtime/assistant_execution_target_test.dart` and related gateway runtime tests.
5. Run `scripts/ci/run_layered_tests.sh` to match the repository CI path.
### Task 4: Cases 00–05 live acceptance
**Files:**
- Modify: `docs/cases/06-gateway-turn-stability-and-robustness.md`
**Steps:**
1. Verify Case 00 connectivity and error behavior against the local Bridge/Gateway runtime.
2. Execute Case 01 three times; validate Markdown structure, terminal status, non-empty Artifact, and isolation.
3. Execute Cases 02–05 with the documented prompts and a deterministic local test image for Case 02.
4. Exercise cancellation, invalid auth/unreachable endpoint, repeated runs, exact item/page counts, and Artifact retrieval where supported.
5. Record run IDs, durations, terminal status, artifact paths/counts, and any scoped deviations.
### Task 5: Full verification and delivery
**Files:**
- Modify: `docs/cases/06-gateway-turn-stability-and-robustness.md`
**Steps:**
1. Run formatting/analyzer and the full relevant Flutter suite; rerun failed tests individually to distinguish deterministic regressions from flakes.
2. Review the complete diff and confirm no credential values or generated secrets are tracked.
3. Commit cohesive changes on `main` with explicit messages.
4. Push to `origin/main`; retry transient push failures with bounded exponential backoff.
5. Verify the newest GitHub Actions run(s) and append final acceptance status/blockers to Case 06.
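
The "bounded exponential backoff" that Tasks 2 and 5 call for could look roughly like the Dart sketch below. `retryWithBackoff` and all of its parameters are hypothetical; the plan does not say how the retries are actually scripted, and the attempt count and delays shown here are illustrative defaults, not values from the plan.

```dart
import 'dart:async';

/// Runs [action], retrying failures that [isTransient] accepts.
/// The wait doubles after each failure, capped at [maxDelay];
/// the last error is rethrown once [maxAttempts] have been made.
Future<T> retryWithBackoff<T>(
  Future<T> Function() action, {
  int maxAttempts = 4,
  Duration initialDelay = const Duration(seconds: 1),
  Duration maxDelay = const Duration(seconds: 30),
  bool Function(Object error)? isTransient,
}) async {
  var attempt = 0;
  var delay = initialDelay;
  while (true) {
    attempt += 1;
    try {
      return await action();
    } catch (error) {
      // With no classifier supplied, every failure is treated as transient.
      final transient = isTransient == null || isTransient(error);
      if (!transient || attempt >= maxAttempts) rethrow;
      await Future<void>.delayed(delay);
      final doubled = delay * 2;
      delay = doubled > maxDelay ? maxDelay : doubled;
    }
  }
}

Future<void> main() async {
  var calls = 0;
  final result = await retryWithBackoff<String>(
    () async {
      calls += 1;
      if (calls < 3) throw TimeoutException('simulated network drop');
      return 'pushed';
    },
    initialDelay: Duration.zero, // keep the demo instant
    isTransient: (error) => error is TimeoutException,
  );
  print('$result after $calls attempts'); // pushed after 3 attempts
}
```

The `isTransient` filter matters as much as the bound: a structural failure, such as the directory→symlink refusal recorded in Case 06 §8.4, should surface immediately rather than be retried.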


@ -65,6 +65,11 @@ const Map<String, Duration> kOpenClawRunningPollBudgets = <String, Duration>{
 };
 const String kOpenClawRunningPollTimeoutCode = 'OPENCLAW_RUN_POLL_TIMEOUT';
+// T5 (docs/cases/06 §5): on a transient App↔bridge ACP_HTTP_CONNECTION_CLOSED
+// during polling, keep running and retry instead of hard-failing, up to this
+// many consecutive drops; a successful getTask resets the count.
+const int kOpenClawPollTransientRetryLimit = 5;
 bool openClawArtifactPathHasRequiredExtension(String path, String extension) {
   final normalizedPath = path.trim().toLowerCase();
   final normalizedExtension = extension.trim().toLowerCase().replaceFirst(


@ -748,6 +748,7 @@ extension AppControllerDesktopThreadActions on AppController {
     required String sessionKey,
     required AssistantExecutionTarget target,
     required OpenClawTaskAssociation association,
+    Duration pollInterval = const Duration(seconds: 2),
   }) async {
     var current = association;
     var firstAttempt = true;
@ -755,6 +756,8 @@ extension AppControllerDesktopThreadActions on AppController {
     double? artifactSyncStartedAtMs;
     // T3: when polling first saw `running`; start of the running-poll budget.
     double? runningPollFirstAtMs;
+    // T5: count of consecutive transient getTask failures.
+    var transientRetries = 0;
     final existingThread = taskThreadForSessionInternal(sessionKey);
     if (association.status.trim().toLowerCase() == 'syncing-artifacts') {
       artifactSyncStartedAtMs = existingThread?.lastArtifactSyncAtMs;
@ -767,7 +770,7 @@ extension AppControllerDesktopThreadActions on AppController {
         return;
       }
       if (!firstAttempt) {
-        await Future<void>.delayed(const Duration(seconds: 2));
+        await Future<void>.delayed(pollInterval);
       }
       firstAttempt = false;
       try {
@ -779,6 +782,8 @@ extension AppControllerDesktopThreadActions on AppController {
         if (disposedInternal) {
           return;
         }
+        // T5: a successful getTask resets the transient-failure count.
+        transientRetries = 0;
         final nextAssociation =
             result.openClawTaskAssociation ??
             current.copyWith(
@ -904,6 +909,19 @@ extension AppControllerDesktopThreadActions on AppController {
         if (disposedInternal) {
           return;
         }
+        // T5: a transient App↔bridge ACP_HTTP_CONNECTION_CLOSED during polling is
+        // retried with a bound instead of hard-failing; bridge T7/T9 cover the run
+        // on the gateway side, so this guards only the App↔bridge hop.
+        if (aiGatewayPendingSessionKeysInternal.contains(sessionKey) &&
+            interruptedAcpHttpTransportCodeInternal(error) ==
+                'ACP_HTTP_CONNECTION_CLOSED' &&
+            transientRetries < kOpenClawPollTransientRetryLimit) {
+          transientRetries += 1;
+          // Keep pending; the next getTask runs after the poll interval (2s by default).
+          recomputeTasksInternal();
+          notifyIfActiveInternal();
+          continue;
+        }
         if (aiGatewayPendingSessionKeysInternal.contains(sessionKey)) {
           await applyGatewayChatFailureInternal(
             sessionKey: sessionKey,
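
Stripped of the controller state, the T5 control flow in the hunks above reduces to the loop below. This is a self-contained Dart sketch, not the production method: `pollWithBoundedRetry`, `TransportClosed`, and `kRetryLimit` are hypothetical stand-ins for `pollOpenClawTaskAssociationInternal`, a `GatewayAcpException` with code `ACP_HTTP_CONNECTION_CLOSED`, and `kOpenClawPollTransientRetryLimit`, and statuses are simplified to plain strings.

```dart
import 'dart:async';
import 'dart:collection';

// Stand-in for kOpenClawPollTransientRetryLimit.
const int kRetryLimit = 5;

// Stand-in for a GatewayAcpException with code ACP_HTTP_CONNECTION_CLOSED.
class TransportClosed implements Exception {}

/// Polls [getTask] until it reports a status other than 'running'.
/// Transient transport drops are retried while the consecutive count is
/// below [kRetryLimit]; one more drop after that ends the poll.
Future<String> pollWithBoundedRetry(
  Future<String> Function() getTask, {
  Duration pollInterval = const Duration(seconds: 2),
}) async {
  var transientRetries = 0;
  var firstAttempt = true;
  while (true) {
    if (!firstAttempt) {
      await Future<void>.delayed(pollInterval);
    }
    firstAttempt = false;
    try {
      final status = await getTask();
      transientRetries = 0; // any success resets the streak
      if (status != 'running') return status;
    } on TransportClosed {
      if (transientRetries < kRetryLimit) {
        transientRetries += 1;
        continue; // stay pending and poll again after the interval
      }
      return 'ACP_HTTP_CONNECTION_CLOSED'; // bound exceeded: terminal
    }
  }
}

Future<void> main() async {
  // One drop followed by a completed snapshot: the poll recovers.
  final outcomes = Queue<Object>.of(<Object>[TransportClosed(), 'completed']);
  var calls = 0;
  final status = await pollWithBoundedRetry(() async {
    calls += 1;
    final next = outcomes.removeFirst();
    if (next is TransportClosed) throw next;
    return next as String;
  }, pollInterval: Duration.zero);
  print('$status after $calls getTask calls'); // completed after 2 getTask calls
}
```

With a limit of 5, six consecutive drops are needed to reach the terminal branch, which is why the exhaustion test in this commit queues `kOpenClawPollTransientRetryLimit + 1` failures and expects the same number of `getTask` calls.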


@ -4087,56 +4087,131 @@ void main() {
       },
     );
-    test('OpenClaw task snapshot failure records a terminal result', () async {
+    test('OpenClaw task snapshot transient failure retries and recovers', () async {
       final fakeGoTaskService = _RecordingGoTaskServiceClient()
-        ..outcomes.add(
-          const GoTaskServiceResult(
-            success: true,
-            message: '',
-            turnId: 'turn-openclaw-poll-failed',
-            raw: <String, dynamic>{
-              'success': true,
-              'status': 'running',
-              'sessionId': 'openclaw-poll-failed-task',
-              'threadId': 'openclaw-poll-failed-task',
-              'appThreadKey': 'openclaw-poll-failed-task',
-              'openclawSessionKey': 'agent:main:openclaw-poll-failed-task',
-              'turnId': 'turn-openclaw-poll-failed',
-              'runId': 'run-openclaw-poll-failed',
-              'artifactScope':
-                  'tasks/agent:main:openclaw-poll-failed-task/run-openclaw-poll-failed',
-              'artifactDirectory':
-                  '/tmp/tasks/agent:main:openclaw-poll-failed-task/run-openclaw-poll-failed',
-              'gatewayProviderId': 'openclaw',
-              'runtimeBudgetMinutes': 1,
-            },
-            errorMessage: '',
-            resolvedModel: '',
-            route: GoTaskServiceRoute.externalAcpSingle,
-          ),
-        )
         ..taskOutcomes.add(
           const GatewayAcpException(
             'ACP HTTP connection closed before the OpenClaw task snapshot returned',
             code: 'ACP_HTTP_CONNECTION_CLOSED',
           ),
+        )
+        ..taskOutcomes.add(
+          const GoTaskServiceResult(
+            success: true,
+            message: 'recovered task result',
+            turnId: 'turn-openclaw-poll-recovered',
+            raw: <String, dynamic>{
+              'success': true,
+              'status': 'completed',
+              'turnId': 'turn-openclaw-poll-recovered',
+              'runId': 'run-openclaw-poll-recovered',
+              'output': 'recovered task result',
+            },
+            errorMessage: '',
+            resolvedModel: '',
+            route: GoTaskServiceRoute.externalAcpSingle,
+          ),
         );
       final controller = _connectedGatewayController(fakeGoTaskService);
       addTearDown(controller.dispose);
+      const association = OpenClawTaskAssociation(
+        sessionId: 'openclaw-poll-recovered',
+        threadId: 'openclaw-poll-recovered',
+        turnId: 'turn-openclaw-poll-recovered',
+        runId: 'run-openclaw-poll-recovered',
+        artifactScope:
+            'tasks/agent:main:openclaw-poll-recovered/run-openclaw-poll-recovered',
+        artifactDirectory:
+            '/tmp/tasks/agent:main:openclaw-poll-recovered/run-openclaw-poll-recovered',
+        gatewayProviderId: 'openclaw',
+        startedAtMs: 0,
+        status: 'running',
+        appThreadKey: 'openclaw-poll-recovered',
+        openclawSessionKey: 'agent:main:openclaw-poll-recovered',
+      );
+      controller.upsertTaskThreadInternal(
+        association.sessionId,
+        executionTarget: AssistantExecutionTarget.gateway,
+        selectedProvider: SingleAgentProvider.openclaw,
+        lifecycleStatus: 'running',
+        lastResultCode: 'running',
+        openClawTaskAssociation: association,
+      );
+      controller.aiGatewayPendingSessionKeysInternal.add(association.sessionId);
-      await _selectGatewaySession(controller, 'openclaw-poll-failed-task');
-      await expectLater(
-        controller
-            .sendChatMessage('输出 PDF')
-            .timeout(const Duration(seconds: 2)),
-        completes,
+      await controller.pollOpenClawTaskAssociationInternal(
+        sessionKey: association.sessionId,
+        target: AssistantExecutionTarget.gateway,
+        association: association,
+        pollInterval: Duration.zero,
       );
-      await Future<void>.delayed(const Duration(milliseconds: 100));
+      final recoveredThread = controller.requireTaskThreadForSessionInternal(
+        association.sessionId,
+      );
+      expect(fakeGoTaskService.getTaskCount, 2);
+      expect(recoveredThread.lifecycleState.status, 'ready');
+      expect(recoveredThread.lifecycleState.lastResultCode, 'success');
+      expect(
+        controller.assistantSessionHasPendingRun(association.sessionId),
+        isFalse,
+      );
+    });
+    test('OpenClaw task snapshot retry exhaustion is terminal', () async {
+      final fakeGoTaskService = _RecordingGoTaskServiceClient();
+      for (
+        var attempt = 0;
+        attempt <= kOpenClawPollTransientRetryLimit;
+        attempt += 1
+      ) {
+        fakeGoTaskService.taskOutcomes.add(
+          const GatewayAcpException(
+            'ACP HTTP connection closed before the OpenClaw task snapshot returned',
+            code: 'ACP_HTTP_CONNECTION_CLOSED',
+          ),
+        );
+      }
+      final controller = _connectedGatewayController(fakeGoTaskService);
+      addTearDown(controller.dispose);
+      const association = OpenClawTaskAssociation(
+        sessionId: 'openclaw-poll-exhausted',
+        threadId: 'openclaw-poll-exhausted',
+        turnId: 'turn-openclaw-poll-exhausted',
+        runId: 'run-openclaw-poll-exhausted',
+        artifactScope:
+            'tasks/agent:main:openclaw-poll-exhausted/run-openclaw-poll-exhausted',
+        artifactDirectory:
+            '/tmp/tasks/agent:main:openclaw-poll-exhausted/run-openclaw-poll-exhausted',
+        gatewayProviderId: 'openclaw',
+        startedAtMs: 0,
+        status: 'running',
+        appThreadKey: 'openclaw-poll-exhausted',
+        openclawSessionKey: 'agent:main:openclaw-poll-exhausted',
+      );
+      controller.upsertTaskThreadInternal(
+        association.sessionId,
+        executionTarget: AssistantExecutionTarget.gateway,
+        selectedProvider: SingleAgentProvider.openclaw,
+        lifecycleStatus: 'running',
+        lastResultCode: 'running',
+        openClawTaskAssociation: association,
+      );
+      controller.aiGatewayPendingSessionKeysInternal.add(association.sessionId);
+      await controller.pollOpenClawTaskAssociationInternal(
+        sessionKey: association.sessionId,
+        target: AssistantExecutionTarget.gateway,
+        association: association,
+        pollInterval: Duration.zero,
+      );
       final failedThread = controller.requireTaskThreadForSessionInternal(
-        'openclaw-poll-failed-task',
+        association.sessionId,
+      );
+      expect(
+        fakeGoTaskService.getTaskCount,
+        kOpenClawPollTransientRetryLimit + 1,
       );
       expect(failedThread.lifecycleState.status, 'ready');
       expect(
@ -4146,13 +4221,9 @@ void main() {
       expect(failedThread.lastArtifactSyncStatus, 'failed');
       expect(failedThread.openClawTaskAssociation, isNull);
       expect(
-        controller.assistantSessionHasPendingRun('openclaw-poll-failed-task'),
+        controller.assistantSessionHasPendingRun(association.sessionId),
         isFalse,
       );
-      expect(
-        controller.chatMessages.map((message) => message.text).join('\n'),
-        contains('ACP_HTTP_CONNECTION_CLOSED'),
-      );
     });
     test('OpenClaw running poll times out and clears pending state', () async {
@ -4999,6 +5070,7 @@ Future<void> _waitForThreadLastResultCode(
 class _RecordingGoTaskServiceClient implements GoTaskServiceClient {
   int executeCount = 0;
+  int getTaskCount = 0;
   final List<GoTaskServiceRequest> requests = <GoTaskServiceRequest>[];
   final List<GoTaskServiceUpdate> updatesBeforeNextOutcome =
       <GoTaskServiceUpdate>[];
@ -5054,6 +5126,7 @@ class _RecordingGoTaskServiceClient implements GoTaskServiceClient {
     required OpenClawTaskAssociation association,
     required GoTaskServiceRoute route,
   }) async {
+    getTaskCount += 1;
     if (taskOutcomes.isNotEmpty) {
       final outcome = taskOutcomes.removeAt(0);
       if (outcome is GoTaskServiceResult) {