Chain Map: Session Recovery
Repo chain: xworkmate-app → xworkmate-bridge
Recovery Scenarios
S1: App restart with running task
App starts
└─ AppController.restoreTaskThreads()
   └─ ThreadStorage.loadAllThreads() → TaskThread[]
      └─ For each thread with lifecycleStatus=running:
         └─ resolveGatewayThreadConnectionState()
            ├─ thread has pendingTurnId?
            │  ├─ Yes → pollBridgeTaskSnapshot(turnId)
            │  │  └─ xworkmate.tasks.get({ sessionId, threadId, turnId })
            │  │     ├─ Terminal snapshot found → apply result and mark ready
            │  │     ├─ Session not found → mark failed
            │  │     └─ No response → mark unrecovered
            │  └─ No → mark ready (no pending turn)
            │
            └─ thread lifecycleStatus=queued?
               └─ drainOpenClawGatewayQueue() → re-send
Key files:
lib/app/app_controller_desktop_thread_sessions.dart
lib/runtime/external_code_agent_acp_desktop_transport.dart
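The S1 tree can be read as a pure decision function. The sketch below is a Python illustration only, not the app's Dart code; `ProbeOutcome`, `recover_on_restart`, and the returned action strings are hypothetical names chosen for this example.

```python
from enum import Enum
from typing import Optional


class ProbeOutcome(Enum):
    """Hypothetical outcomes of one xworkmate.tasks.get probe."""
    TERMINAL = "terminal"            # terminal snapshot found
    SESSION_NOT_FOUND = "not_found"  # bridge has no such session
    NO_RESPONSE = "no_response"      # probe got no answer


def recover_on_restart(lifecycle_status: str,
                       pending_turn_id: Optional[str],
                       probe: Optional[ProbeOutcome]) -> str:
    """Return the action the S1 tree prescribes for one restored thread."""
    if lifecycle_status == "queued":
        return "re-send"             # drain the queue and send again
    if lifecycle_status != "running":
        return "none"                # nothing to recover
    if pending_turn_id is None:
        return "ready"               # no pending turn
    if probe is ProbeOutcome.TERMINAL:
        return "ready"               # result is applied first
    if probe is ProbeOutcome.SESSION_NOT_FOUND:
        return "failed"
    return "unrecovered"             # no response from the bridge
```

Keeping the decision free of I/O makes each branch of the tree testable without a running bridge.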
S2: Bridge restart (all sessions lost)
Bridge sends SSE close / WebSocket disconnects
└─ App: ExternalCodeAgentAcpDesktopTransport detects disconnect
   └─ Enter recovery mode
      ├─ Attempt reconnection to bridge /acp
      ├─ If reconnected:
      │  └─ Call xworkmate.tasks.get for each pending task
      │     ├─ Task found → continue with snapshot
      │     └─ Session not found (all sessions lost)
      │        → Mark task as failed with ACP_BRIDGE_RESTART
      └─ If cannot reconnect:
         └─ Exponential backoff, max retries
            → Eventually mark as ACP_UNREACHABLE
Key files:
lib/runtime/gateway_runtime_core.dart (reconnection logic)
lib/runtime/external_code_agent_acp_desktop_transport.dart (recovery)
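The "exponential backoff, max retries" branch of S2 could look roughly like the Python sketch below. The base delay, growth factor, and cap are illustrative assumptions, not values taken from gateway_runtime_core.dart; `backoff_delays` and `reconnect` are hypothetical helper names.

```python
from typing import Callable, List


def backoff_delays(base_s: float, factor: float, cap_s: float,
                   max_retries: int) -> List[float]:
    """Delay before each retry: base * factor**n, capped at cap_s."""
    return [min(cap_s, base_s * factor ** n) for n in range(max_retries)]


def reconnect(try_connect: Callable[[], bool],
              delays: List[float],
              sleep: Callable[[float], None]) -> str:
    """Try once immediately, then once more after each delay."""
    if try_connect():
        return "reconnected"
    for delay in delays:
        sleep(delay)
        if try_connect():
            return "reconnected"
    return "ACP_UNREACHABLE"  # retries exhausted


# Illustrative schedule only: 1 s base, doubling, capped at 30 s.
print(backoff_delays(1.0, 2.0, 30.0, 6))
# → [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Passing `sleep` in as a parameter keeps the loop deterministic under test; production code would pass a real timer.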
S3: Network interruption mid-task
SSE stream interrupted (network flap)
└─ App: Transport detects stream close without terminal
   └─ Enter polling mode
      └─ Every N seconds: xworkmate.tasks.get({ sessionId, threadId, turnId })
         ├─ Terminal snapshot → apply result, mark ready, stop polling
         ├─ Still running → continue polling
         ├─ Session not found → mark failed
         └─ Max poll attempts reached → mark unrecovered
Critical parameters (check actual values in code):
- poll interval: ? seconds
- max poll attempts: ?
- total poll timeout: ?
Key files:
lib/runtime/external_code_agent_acp_desktop_transport.dart
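A minimal Python sketch of the S3 polling mode follows. The probe's return strings and the `poll_until_terminal` name are hypothetical; the real interval and attempt limit are the unknowns listed under "Critical parameters" and must be read from the transport code.

```python
from typing import Callable


def poll_until_terminal(probe: Callable[[], str],
                        max_attempts: int,
                        interval_s: float,
                        sleep: Callable[[float], None]) -> str:
    """Poll until a terminal answer arrives or the attempt budget runs out.

    probe() is assumed to return "terminal", "running", or "not_found".
    """
    for attempt in range(max_attempts):
        status = probe()
        if status == "terminal":
            return "ready"          # apply result, stop polling
        if status == "not_found":
            return "failed"         # session is gone
        if attempt < max_attempts - 1:
            sleep(interval_s)       # still running: wait, then retry
    return "unrecovered"            # max poll attempts reached
```

The loop skips the final sleep so the "unrecovered" verdict is not delayed by one extra interval.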
S4: OpenClaw gateway unreachable
Bridge side:
└─ Gateway client: gatewayruntime/runtime.go
   ├─ scheduleReconnect() with 2s delay
   │  └─ Suppressed for auth errors
   └─ openClawSilentFailureExceeded() → 10 min timeout
      └─ Mark task as OPENCLAW_GATEWAY_LOST
App side:
└─ Receives SSE session.update with status=failed
   └─ applyGatewayChatResult() → lastResultCode=OPENCLAW_GATEWAY_LOST
      └─ TaskThread lifecycleStatus → ready
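The two bridge-side rules in S4 can be sketched as follows. This is a Python illustration of the logic, not the Go code in gatewayruntime/runtime.go. The 2 s delay and 10 min limit come from the flow above; measuring silence from a "last activity" timestamp and using a strict comparison are assumptions.

```python
from typing import Optional

RECONNECT_DELAY_S = 2.0            # "2s delay" from the flow above
SILENT_FAILURE_LIMIT_S = 10 * 60.0  # "10 min timeout" from the flow above


def next_reconnect_delay_s(is_auth_error: bool) -> Optional[float]:
    """Return the reconnect delay, or None when reconnect is suppressed."""
    return None if is_auth_error else RECONNECT_DELAY_S


def silent_failure_exceeded(last_activity_s: float, now_s: float) -> bool:
    """True once the gateway has been silent for longer than the limit."""
    return (now_s - last_activity_s) > SILENT_FAILURE_LIMIT_S


def result_code(last_activity_s: float, now_s: float) -> Optional[str]:
    """Result code to record for the task, if any."""
    if silent_failure_exceeded(last_activity_s, now_s):
        return "OPENCLAW_GATEWAY_LOST"
    return None
```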
S4a: Agent fails before producing output
OpenClaw agent_end(success=false, runId, error)
└─ openclaw-multi-session-plugins persists xworkmate.taskRuns[runId]
   └─ App polls xworkmate.tasks.get
      └─ Native detached-task record absent
         └─ Plugin returns durable status=failed + sanitized error
            └─ Bridge preserves terminal failure (no artifact wait)
               └─ App clears pending and shows the model/provider error
expectedArtifactDirs remain scan hints only: an empty reports/ or artifacts/
directory cannot convert this terminal failure back to running.
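The lookup precedence described in S4a might be sketched as below. The dictionary shapes, the `resolve_snapshot` name, and the `wait_for_artifacts` flag are hypothetical; the final fallback reflects the bounded deadline path mentioned under R5, represented here simply as "running".

```python
from typing import Optional


def resolve_snapshot(native_record: Optional[dict],
                     durable_run: Optional[dict]) -> dict:
    """Pick the snapshot to return for one xworkmate.tasks.get call."""
    if native_record is not None:
        return native_record                  # native task registry wins
    if durable_run is not None and durable_run.get("status") == "failed":
        # Durable agent_end record: the failure is terminal, so no
        # artifact wait is started and empty directories are irrelevant.
        return {
            "status": "failed",
            "error": durable_run.get("error"),
            "wait_for_artifacts": False,
        }
    # Neither record exists: assumed fallback to the deadline path.
    return {"status": "running", "wait_for_artifacts": True}
```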
S5: App resend on OpenClaw lane busy
App: sendChatMessage() with executionTarget=gateway
└─ isOpenClawLaneIdle() → false (5 active tasks)
   └─ queueOpenClawGatewayWork()
      ├─ lifecycleStatus = queued
      ├─ Position in queue: N (max 20)
      ├─ Queue timeout: 10 min
      └─ drainOpenClawGatewayQueue()
         ├─ Poll for lane idle + position=0
         ├─ Lane becomes idle:
         │  └─ Dequeue → send normally
         └─ Queue timeout:
            ├─ lifecycleStatus = ready
            └─ lastResultCode = OPENCLAW_GATEWAY_BUSY
Note: The app-side queue is separate from the bridge-side admission gate.
The bridge applies its own admission control with the same limits: 5 active, 20 queued, 10 min timeout.
Double queue scenario:
App queue (5/20) → waits → sends to bridge
Bridge queue (5/20) → waits → sends to OpenClaw
Potential issue: The app queue drains once its lane is idle, but the bridge gate
may still be busy, adding a further delay that the app UI does not show.
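To make the double-queue effect concrete, here is a small Python sketch of a 5/20 admission gate and the resulting worst-case wait. `AdmissionGate` and `worst_case_wait_min` are hypothetical names, and rejecting work when the queue is full is an assumption; the document only states the limits.

```python
from dataclasses import dataclass


@dataclass
class AdmissionGate:
    """Counts-only model of a gate with 5 active slots and 20 queue slots."""
    max_active: int = 5
    max_queued: int = 20
    active: int = 0
    queued: int = 0

    def admit(self) -> str:
        if self.active < self.max_active:
            self.active += 1
            return "run"
        if self.queued < self.max_queued:
            self.queued += 1
            return "queued"
        return "rejected"  # assumed behaviour when the queue is full


def worst_case_wait_min(app_timeout_min: int = 10,
                        bridge_timeout_min: int = 10) -> int:
    """Upper bound on queue time when both layers are congested."""
    return app_timeout_min + bridge_timeout_min


print(worst_case_wait_min())  # → 20
```

With both layers at their 10 min timeout, a task can wait close to 20 minutes before reaching OpenClaw, and only the first 10 are visible in the app's queue state.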
Recovery State Machine
stateDiagram-v2
[*] --> Running: task submitted
Running --> Running: SSE streaming
Running --> Lost_Connection: socket close / network flap
Lost_Connection --> Polling: xworkmate.tasks.get
Polling --> Recovered: terminal snapshot
Polling --> Polling: still running
Polling --> Session_Not_Found: bridge restarted
Polling --> Max_Retries: exceeded
Recovered --> Ready: result applied; artifact sync status records missing outputs
Session_Not_Found --> Failed: ACP_BRIDGE_RESTART
Max_Retries --> Failed: ACP_UNRECOVERABLE
Running --> Failed: task error / gateway lost
Failed --> Ready: error recorded
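The diagram above can be encoded as a transition table, which makes illegal transitions easy to catch in tests. In this Python sketch the state names follow the diagram, while the event names are hypothetical labels derived from the edge descriptions.

```python
from typing import Dict, Iterable, Tuple

TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("Running", "sse_streaming"): "Running",
    ("Running", "connection_lost"): "Lost_Connection",
    ("Running", "task_error"): "Failed",
    ("Lost_Connection", "poll"): "Polling",
    ("Polling", "terminal_snapshot"): "Recovered",
    ("Polling", "still_running"): "Polling",
    ("Polling", "session_not_found"): "Session_Not_Found",
    ("Polling", "retries_exceeded"): "Max_Retries",
    ("Recovered", "result_applied"): "Ready",
    ("Session_Not_Found", "fail"): "Failed",   # ACP_BRIDGE_RESTART
    ("Max_Retries", "fail"): "Failed",         # ACP_UNRECOVERABLE
    ("Failed", "error_recorded"): "Ready",
}


def step(state: str, event: str) -> str:
    """Apply one event; reject transitions the diagram does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} on {event}") from None


def run(events: Iterable[str], state: str = "Running") -> str:
    """Fold a sequence of events over the table, starting at Running."""
    for event in events:
        state = step(state, event)
    return state
```

Every path in the diagram ends in Ready, so a thread should never remain in Failed or Polling once recovery has finished.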
Bridge Session Store (Memory-Only)
xworkmate-bridge: internal/acp/types.go
type Server struct {
    sessions map[string]*session // in-memory only, never persisted
}

type session struct {
    id         string
    threadId   string
    turnId     string
    runId      string
    sessionKey string
    openclaw   *OpenClawTaskRecord // nil for non-gateway sessions
    history    []message
    // ...
}
No persistence:
- Bridge restart → all sessions lost
- xworkmate.tasks.get returns "session not found"
- App must detect and mark as failed
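The effect of a memory-only store can be shown with a toy model. This Python sketch is not the bridge's Go implementation; `InMemoryBridge`, `SessionNotFound`, and the record fields are hypothetical, and a "restart" is modelled as constructing a new object.

```python
class SessionNotFound(Exception):
    """Raised when a lookup hits a session the store does not hold."""


class InMemoryBridge:
    """Toy bridge whose sessions live only in a process-local dict."""

    def __init__(self) -> None:
        self._sessions: dict = {}

    def create(self, session_id: str, turn_id: str) -> None:
        self._sessions[session_id] = {"turnId": turn_id, "status": "running"}

    def tasks_get(self, session_id: str) -> dict:
        try:
            return self._sessions[session_id]
        except KeyError:
            raise SessionNotFound(session_id) from None


bridge = InMemoryBridge()
bridge.create("s1", "t1")
bridge = InMemoryBridge()      # simulated restart: the dict is gone
try:
    bridge.tasks_get("s1")
except SessionNotFound:
    print("session not found")  # the app must now mark the task failed
```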
Key Bridge RPC Methods for Recovery
| Method | Params | Returns |
|---|---|---|
| xworkmate.tasks.get | appThreadKey, openclawSessionKey, runId/taskId | Native task snapshot, durable agent_end run snapshot, or structured lookup error |
| xworkmate.tasks.cancel | appThreadKey, openclawSessionKey, runId/taskId | Cancel confirmation |
| Removed: Bridge task reassociation | artifactScope/runId-derived taskHandle | No longer supported; route through native task registry |
App Recovery Flow (Detailed)
resolveGatewayThreadConnectionState(thread)
├─ thread.lifecycleStatus == "queued"
│  └─ drainOpenClawGatewayQueue()
│
├─ thread.lifecycleStatus == "running"
│  ├─ thread.lastTurnId exists?
│  │  ├─ Yes → transport.pollBridgeTaskSnapshot(turnId)
│  │  │  └─ xworkmate.tasks.get:
│  │  │     ├─ completed/failed → applyGatewayChatResult() and mark ready
│  │  │     ├─ running → leave as running, continue SSE
│  │  │     └─ not found / error:
│  │  │        ├─ isBridgeAvailable()
│  │  │        │  ├─ Yes → bridge restarted, mark failed
│  │  │        │  └─ No → network issue, retry later
│  │  │        └─ set lifecycleStatus = ready
│  │  │           set lastResultCode = ACP_SESSION_NOT_FOUND
│  │  │
│  │  └─ No → set lifecycleStatus = ready (no pending turn)
│  │
│  └─ no lastTurnId → ready
│
└─ thread.lifecycleStatus == "ready" || "archived"
   └─ No recovery needed
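The "not found / error" branch is the subtle one, so it is sketched separately here in Python. This follows one reading of the tree: the status and result code are written only when the bridge is reachable, and the thread is otherwise left untouched for a later retry. `classify_lookup_failure` and the returned dictionary are hypothetical.

```python
def classify_lookup_failure(bridge_available: bool) -> dict:
    """Decide what to record after xworkmate.tasks.get fails."""
    if bridge_available:
        # Bridge answers but has no session: it restarted and lost state.
        return {
            "lifecycleStatus": "ready",
            "lastResultCode": "ACP_SESSION_NOT_FOUND",
            "retry": False,
        }
    # Bridge unreachable: likely a network issue, so keep the thread
    # running and probe again later (assumed behaviour).
    return {
        "lifecycleStatus": "running",
        "lastResultCode": None,
        "retry": True,
    }
```

Probing availability before recording the failure is what keeps a short network outage from being reported as a lost session.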
Fragile Points for Recovery
- R1: Bridge restart detection: App must distinguish "bridge restarted, sessions lost" from "network temporarily down". Currently relies on xworkmate.tasks.get returning "not found" while bridge is reachable.
- R2: Double queuing: App has its own queue, bridge has admission gate. If both are congested, total wait time can exceed user expectations.
- R3: Stale running state: If app crashes mid-task, on restart the thread shows lifecycleStatus=running. The xworkmate.tasks.get probe is the only way to resolve.
- R4: Polling parameters: Hardcoded poll interval/retry values in ExternalCodeAgentAcpDesktopTransport need to align with bridge's task deadlines (10/30/60 min). If polling stops before deadline, app marks failed while task is still running.
- R5: OpenClaw handle expiration: The bridge's in-memory OpenClawTaskRecord is not authoritative after restart. The plugin's SessionEntry-backed agent_end record preserves known terminal states; runs that ended before this record was written still fall back to the bounded deadline path.
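R4 reduces to a simple arithmetic check that could run as a unit test. In the Python sketch below the interval and attempt count are placeholder values, not the transport's real settings, and the budget is approximated as interval times attempts, ignoring probe latency.

```python
def polling_budget_s(interval_s: float, max_attempts: int) -> float:
    """Approximate time the app keeps polling before giving up."""
    return interval_s * max_attempts


def covers_deadline(interval_s: float, max_attempts: int,
                    deadline_min: float) -> bool:
    """True when polling lasts at least as long as the bridge deadline."""
    return polling_budget_s(interval_s, max_attempts) >= deadline_min * 60


# Placeholder values: 5 s interval, 120 attempts = 600 s of polling.
for deadline in (10, 30, 60):
    print(deadline, covers_deadline(5.0, 120, deadline))
# → 10 True
# → 30 False
# → 60 False
```

With these placeholder values the app would stop polling after 10 minutes and mark a 30 or 60 minute task as failed while it is still running, which is the misalignment R4 warns about.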