From ba85a0236b40af6679157a83459460c924a1ef97 Mon Sep 17 00:00:00 2001
From: Haitao Pan
Date: Fri, 6 Feb 2026 12:22:31 +0800
Subject: [PATCH] feat: Add `expressions_v2.txt` with new redaction patterns for sensitive data.

---
 Runbook/Fix-Agent-404-And-UUID-Change.md      | 667 ++++++++++++++++++
 .../Fix-CloudRun-Stunnel-Startup-Failure.md   | 254 +++++++
 Runbook/README.md                             |   4 +
 3 files changed, 925 insertions(+)
 create mode 100644 Runbook/Fix-Agent-404-And-UUID-Change.md
 create mode 100644 Runbook/Fix-CloudRun-Stunnel-Startup-Failure.md

diff --git a/Runbook/Fix-Agent-404-And-UUID-Change.md b/Runbook/Fix-Agent-404-And-UUID-Change.md
new file mode 100644
index 0000000..1c3c3ee
--- /dev/null
+++ b/Runbook/Fix-Agent-404-And-UUID-Change.md
@@ -0,0 +1,667 @@
+# Fix Agent 404 Errors and User UUID Change
+
+**Date**: 2026-02-05
+**Owner**: SRE Team
+**Reviewer**: DevOps Lead
+**Last updated**: 2026-02-05T15:28:00+08:00
+
+## Problem Description
+
+### 1. Agent communication 404 errors
+- **Symptom**: The agent service receives 404 errors when reporting status to `accounts-svc-plus`
+- **Scope**: No agent node can report heartbeats or sync configuration
+- **Error log**:
+  ```
+  Feb 05 07:24:23 hk-xhttp.svc.plus agent-svc-plus[107285]:
+  {"time":"2026-02-05T07:24:23.907002669Z","level":"ERROR","msg":"xray config sync failed",
+  "component":"agent-xray-sync","target":"tcp",
+  "err":"list clients: controller returned 404 Not Found: 404 page not found"}
+
+  POST 404 https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/status
+  GET 404 https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users
+  ```
+
+### 2. User UUID change request
+- **User**: tester123@example.com
+- **Old UUID**: `4b66928e-a81e-4981-bae0-289ddb92439c`
+- **New UUID**: `18d270a9-533d-4b13-b3f1-e7f55540a9b2`
+- **Reason**: Business requirement; the user ID must be changed to a specified value
+
+### 3. Agent node data display issue
+- **Symptom**: The `/panel/agent` page shows "Loading control center..."
+- **Impact**: Users cannot view the status of running nodes
+
+## Root Cause Analysis
+
+### Root cause of the agent 404 errors
+
+1. 
**The code is already implemented correctly**:
+   - Lines 1061-1070 of `accounts.svc.plus/cmd/accountsvc/main.go` register the `/api/agent-server/v1/*` routes
+   - These include `GET /api/agent-server/v1/users` and `POST /api/agent-server/v1/status`
+
+2. **Production is not running the latest code**:
+   - The Cloud Run service `accounts-svc-plus` is running an older build
+   - The older build does not include the agent API routes
+   - Confirmed by test: `curl https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users` returns `404 page not found`
+
+3. **The agent configuration is correct**:
+   - Agent config file: `/etc/agent/account-agent.yaml`
+   - Controller URL: `https://accounts-svc-plus-266500572462.asia-northeast1.run.app`
+   - API token: configured correctly (matches `INTERNAL_SERVICE_TOKEN`)
+
+## Diagnostic Steps
+
+### 1. Check the agent logs
+```bash
+# View logs on the agent node
+ssh root@hk-xhttp.svc.plus
+journalctl -u agent-svc-plus -n 50 --no-pager
+
+# Error found
+# "err":"list clients: controller returned 404 Not Found: 404 page not found"
+```
+
+### 2. Check the agent configuration
+```bash
+# View the agent configuration
+ssh root@hk-xhttp.svc.plus "cat /etc/agent/account-agent.yaml"
+
+# Confirm the controller URL and token are configured correctly
+# controllerUrl: "https://accounts-svc-plus-266500572462.asia-northeast1.run.app"
+# apiToken: "uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I="
+```
+
+### 3. Test the API endpoint
+```bash
+# Test the /api/agent-server/v1/users endpoint
+curl -s -H "Authorization: Bearer uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I=" \
+  "https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users"
+
+# Returns: 404 page not found
+# Confirms the route is missing in production
+```
+
+### 4. Check the code
+```bash
+# Check the route registration code
+grep -n "registerAgentAPIRoutes" accounts.svc.plus/cmd/accountsvc/main.go
+
+# Line 852: registerAgentAPIRoutes(r, agentRegistry, gormSource, logger)
+# Line 1061: func registerAgentAPIRoutes(...)
+# Confirms the routes are implemented in the code
+```
+
+### 5. 
Check database constraints (UUID change)
+```bash
+# Connect to the PostgreSQL host
+ssh -i ~/.ssh/id_rsa root@postgresql.svc.plus
+
+# List foreign key constraints that reference users
+docker exec postgresql-svc-plus psql -U postgres -d account -c "
+  SELECT conname, conrelid::regclass
+  FROM pg_constraint
+  WHERE confrelid = 'public.users'::regclass;
+"
+
+# Result:
+# - identities_user_uuid_fkey
+# - sessions_user_uuid_fkey
+# - subscriptions_user_uuid_fkey
+```
+
+## Fixes
+
+### Fix 1: Deploy the latest code to Cloud Run ⚠️ **Critical fix**
+
+**Problem**: The production Cloud Run service is running an older build that lacks the agent API routes
+
+**Solution**: Rebuild and redeploy the `accounts-svc-plus` service
+
+```bash
+# 1. Enter the project directory
+cd /Users/shenlan/workspaces/cloud-neutral-toolkit/accounts.svc.plus
+
+# 2. Set the GCP project
+export GCP_PROJECT=xzerolab-480008
+
+# 3. Build and push the Docker image
+make cloudrun-build
+
+# 4. Deploy to Cloud Run
+make cloudrun-deploy
+
+# Or deploy directly with gcloud
+gcloud run deploy accounts-svc-plus \
+  --source . \
+  --project=xzerolab-480008 \
+  --region=asia-northeast1 \
+  --platform=managed \
+  --allow-unauthenticated
+```
+
+**Expected result**:
+- The Cloud Run service is updated to the latest version
+- The `/api/agent-server/v1/users` and `/api/agent-server/v1/status` endpoints are available
+- The agent can sync configuration successfully
+
+### Fix 2: Add a frontend Agent Server proxy route
+
+**File**: `console.svc.plus/src/app/api/agent-server/[...segments]/route.ts`
+
+```typescript
+export const dynamic = 'force-dynamic'
+
+import type { NextRequest } from 'next/server'
+
+import { createUpstreamProxyHandler } from '@lib/apiProxy'
+import { getAccountServiceBaseUrl } from '@server/serviceConfig'
+
+const AGENT_SERVER_PREFIX = '/api/agent-server'
+
+function createHandler() {
+  const upstreamBaseUrl = getAccountServiceBaseUrl()
+  return createUpstreamProxyHandler({
+    upstreamBaseUrl,
+    upstreamPathPrefix: AGENT_SERVER_PREFIX,
+  })
+}
+
+const handler = createHandler()
+
+export function GET(request: NextRequest) {
+  return handler(request)
+}
+
+export function POST(request: NextRequest) {
+  return handler(request)
+}
+
+export function PUT(request: NextRequest) {
+  return handler(request)
+}
+
+export function PATCH(request: NextRequest) {
+  
return handler(request)
+}
+
+export function DELETE(request: NextRequest) {
+  return handler(request)
+}
+
+export function HEAD(request: NextRequest) {
+  return handler(request)
+}
+
+export function OPTIONS(request: NextRequest) {
+  return handler(request)
+}
+```
+
+**Notes**:
+- Adds a proxy route that forwards frontend `/api/agent-server/*` requests to `accounts-svc-plus`
+- This route is mainly for frontend debugging; the agent service calls the Cloud Run URL directly
+
+### Fix 3: Improve Registry persistence and logging
+
+**File**: `accounts.svc.plus/internal/agentserver/registry.go`
+
+**Changes**:
+1. Add a `logger *slog.Logger` field to the `Registry` struct
+2. Add a `SetLogger()` method
+3. Add error logging in `RegisterAgent()` and `ReportStatus()`
+
+**Key code**:
+```go
+// Log persistence failures in ReportStatus
+if err := r.store.UpsertAgent(ctx, dbAgent); err != nil {
+    r.logger.Error("failed to persist agent status heartbeat", "agent", a.ID, "err", err)
+}
+
+// Log persistence failures in RegisterAgent
+if err := r.store.UpsertAgent(ctx, dbAgent); err != nil {
+    r.logger.Error("failed to persist dynamically registered agent", "agent", id, "err", err)
+}
+```
+
+**File**: `accounts.svc.plus/cmd/accountsvc/main.go`
+
+```go
+if agentRegistry != nil {
+    agentRegistry.SetStore(st)
+    agentRegistry.SetLogger(logger.With("component", "agent-registry"))
+    // ... remaining code
+}
+```
+
+### Fix 4: User UUID change
+
+**Connect to the database**:
+```bash
+ssh -i ~/.ssh/id_rsa root@postgresql.svc.plus
+```
+
+**Run the SQL transaction**:
+```sql
+BEGIN;
+
+-- 1. Rename the old user (avoids unique constraint conflicts)
+UPDATE users
+SET username = username || '_old',
+    email = email || '_old'
+WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
+
+-- 2. 
Create the new user record (with the new UUID)
+INSERT INTO users (
+  uuid, username, password, email, role, level, groups, permissions,
+  created_at, updated_at, version, origin_node, mfa_totp_secret,
+  mfa_enabled, mfa_secret_issued_at, mfa_confirmed_at, email_verified_at
+)
+SELECT
+  '18d270a9-533d-4b13-b3f1-e7f55540a9b2',
+  REPLACE(username, '_old', ''),
+  password,
+  REPLACE(email, '_old', ''),
+  role, level, groups, permissions,
+  created_at, updated_at, version, origin_node, mfa_totp_secret,
+  mfa_enabled, mfa_secret_issued_at, mfa_confirmed_at, email_verified_at
+FROM users
+WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
+
+-- 3. Update all foreign key references
+UPDATE identities
+SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2'
+WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
+
+UPDATE sessions
+SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2'
+WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
+
+UPDATE subscriptions
+SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2'
+WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
+
+-- 4. 
Delete the old user record
+DELETE FROM users
+WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
+
+COMMIT;
+```
+
+**Command**:
+```bash
+docker exec postgresql-svc-plus psql -U postgres -d account -c "
+BEGIN;
+UPDATE users SET username = username || '_old', email = email || '_old' WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
+INSERT INTO users (uuid, username, password, email, role, level, groups, permissions, created_at, updated_at, version, origin_node, mfa_totp_secret, mfa_enabled, mfa_secret_issued_at, mfa_confirmed_at, email_verified_at)
+SELECT '18d270a9-533d-4b13-b3f1-e7f55540a9b2', REPLACE(username, '_old', ''), password, REPLACE(email, '_old', ''), role, level, groups, permissions, created_at, updated_at, version, origin_node, mfa_totp_secret, mfa_enabled, mfa_secret_issued_at, mfa_confirmed_at, email_verified_at
+FROM users WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
+UPDATE identities SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2' WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
+UPDATE sessions SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2' WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
+UPDATE subscriptions SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2' WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
+DELETE FROM users WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
+COMMIT;
+"
+```
+
+**Status**: ✅ Done
+
+### Fix 5: Improve frontend error handling
+
+**File**: `console.svc.plus/src/modules/extensions/builtin/user-center/routes/agent.tsx`
+
+**Changes**:
+1. Improve error handling in the `fetcher` function
+2. 
Display error messages
+
+```typescript
+async function fetcher(url: string): Promise<VlessNode[]> {
+  const res = await fetch(url, { credentials: 'include', cache: 'no-store' })
+
+  const payload = await res.json().catch(() => null)
+  if (!res.ok) {
+    const message =
+      (payload && typeof payload.message === 'string' && payload.message) ||
+      (payload && typeof payload.error === 'string' && payload.error) ||
+      `Request failed (${res.status})`
+    throw new Error(message)
+  }
+
+  if (Array.isArray(payload)) {
+    return payload as VlessNode[]
+  }
+  if (payload && Array.isArray((payload as { nodes?: unknown }).nodes)) {
+    return (payload as { nodes: VlessNode[] }).nodes
+  }
+
+  return []
+}
+
+// Show the error in the UI
+{error && (
+
+    {language === 'zh'
+      ? `节点列表加载失败:${error.message}`
+      : `Failed to load agent nodes: ${error.message}`}
+
+)}
+```
+
+**Status**: ✅ Done
+
+## Verification
+
+### 1. Verify the Cloud Run deployment ⚠️ **Critical check**
+
+```bash
+# Test the agent API endpoint
+curl -s -H "Authorization: Bearer uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I=" \
+  "https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users"
+
+# Expected: a JSON response containing the user list
+# {
+#   "clients": [...],
+#   "total": N,
+#   "generated_at": "2026-02-05T07:30:00Z"
+# }
+
+# If it still returns 404, the deployment did not succeed
+```
+
+### 2. Verify agent sync
+
+```bash
+# Watch logs on the agent node
+ssh root@hk-xhttp.svc.plus
+journalctl -u agent-svc-plus -f
+
+# Expected:
+# - "xray config synced successfully"
+# - No 404 errors
+```
+
+### 3. Verify the UUID change
+
+```bash
+# Query the new UUID
+docker exec postgresql-svc-plus psql -U postgres -d account -c "
+  SELECT uuid, username, email
+  FROM users
+  WHERE email = 'tester123@example.com';
+"
+
+# Expected result:
+#                 uuid                 | username  |         email
+# --------------------------------------+-----------+-----------------------
+#  18d270a9-533d-4b13-b3f1-e7f55540a9b2 | tester123 | tester123@example.com
+```
+
+### 4. Verify related data
+
+```bash
+# Check that subscriptions are linked to the new UUID
+docker exec postgresql-svc-plus psql -U postgres -d account -c "
+  SELECT user_uuid, external_id, status
+  FROM subscriptions
+  WHERE user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
+"
+```
+
+### 5. Verify the frontend
+
+```bash
+# Visit https://www.svc.plus/panel/agent
+# Confirm the page loads normally
+# If you see a 401 error, check how the auth token is passed
+```
+
+## Deployment Steps
+
+### Step 1: Deploy accounts-svc-plus to Cloud Run
+
+```bash
+# 1. Enter the project directory
+cd /Users/shenlan/workspaces/cloud-neutral-toolkit/accounts.svc.plus
+
+# 2. Make sure the code is committed
+git status
+git add .
+git commit -m "feat: add agent API routes for /api/agent-server/v1"
+git push
+
+# 3. Set environment variables
+export GCP_PROJECT=xzerolab-480008
+export GCP_REGION=asia-northeast1
+
+# 4. Build the image (if using the Makefile)
+make cloudrun-build
+
+# 5. Update service.yaml to use Secret Manager
+# Make sure INTERNAL_SERVICE_TOKEN in service.yaml is configured with valueFrom: secretKeyRef
+
+# 6. Deploy the service
+make cloudrun-deploy
+
+# Or use gcloud
+gcloud run deploy accounts-svc-plus \
+  --source . 
\
+  --project=$GCP_PROJECT \
+  --region=$GCP_REGION \
+  --platform=managed \
+  --allow-unauthenticated
+
+# 7. Wait for the deployment to finish
+# Expected output: Service [accounts-svc-plus] revision [accounts-svc-plus-xxxxx] has been deployed
+```
+
+### Step 2: Verify the deployment
+
+```bash
+# Test the API endpoint
+curl -s -H "Authorization: Bearer uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I=" \
+  "https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users"
+
+# Should return JSON rather than 404
+```
+
+### Step 3: Monitor the agent logs
+
+```bash
+# Monitor logs on the agent node
+ssh root@hk-xhttp.svc.plus
+journalctl -u agent-svc-plus -f
+
+# Wait for the next sync cycle (5 minutes)
+# Confirm there are no 404 errors
+```
+
+## Rollback Plan
+
+### If the Cloud Run deployment causes problems
+
+```bash
+# 1. List previous revisions
+gcloud run revisions list \
+  --service=accounts-svc-plus \
+  --project=xzerolab-480008 \
+  --region=asia-northeast1
+
+# 2. Roll back to a previous revision
+gcloud run services update-traffic accounts-svc-plus \
+  --to-revisions=PREVIOUS_REVISION=100 \
+  --project=xzerolab-480008 \
+  --region=asia-northeast1
+```
+
+### If the UUID change causes problems
+
+```sql
+-- Reverse operation (requires a data backup taken beforehand)
+BEGIN;
+
+-- Rename the current user
+UPDATE users
+SET username = username || '_new',
+    email = email || '_new'
+WHERE uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
+
+-- Restore the old UUID
+INSERT INTO users (uuid, username, password, email, ...)
+SELECT '4b66928e-a81e-4981-bae0-289ddb92439c',
+       REPLACE(username, '_new', ''),
+       ...
+FROM users
+WHERE uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
+
+-- Update foreign keys
+UPDATE identities SET user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c'
+WHERE user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
+
+UPDATE sessions SET user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c'
+WHERE user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
+
+UPDATE subscriptions SET user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c'
+WHERE user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
+
+-- Delete the new record
+DELETE FROM users WHERE uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
+
+COMMIT;
+```
+
+## Known Issues
+
+### 1. 
`/api/agent/nodes` returns 401
+- **Symptom**: The frontend receives 401 Unauthorized when calling `/api/agent/nodes`
+- **Cause**: The auth token is not passed correctly to this endpoint
+- **Impact**: Users cannot view the node list
+- **Status**: Pending fix
+- **Workaround**: Call the backend API directly or use an admin account
+
+### 2. Agent API routes not deployed to production ⚠️ **Blocker**
+- **Symptom**: The Cloud Run service returns 404
+- **Cause**: Production is running an older build
+- **Impact**: Agents cannot sync configuration
+- **Status**: **Needs immediate deployment**
+- **Fix**: Run `make cloudrun-deploy`
+
+## Related Documents
+
+- [Agent architecture](../docs/agent-architecture.md)
+- [Database schema](../sql/schema.sql)
+- [API route configuration](../api/api.go)
+- [Cloud Run deployment guide](../deploy/gcp/cloud-run/README.md)
+
+## Appendix
+
+### Example agent configuration
+
+**File**: `/etc/agent/account-agent.yaml`
+
+```yaml
+mode: "agent"
+
+log:
+  level: info
+
+agent:
+  id: "hk-proxy-server"
+  controllerUrl: "https://accounts-svc-plus-266500572462.asia-northeast1.run.app"
+  apiToken: "uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I="
+  httpTimeout: 15s
+  statusInterval: 1m
+  syncInterval: 5m
+  tls:
+    insecureSkipVerify: false
+
+xray:
+  sync:
+    enabled: true
+    interval: 5m
+    targets:
+      - name: "xhttp"
+        outputPath: "/usr/local/etc/xray/config.json"
+        templatePath: "/usr/local/etc/xray/templates/xray.xhttp.template.json"
+        restartCommand:
+          - "systemctl"
+          - "restart"
+          - "xray.service"
+      - name: "tcp"
+        outputPath: "/usr/local/etc/xray/tcp-config.json"
+        templatePath: "/usr/local/etc/xray/templates/xray.tcp.template.json"
+        restartCommand:
+          - "systemctl"
+          - "restart"
+          - "xray-tcp.service"
+```
+
+### Database connection details
+
+```bash
+# SSH connection
+ssh -i ~/.ssh/id_rsa root@postgresql.svc.plus
+
+# Docker container name
+postgresql-svc-plus
+
+# Database name
+account
+
+# Username
+postgres
+
+# Password
+See the .env file
+```
+
+### Related services
+
+- **accounts-svc-plus**: Cloud Run service that handles authentication and user management
+  - URL: `https://accounts-svc-plus-266500572462.asia-northeast1.run.app`
+  - Domain: `https://accounts.svc.plus`
+- **console.svc.plus**: Frontend console
+  - URL: `https://www.svc.plus`
+- **agent.svc.plus**: Agent service nodes
+  - Nodes: `hk-xhttp.svc.plus`, `jp-xhttp.svc.plus`, `us-xhttp.svc.plus`
+
+### Monitoring and logs
+
+```bash
+# View Cloud Run logs
+gcloud run 
services logs read accounts-svc-plus \
+  --project=xzerolab-480008 \
+  --region=asia-northeast1 \
+  --limit=100
+
+# View agent logs
+ssh root@hk-xhttp.svc.plus "journalctl -u agent-svc-plus -n 100 --no-pager"
+
+# View database logs
+ssh -i ~/.ssh/id_rsa root@postgresql.svc.plus \
+  "docker logs postgresql-svc-plus --tail=100"
+```
+
+### Key API endpoints
+
+```bash
+# Agent API endpoints (require a Bearer token)
+GET  /api/agent-server/v1/users    # List users
+POST /api/agent-server/v1/status   # Report agent status
+
+# User API endpoints (require user authentication)
+GET  /api/agent/nodes              # List agent nodes
+
+# Auth endpoints
+GET  /api/auth/session             # Get the current session
+POST /api/auth/login               # User login
+```
+
+### Troubleshooting checklist
+
+- [ ] Check that the Cloud Run service is running the latest version
+- [ ] Verify the agent API endpoints return 200 rather than 404
+- [ ] Check the controller URL and token in the agent config file
+- [ ] Review agent logs to confirm there are no 404 errors
+- [ ] Verify the UUID is updated correctly in the database
+- [ ] Check that all foreign key references point to the new UUID
+- [ ] Test that the frontend page loads normally
+- [ ] Monitor Cloud Run logs to confirm there are no new errors
diff --git a/Runbook/Fix-CloudRun-Stunnel-Startup-Failure.md b/Runbook/Fix-CloudRun-Stunnel-Startup-Failure.md
new file mode 100644
index 0000000..2afa196
--- /dev/null
+++ b/Runbook/Fix-CloudRun-Stunnel-Startup-Failure.md
@@ -0,0 +1,254 @@
+# Cloud Run Stunnel Sidecar Startup Failure Prevents Service Startup
+
+**Type**: Troubleshooting
+**Severity**: P1 (Critical)
+**Last updated**: 2026-01-28
+**Owner**: SRE Team
+
+---
+
+## 📋 Problem Description
+
+When deploying the `accounts-svc-plus` service to Cloud Run, the container fails to start with:
+
+```
+ERROR: (gcloud.run.services.update) The user-provided container failed to start
+and listen on the port defined provided by the PORT=8080 environment variable
+within the allocated timeout. 
+```
+
+**Example error log link**:
+```
+https://console.cloud.google.com/logs/viewer?project=xzerolab-480008&resource=cloud_run_revision/service_name/accounts-svc-plus/revision_name/accounts-svc-plus-00049-gjv
+```
+
+---
+
+## 🎯 Impact
+
+- **Service**: `accounts-svc-plus` (account service)
+- **Affected features**:
+  - User login/registration
+  - Account management API
+  - All downstream systems that depend on the account service
+- **Affected users**: All users
+- **Duration**: Until the fix is complete
+
+---
+
+## 🔍 Root Cause Analysis
+
+### Architecture background
+The service is deployed using the **sidecar pattern**:
+- **Main container**: `accounts-api` (Go application)
+- **Sidecar container**: `stunnel-sidecar` (TLS tunnel used to connect to the remote PostgreSQL)
+
+### Failure chain
+1. **Stunnel configuration problem**:
+   - `stunnel.conf` sets the PID file path to `/var/run/stunnel/stunnel-account-db-client.pid`
+   - In the sidecar container (`dweomer/stunnel`) that directory does not exist or is not writable
+   - The Stunnel process fails to start
+
+2. **Main container startup dependency**:
+   - The `entrypoint.sh` script checks whether `DB_HOST:DB_PORT` (127.0.0.1:15432) is reachable
+   - Stunnel not started → nothing listening on port 15432
+   - The main application fails to connect to the database → the process exits
+
+3. **Cloud Run health check**:
+   - `startupProbe` checks for a TCP connection on port 8080
+   - Main application not started → health check fails
+   - Cloud Run marks the container as failed to start
+
+### Configuration file locations
+- **Stunnel config**: `deploy/gcp/cloud-run/stunnel.conf`
+- **Secret management**: Google Secret Manager (`stunnel-config`)
+- **Service YAML**: `deploy/gcp/cloud-run/service.yaml`
+
+---
+
+## 🛠️ Diagnostic Steps
+
+### 1. View Cloud Run logs
+```bash
+gcloud logging read "resource.type=cloud_run_revision \
+  AND resource.labels.service_name=accounts-svc-plus" \
+  --limit 50 --format json --project xzerolab-480008
+```
+
+**Key error messages**:
+- `stunnel: Cannot create pid file`
+- `nc: connect to 127.0.0.1 port 15432 (tcp) failed: Connection refused`
+- `stunnel not ready after 30s`
+
+### 2. Check the Stunnel configuration
+```bash
+# List the current Secret versions
+gcloud secrets versions list stunnel-config --project xzerolab-480008
+
+# View the configuration contents
+gcloud secrets versions access latest --secret=stunnel-config \
+  --project xzerolab-480008
+```
+
+### 3. 
Reproduce locally (optional)
+```bash
+# Pull the sidecar image
+docker pull dweomer/stunnel
+
+# Test the configuration
+docker run --rm -v $(pwd)/deploy/gcp/cloud-run/stunnel.conf:/etc/stunnel/stunnel.conf \
+  dweomer/stunnel stunnel /etc/stunnel/stunnel.conf
+```
+
+---
+
+## ✅ Fix
+
+### Step 1: Modify the Stunnel configuration
+
+Edit `deploy/gcp/cloud-run/stunnel.conf`:
+
+```diff
+ ; Stunnel configuration for Cloud Run (client mode)
+-pid = /var/run/stunnel/stunnel-account-db-client.pid
+-output = /var/run/stunnel/stunnel-account-db-client.log
++pid = /tmp/stunnel.pid
++# output = /dev/stdout
+ foreground = yes
+```
+
+**Notes on the change**:
+- The `/tmp` directory is writable in all containers
+- With `output` commented out, logs go to stdout/stderr by default (Cloud Run collects them automatically)
+- `foreground = yes` keeps the process from running in the background (required by Cloud Run)
+
+### Step 2: Update the Secret
+```bash
+cd /path/to/accounts.svc.plus
+
+# Update the Secret (creates a new version)
+make cloudrun-stunnel GCP_PROJECT=xzerolab-480008
+
+# Or run manually
+gcloud secrets versions add stunnel-config \
+  --data-file deploy/gcp/cloud-run/stunnel.conf \
+  --project xzerolab-480008
+```
+
+### Step 3: Redeploy the service
+```bash
+# Trigger a new deployment (pulls the latest Secret version)
+make cloudrun-deploy GCP_PROJECT=xzerolab-480008
+
+# Or run manually
+gcloud run services replace deploy/gcp/cloud-run/service.yaml \
+  --region asia-northeast1 \
+  --project xzerolab-480008
+```
+
+---
+
+## 🧪 Verification
+
+### 1. Check deployment status
+```bash
+gcloud run services describe accounts-svc-plus \
+  --region asia-northeast1 \
+  --project xzerolab-480008 \
+  --format="value(status.conditions)"
+```
+
+**Expected output**: `Ready: True`
+
+### 2. Test the health check
+```bash
+SERVICE_URL=$(gcloud run services describe accounts-svc-plus \
+  --region asia-northeast1 \
+  --project xzerolab-480008 \
+  --format="value(status.url)")
+
+curl -f "${SERVICE_URL}/healthz"
+```
+
+**Expected output**: `{"status":"ok"}`
+
+### 3. Test the login API
+```bash
+curl -X POST "${SERVICE_URL}/api/auth/login" \
+  -H "Content-Type: application/json" \
+  -d '{"email":"test@example.com","password":"test123"}'
+```
+
+**Expected output**: An error response (such as `user_not_found`), not a connection timeout or a 500 error
+
+### 4. 
View live logs
+```bash
+gcloud run services logs read accounts-svc-plus \
+  --region asia-northeast1 \
+  --project xzerolab-480008 \
+  --limit 20
+```
+
+**Key success messages**:
+- `Service [postgres-client] accepted connection from 127.0.0.1`
+- `s_connect: connected :443`
+- `configured cors`
+- `starting account service`
+
+---
+
+## 🔄 Rollback Plan
+
+If the fix fails, roll back as follows:
+
+### Option A: Roll back to the last stable revision
+```bash
+# List revision history
+gcloud run revisions list --service accounts-svc-plus \
+  --region asia-northeast1 \
+  --project xzerolab-480008
+
+# Roll back to a specific revision (replace REVISION_NAME)
+gcloud run services update-traffic accounts-svc-plus \
+  --to-revisions REVISION_NAME=100 \
+  --region asia-northeast1 \
+  --project xzerolab-480008
+```
+
+### Option B: Temporarily disable Stunnel (test environments only)
+Edit `deploy/gcp/cloud-run/service.yaml` to remove the `stunnel-sidecar` container and switch the database connection to a direct connection (requires Cloud SQL Proxy or public network access).
+
+---
+
+## 📚 Related Documents
+
+- [Cloud Run Troubleshooting Guide](https://cloud.google.com/run/docs/troubleshooting)
+- [Stunnel Documentation](https://www.stunnel.org/docs.html)
+- [Cloud Run Sidecar Pattern](https://cloud.google.com/run/docs/deploying#sidecars)
+- [Google Secret Manager](https://cloud.google.com/secret-manager/docs)
+
+---
+
+## 📝 Lessons Learned
+
+### Preventive measures
+1. **Local testing**: Simulate the sidecar environment with Docker Compose before deploying
+2. **Configuration validation**: Add a CI/CD step that validates `stunnel.conf` syntax
+3. 
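**Monitoring and alerting**: Configure an alert for Cloud Run startup failures (Alerting Policy)
+
+### Best practices
+- Sidecar configuration files should use paths that are writable inside the container (such as `/tmp`)
+- Prefer stdout/stderr for log output so that Cloud Run can aggregate logs
+- The main container's startup script should set a reasonable timeout when waiting on dependencies (currently 30s)
+
+### Suggested improvements
+- Consider replacing Stunnel with Cloud SQL Proxy (the officially recommended approach)
+- Add a Stunnel health check endpoint (such as an HTTP status endpoint)
+- Add more detailed diagnostic logging to `entrypoint.sh`
+
+---
+
+**Case ID**: CASE-2026-01-28-001
+**Created**: 2026-01-28 23:19
+**Resolved**: 2026-01-28 23:19
+**Total time**: ~20 minutes
diff --git a/Runbook/README.md b/Runbook/README.md
index d44e5b0..288c9c7 100644
--- a/Runbook/README.md
+++ b/Runbook/README.md
@@ -15,6 +15,10 @@
 - Emergency response
 
 ### 🔄 Operations
+- [Security Scrubbing Archive](./Security-Scrubbing-Archive-2026-02-06.md) - Record of the deep scrubbing of historical sensitive information.
+- [Fix Rotating UUID Sync Archive](./Fix-Rotating-UUID-Sync-Archive-2026-02-06.md) - Record of the fix for the P1 Sandbox rotating UUID sync failure.
+- [Fix Agent 404 and UUID Change](./Fix-Agent-404-And-UUID-Change.md) - Resolves the agent API authentication 404 errors and the UUID change.
+- [Fix CloudRun Stunnel Startup Failure](./Fix-CloudRun-Stunnel-Startup-Failure.md) - Resolves the Stunnel startup failure on Cloud Run.
 - Routine maintenance
 - Backup and recovery
 - Performance tuning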