feat: Add expressions_v2.txt with new redaction patterns for sensitive data.

This commit is contained in:
Haitao Pan 2026-02-06 12:22:31 +08:00
parent d0537acafd
commit ba85a0236b
3 changed files with 925 additions and 0 deletions

View File

@ -0,0 +1,667 @@
# 修复 Agent 404 错误和用户 UUID 变更
**日期**: 2026-02-05
**负责人**: SRE Team
**审核人**: DevOps Lead
**最后更新**: 2026-02-05T15:28:00+08:00
## 问题描述
### 1. Agent 通信 404 错误
- **现象**: Agent 服务在向 `accounts-svc-plus` 报告状态时收到 404 错误
- **影响范围**: 所有 agent 节点无法正常上报心跳和配置同步
- **错误日志**:
```
Feb 05 07:24:23 hk-xhttp.svc.plus agent-svc-plus[107285]:
{"time":"2026-02-05T07:24:23.907002669Z","level":"ERROR","msg":"xray config sync failed",
"component":"agent-xray-sync","target":"tcp",
"err":"list clients: controller returned 404 Not Found: 404 page not found"}
POST 404 https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/status
GET 404 https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users
```
### 2. 用户 UUID 变更需求
- **用户**: tester123@example.com
- **原 UUID**: `4b66928e-a81e-4981-bae0-289ddb92439c`
- **新 UUID**: `18d270a9-533d-4b13-b3f1-e7f55540a9b2`
- **原因**: 业务需求,需要将用户 ID 更改为指定值
### 3. Agent 节点数据显示问题
- **现象**: `/panel/agent` 页面显示 "Loading control center..."
- **影响**: 用户无法查看运行节点状态
## 根本原因分析
### Agent 404 错误的根本原因
1. **代码已正确实现**
- `accounts.svc.plus/cmd/accountsvc/main.go` 第 1061-1070 行已注册 `/api/agent-server/v1/*` 路由
- 包括 `GET /api/agent-server/v1/users``POST /api/agent-server/v1/status`
2. **生产环境未部署最新代码**
- Cloud Run 服务 `accounts-svc-plus` 运行的是旧版本代码
- 旧版本不包含 agent API 路由
- 测试确认:`curl https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users` 返回 `404 page not found`
3. **Agent 配置正确**
- Agent 配置文件:`/etc/agent/account-agent.yaml`
- Controller URL: `https://accounts-svc-plus-266500572462.asia-northeast1.run.app`
- API Token: 正确配置(与 `INTERNAL_SERVICE_TOKEN` 匹配)
## 诊断步骤
### 1. 检查 Agent 日志
```bash
# 在 agent 节点上查看日志
ssh root@hk-xhttp.svc.plus
journalctl -u agent-svc-plus -n 50 --no-pager
# 发现错误
# "err":"list clients: controller returned 404 Not Found: 404 page not found"
```
### 2. 检查 Agent 配置
```bash
# 查看 agent 配置
ssh root@hk-xhttp.svc.plus "cat /etc/agent/account-agent.yaml"
# 确认 controller URL 和 token 配置正确
# controllerUrl: "https://accounts-svc-plus-266500572462.asia-northeast1.run.app"
# apiToken: "uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I="
```
### 3. 测试 API 端点
```bash
# 测试 /api/agent-server/v1/users 端点
curl -s -H "Authorization: Bearer uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I=" \
"https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users"
# 返回: 404 page not found
# 确认生产环境缺少该路由
```
### 4. 检查代码实现
```bash
# 检查路由注册代码
grep -n "registerAgentAPIRoutes" accounts.svc.plus/cmd/accountsvc/main.go
# 第 852 行: registerAgentAPIRoutes(r, agentRegistry, gormSource, logger)
# 第 1061 行: func registerAgentAPIRoutes(...)
# 确认代码中已正确实现
```
### 5. 检查数据库约束UUID 变更)
```bash
# 连接到 PostgreSQL
ssh -i ~/.ssh/id_rsa root@postgresql.svc.plus
# 查看外键约束
docker exec postgresql-svc-plus psql -U postgres -d account -c "
SELECT conname, conrelid::regclass
FROM pg_constraint
WHERE confrelid = 'public.users'::regclass;
"
# 结果显示:
# - identities_user_uuid_fkey
# - sessions_user_uuid_fkey
# - subscriptions_user_uuid_fkey
```
## 修复方案
### 修复 1: 部署最新代码到 Cloud Run ⚠️ **关键修复**
**问题**: 生产环境的 Cloud Run 服务运行的是旧版本代码,缺少 agent API 路由
**解决方案**: 重新构建和部署 `accounts-svc-plus` 服务
```bash
# 1. 进入项目目录
cd /Users/shenlan/workspaces/cloud-neutral-toolkit/accounts.svc.plus
# 2. 设置 GCP 项目
export GCP_PROJECT=xzerolab-480008
# 3. 构建并推送 Docker 镜像
make cloudrun-build
# 4. 部署到 Cloud Run
make cloudrun-deploy
# 或者使用 gcloud 命令直接部署
gcloud run deploy accounts-svc-plus \
--source . \
--project=xzerolab-480008 \
--region=asia-northeast1 \
--platform=managed \
--allow-unauthenticated
```
**预期结果**:
- Cloud Run 服务更新为最新版本
- `/api/agent-server/v1/users``/api/agent-server/v1/status` 端点可用
- Agent 能够成功同步配置
### 修复 2: 添加前端 Agent Server 代理路由
**文件**: `console.svc.plus/src/app/api/agent-server/[...segments]/route.ts`
```typescript
export const dynamic = 'force-dynamic'
import type { NextRequest } from 'next/server'
import { createUpstreamProxyHandler } from '@lib/apiProxy'
import { getAccountServiceBaseUrl } from '@server/serviceConfig'
const AGENT_SERVER_PREFIX = '/api/agent-server'
function createHandler() {
const upstreamBaseUrl = getAccountServiceBaseUrl()
return createUpstreamProxyHandler({
upstreamBaseUrl,
upstreamPathPrefix: AGENT_SERVER_PREFIX,
})
}
const handler = createHandler()
export function GET(request: NextRequest) {
return handler(request)
}
export function POST(request: NextRequest) {
return handler(request)
}
export function PUT(request: NextRequest) {
return handler(request)
}
export function PATCH(request: NextRequest) {
return handler(request)
}
export function DELETE(request: NextRequest) {
return handler(request)
}
export function HEAD(request: NextRequest) {
return handler(request)
}
export function OPTIONS(request: NextRequest) {
return handler(request)
}
```
**说明**:
- 创建代理路由将前端的 `/api/agent-server/*` 请求转发到 `accounts-svc-plus`
- 这个路由主要用于前端调试agent 服务直接调用 Cloud Run URL
### 修复 3: 增强 Registry 持久化和日志
**文件**: `accounts.svc.plus/internal/agentserver/registry.go`
**变更**:
1. 添加 `logger *slog.Logger` 字段到 `Registry` 结构体
2. 添加 `SetLogger()` 方法
3. 在 `RegisterAgent()``ReportStatus()` 中添加错误日志
**关键代码**:
```go
// 在 ReportStatus 中添加日志
if err := r.store.UpsertAgent(ctx, dbAgent); err != nil {
r.logger.Error("failed to persist agent status heartbeat", "agent", a.ID, "err", err)
}
// 在 RegisterAgent 中添加日志
if err := r.store.UpsertAgent(ctx, dbAgent); err != nil {
r.logger.Error("failed to persist dynamically registered agent", "agent", id, "err", err)
}
```
**文件**: `accounts.svc.plus/cmd/accountsvc/main.go`
```go
if agentRegistry != nil {
agentRegistry.SetStore(st)
agentRegistry.SetLogger(logger.With("component", "agent-registry"))
// ... 其余代码
}
```
### 修复 4: 用户 UUID 变更
**连接数据库**:
```bash
ssh -i ~/.ssh/id_rsa root@postgresql.svc.plus
```
**执行 SQL 事务**:
```sql
BEGIN;
-- 1. 重命名旧用户(避免唯一约束冲突)
UPDATE users
SET username = username || '_old',
email = email || '_old'
WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
-- 2. 创建新用户记录(使用新 UUID
INSERT INTO users (
uuid, username, password, email, role, level, groups, permissions,
created_at, updated_at, version, origin_node, mfa_totp_secret,
mfa_enabled, mfa_secret_issued_at, mfa_confirmed_at, email_verified_at
)
SELECT
'18d270a9-533d-4b13-b3f1-e7f55540a9b2',
REPLACE(username, '_old', ''),
password,
REPLACE(email, '_old', ''),
role, level, groups, permissions,
created_at, updated_at, version, origin_node, mfa_totp_secret,
mfa_enabled, mfa_secret_issued_at, mfa_confirmed_at, email_verified_at
FROM users
WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
-- 3. 更新所有外键引用
UPDATE identities
SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2'
WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
UPDATE sessions
SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2'
WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
UPDATE subscriptions
SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2'
WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
-- 4. 删除旧用户记录
DELETE FROM users
WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
COMMIT;
```
**执行命令**:
```bash
docker exec postgresql-svc-plus psql -U postgres -d account -c "
BEGIN;
UPDATE users SET username = username || '_old', email = email || '_old' WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
INSERT INTO users (uuid, username, password, email, role, level, groups, permissions, created_at, updated_at, version, origin_node, mfa_totp_secret, mfa_enabled, mfa_secret_issued_at, mfa_confirmed_at, email_verified_at)
SELECT '18d270a9-533d-4b13-b3f1-e7f55540a9b2', REPLACE(username, '_old', ''), password, REPLACE(email, '_old', ''), role, level, groups, permissions, created_at, updated_at, version, origin_node, mfa_totp_secret, mfa_enabled, mfa_secret_issued_at, mfa_confirmed_at, email_verified_at
FROM users WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
UPDATE identities SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2' WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
UPDATE sessions SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2' WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
UPDATE subscriptions SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2' WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
DELETE FROM users WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
COMMIT;
"
```
**状态**: ✅ 已完成
### 修复 5: 改进前端错误处理
**文件**: `console.svc.plus/src/modules/extensions/builtin/user-center/routes/agent.tsx`
**变更**:
1. 改进 `fetcher` 函数的错误处理
2. 添加错误消息显示
```typescript
async function fetcher(url: string): Promise<VlessNode[]> {
const res = await fetch(url, { credentials: 'include', cache: 'no-store' })
const payload = await res.json().catch(() => null)
if (!res.ok) {
const message =
(payload && typeof payload.message === 'string' && payload.message) ||
(payload && typeof payload.error === 'string' && payload.error) ||
`Request failed (${res.status})`
throw new Error(message)
}
if (Array.isArray(payload)) {
return payload as VlessNode[]
}
if (payload && Array.isArray((payload as { nodes?: unknown }).nodes)) {
return (payload as { nodes: VlessNode[] }).nodes
}
return []
}
// 在 UI 中显示错误
{error && (
<div className="rounded-xl border border-[color:var(--color-danger-border)] bg-[var(--color-danger-muted)]/30 px-4 py-3 text-sm text-[var(--color-danger-foreground)]">
{language === 'zh'
? `节点列表加载失败:${error.message}`
: `Failed to load agent nodes: ${error.message}`}
</div>
)}
```
**状态**: ✅ 已完成
## 验证方法
### 1. 验证 Cloud Run 部署 ⚠️ **关键验证**
```bash
# 测试 agent API 端点
curl -s -H "Authorization: Bearer uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I=" \
"https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users"
# 预期结果: JSON 响应包含用户列表
# {
# "clients": [...],
# "total": N,
# "generated_at": "2026-02-05T07:30:00Z"
# }
# 如果仍返回 404说明部署未成功
```
### 2. 验证 Agent 同步
```bash
# 在 agent 节点上查看日志
ssh root@hk-xhttp.svc.plus
journalctl -u agent-svc-plus -f
# 预期看到:
# - "xray config synced successfully"
# - 没有 404 错误
```
### 3. 验证 UUID 变更
```bash
# 查询新 UUID
docker exec postgresql-svc-plus psql -U postgres -d account -c "
SELECT uuid, username, email
FROM users
WHERE email = 'tester123@example.com';
"
# 预期结果:
# uuid | username | email
# --------------------------------------+-----------+-----------------------
# 18d270a9-533d-4b13-b3f1-e7f55540a9b2 | tester123 | tester123@example.com
```
### 4. 验证关联数据
```bash
# 检查订阅是否正确关联
docker exec postgresql-svc-plus psql -U postgres -d account -c "
SELECT user_uuid, external_id, status
FROM subscriptions
WHERE user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
"
```
### 5. 验证前端显示
```bash
# 访问 https://www.svc.plus/panel/agent
# 确认页面能够正常加载
# 如果有 401 错误,检查认证 token 传递
```
## 部署步骤
### 步骤 1: 部署 accounts-svc-plus 到 Cloud Run
```bash
# 1. 进入项目目录
cd /Users/shenlan/workspaces/cloud-neutral-toolkit/accounts.svc.plus
# 2. 确认代码已提交
git status
git add .
git commit -m "feat: add agent API routes for /api/agent-server/v1"
git push
# 3. 设置环境变量
export GCP_PROJECT=xzerolab-480008
export GCP_REGION=asia-northeast1
# 4. 构建镜像(如果使用 Makefile
make cloudrun-build
# 5. 更新 service.yaml 以使用 Secret Manager
# 确保 service.yaml 中 INTERNAL_SERVICE_TOKEN 使用 valueFrom: secretKeyRef 配置
# 6. 部署服务
make cloudrun-deploy
# 或者使用 gcloud 命令
gcloud run deploy accounts-svc-plus \
--source . \
--project=$GCP_PROJECT \
--region=$GCP_REGION \
--platform=managed \
--allow-unauthenticated
# 6. 等待部署完成
# 预期输出: Service [accounts-svc-plus] revision [accounts-svc-plus-xxxxx] has been deployed
```
### 步骤 2: 验证部署
```bash
# 测试 API 端点
curl -s -H "Authorization: Bearer uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I=" \
"https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users"
# 应该返回 JSON 而不是 404
```
### 步骤 3: 监控 Agent 日志
```bash
# 在 agent 节点上监控日志
ssh root@hk-xhttp.svc.plus
journalctl -u agent-svc-plus -f
# 等待下一次同步周期5分钟
# 确认没有 404 错误
```
## 回滚计划
### 如果 Cloud Run 部署导致问题
```bash
# 1. 查看之前的版本
gcloud run revisions list \
--service=accounts-svc-plus \
--project=xzerolab-480008 \
--region=asia-northeast1
# 2. 回滚到之前的版本
gcloud run services update-traffic accounts-svc-plus \
--to-revisions=PREVIOUS_REVISION=100 \
--project=xzerolab-480008 \
--region=asia-northeast1
```
### 如果 UUID 变更导致问题
```sql
-- 反向操作(需要提前备份数据)
BEGIN;
-- 重命名当前用户
UPDATE users
SET username = username || '_new',
email = email || '_new'
WHERE uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
-- 恢复旧 UUID
INSERT INTO users (uuid, username, password, email, ...)
SELECT '4b66928e-a81e-4981-bae0-289ddb92439c',
REPLACE(username, '_new', ''),
...
FROM users
WHERE uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
-- 更新外键
UPDATE identities SET user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c'
WHERE user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
UPDATE sessions SET user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c'
WHERE user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
UPDATE subscriptions SET user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c'
WHERE user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
-- 删除新记录
DELETE FROM users WHERE uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
COMMIT;
```
## 已知问题
### 1. `/api/agent/nodes` 返回 401 错误
- **现象**: 前端访问 `/api/agent/nodes` 时收到 401 Unauthorized
- **原因**: 认证 token 未正确传递到该端点
- **影响**: 用户无法查看节点列表
- **状态**: 待修复
- **临时方案**: 直接访问后端 API 或使用 admin 账户
### 2. Agent API 路由未部署到生产环境 ⚠️ **阻塞问题**
- **现象**: Cloud Run 服务返回 404
- **原因**: 生产环境运行旧版本代码
- **影响**: Agent 无法同步配置
- **状态**: **需要立即部署**
- **修复**: 执行 `make cloudrun-deploy`
## 相关文档
- [Agent 架构文档](../docs/agent-architecture.md)
- [数据库 Schema](../sql/schema.sql)
- [API 路由配置](../api/api.go)
- [Cloud Run 部署文档](../deploy/gcp/cloud-run/README.md)
## 附录
### Agent 配置示例
**文件**: `/etc/agent/account-agent.yaml`
```yaml
mode: "agent"
log:
level: info
agent:
id: "hk-proxy-server"
controllerUrl: "https://accounts-svc-plus-266500572462.asia-northeast1.run.app"
apiToken: "uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I="
httpTimeout: 15s
statusInterval: 1m
syncInterval: 5m
tls:
insecureSkipVerify: false
xray:
sync:
enabled: true
interval: 5m
targets:
- name: "xhttp"
outputPath: "/usr/local/etc/xray/config.json"
templatePath: "/usr/local/etc/xray/templates/xray.xhttp.template.json"
restartCommand:
- "systemctl"
- "restart"
- "xray.service"
- name: "tcp"
outputPath: "/usr/local/etc/xray/tcp-config.json"
templatePath: "/usr/local/etc/xray/templates/xray.tcp.template.json"
restartCommand:
- "systemctl"
- "restart"
- "xray-tcp.service"
```
### 数据库连接信息
```bash
# SSH 连接
ssh -i ~/.ssh/id_rsa root@postgresql.svc.plus
# Docker 容器名称
postgresql-svc-plus
# 数据库名称
account
# 用户名
postgres
# 密码
见 .env 文件
```
### 相关服务
- **accounts-svc-plus**: Cloud Run 服务,处理认证和用户管理
- URL: `https://accounts-svc-plus-266500572462.asia-northeast1.run.app`
- 域名: `https://accounts.svc.plus`
- **console.svc.plus**: 前端控制台
- URL: `https://www.svc.plus`
- **agent.svc.plus**: Agent 服务节点
- 节点: `hk-xhttp.svc.plus`, `jp-xhttp.svc.plus`, `us-xhttp.svc.plus`
### 监控和日志
```bash
# 查看 Cloud Run 日志
gcloud run services logs read accounts-svc-plus \
--project=xzerolab-480008 \
--region=asia-northeast1 \
--limit=100
# 查看 Agent 日志
ssh root@hk-xhttp.svc.plus "journalctl -u agent-svc-plus -n 100 --no-pager"
# 查看数据库日志
ssh -i ~/.ssh/id_rsa root@postgresql.svc.plus \
"docker logs postgresql-svc-plus --tail=100"
```
### 关键 API 端点
```bash
# Agent API 端点(需要 Bearer token
GET /api/agent-server/v1/users # 获取用户列表
POST /api/agent-server/v1/status # 上报 agent 状态
# 用户 API 端点(需要用户认证)
GET /api/agent/nodes # 获取 agent 节点列表
# 认证端点
GET /api/auth/session # 获取当前会话
POST /api/auth/login # 用户登录
```
### 故障排查清单
- [ ] 检查 Cloud Run 服务是否运行最新版本
- [ ] 验证 agent API 端点返回 200 而不是 404
- [ ] 检查 agent 配置文件中的 controller URL 和 token
- [ ] 查看 agent 日志确认没有 404 错误
- [ ] 验证数据库中的 UUID 已正确更新
- [ ] 检查所有外键引用是否指向新 UUID
- [ ] 测试前端页面是否能正常加载
- [ ] 监控 Cloud Run 日志确认没有新错误

View File

@ -0,0 +1,254 @@
# Cloud Run Stunnel Sidecar 启动失败导致服务无法启动
**类型**: 故障排查
**严重级别**: P1 (Critical)
**最后更新**: 2026-01-28
**负责人**: SRE Team
---
## 📋 问题描述
Cloud Run 部署 `accounts-svc-plus` 服务时,容器启动失败并报错:
```
ERROR: (gcloud.run.services.update) The user-provided container failed to start
and listen on the port defined provided by the PORT=8080 environment variable
within the allocated timeout.
```
**错误日志链接示例**:
```
https://console.cloud.google.com/logs/viewer?project=xzerolab-480008&resource=cloud_run_revision/service_name/accounts-svc-plus/revision_name/accounts-svc-plus-00049-gjv
```
---
## 🎯 影响范围
- **服务**: `accounts-svc-plus` (账号服务)
- **影响功能**:
- 用户登录/注册
- 账号管理 API
- 所有依赖账号服务的下游系统
- **影响用户**: 全部用户
- **持续时间**: 直到修复完成
---
## 🔍 根因分析
### 架构背景
该服务使用 **Sidecar 模式** 部署:
- **主容器**: `accounts-api` (Go 应用)
- **Sidecar 容器**: `stunnel-sidecar` (TLS 隧道,用于连接远程 PostgreSQL)
### 问题链路
1. **Stunnel 配置问题**:
- `stunnel.conf` 配置中指定 PID 文件路径为 `/var/run/stunnel/stunnel-account-db-client.pid`
- Sidecar 容器 (`dweomer/stunnel`) 中该目录不存在或无写权限
- Stunnel 进程启动失败
2. **主容器启动依赖**:
- `entrypoint.sh` 脚本检测 `DB_HOST:DB_PORT` (127.0.0.1:15432) 是否可达
- Stunnel 未启动 → 15432 端口未监听
- 主应用尝试连接数据库失败 → 进程退出
3. **Cloud Run 健康检查**:
- `startupProbe` 检测 8080 端口 TCP 连接
- 主应用未启动 → 健康检查失败
- Cloud Run 判定容器启动失败
### 配置文件位置
- **Stunnel 配置**: `deploy/gcp/cloud-run/stunnel.conf`
- **Secret 管理**: Google Secret Manager (`stunnel-config`)
- **Service YAML**: `deploy/gcp/cloud-run/service.yaml`
---
## 🛠️ 诊断步骤
### 1. 查看 Cloud Run 日志
```bash
gcloud logging read "resource.type=cloud_run_revision \
AND resource.labels.service_name=accounts-svc-plus" \
--limit 50 --format json --project xzerolab-480008
```
**关键错误信息**:
- `stunnel: Cannot create pid file`
- `nc: connect to 127.0.0.1 port 15432 (tcp) failed: Connection refused`
- `stunnel not ready after 30s`
### 2. 检查 Stunnel 配置
```bash
# 查看当前 Secret 版本
gcloud secrets versions list stunnel-config --project xzerolab-480008
# 查看配置内容
gcloud secrets versions access latest --secret=stunnel-config \
--project xzerolab-480008
```
### 3. 本地复现(可选)
```bash
# 拉取 Sidecar 镜像
docker pull dweomer/stunnel
# 测试配置
docker run --rm -v $(pwd)/deploy/gcp/cloud-run/stunnel.conf:/etc/stunnel/stunnel.conf \
dweomer/stunnel stunnel /etc/stunnel/stunnel.conf
```
---
## ✅ 修复方案
### 步骤 1: 修改 Stunnel 配置
编辑 `deploy/gcp/cloud-run/stunnel.conf`:
```diff
; Stunnel configuration for Cloud Run (client mode)
-pid = /var/run/stunnel/stunnel-account-db-client.pid
-output = /var/run/stunnel/stunnel-account-db-client.log
+pid = /tmp/stunnel.pid
+# output = /dev/stdout
foreground = yes
```
**修改说明**:
- `/tmp` 目录在所有容器中都可写
- 注释掉 `output` 配置,默认输出到 stdout/stderrCloud Run 会自动收集)
- `foreground = yes` 确保进程不会后台运行Cloud Run 要求)
### 步骤 2: 更新 Secret
```bash
cd /path/to/accounts.svc.plus
# 更新 Secret会创建新版本
make cloudrun-stunnel GCP_PROJECT=xzerolab-480008
# 或手动执行
gcloud secrets versions add stunnel-config \
--data-file deploy/gcp/cloud-run/stunnel.conf \
--project xzerolab-480008
```
### 步骤 3: 重新部署服务
```bash
# 触发新部署(会拉取最新 Secret 版本)
make cloudrun-deploy GCP_PROJECT=xzerolab-480008
# 或手动执行
gcloud run services replace deploy/gcp/cloud-run/service.yaml \
--region asia-northeast1 \
--project xzerolab-480008
```
---
## 🧪 验证方法
### 1. 检查部署状态
```bash
gcloud run services describe accounts-svc-plus \
--region asia-northeast1 \
--project xzerolab-480008 \
--format="value(status.conditions)"
```
**预期输出**: `Ready: True`
### 2. 测试健康检查
```bash
SERVICE_URL=$(gcloud run services describe accounts-svc-plus \
--region asia-northeast1 \
--project xzerolab-480008 \
--format="value(status.url)")
curl -f "${SERVICE_URL}/healthz"
```
**预期输出**: `{"status":"ok"}`
### 3. 测试登录 API
```bash
curl -X POST "${SERVICE_URL}/api/auth/login" \
-H "Content-Type: application/json" \
-d '{"email":"test@example.com","password":"test123"}'
```
**预期输出**: 返回错误信息(如 `user_not_found`),而非连接超时或 500 错误
### 4. 查看实时日志
```bash
gcloud run services logs read accounts-svc-plus \
--region asia-northeast1 \
--project xzerolab-480008 \
--limit 20
```
**关键成功信息**:
- `Service [postgres-client] accepted connection from 127.0.0.1`
- `s_connect: connected <DB_IP>:443`
- `configured cors`
- `starting account service`
---
## 🔄 回滚计划
如果修复失败,执行以下回滚步骤:
### 方案 A: 回滚到上一个稳定版本
```bash
# 查看历史 Revision
gcloud run revisions list --service accounts-svc-plus \
--region asia-northeast1 \
--project xzerolab-480008
# 回滚到指定版本(替换 REVISION_NAME
gcloud run services update-traffic accounts-svc-plus \
--to-revisions REVISION_NAME=100 \
--region asia-northeast1 \
--project xzerolab-480008
```
### 方案 B: 临时禁用 Stunnel仅测试环境
修改 `deploy/gcp/cloud-run/service.yaml`,移除 `stunnel-sidecar` 容器,并将数据库连接改为直连(需配置 Cloud SQL Proxy 或公网访问)。
---
## 📚 相关文档
- [Cloud Run Troubleshooting Guide](https://cloud.google.com/run/docs/troubleshooting)
- [Stunnel Documentation](https://www.stunnel.org/docs.html)
- [Cloud Run Sidecar Pattern](https://cloud.google.com/run/docs/deploying#sidecars)
- [Google Secret Manager](https://cloud.google.com/secret-manager/docs)
---
## 📝 经验总结
### 预防措施
1. **本地测试**: 在部署前使用 Docker Compose 模拟 Sidecar 环境
2. **配置验证**: 添加 CI/CD 步骤验证 `stunnel.conf` 语法
3. **监控告警**: 配置 Cloud Run 启动失败告警Alerting Policy
### 最佳实践
- Sidecar 容器的配置文件应使用容器内可写路径(如 `/tmp`
- 日志输出优先使用 stdout/stderr便于 Cloud Run 日志聚合
- 主容器启动脚本应设置合理的依赖等待超时(当前为 30s
### 改进建议
- 考虑使用 Cloud SQL Proxy 替代 Stunnel官方推荐方案
- 添加 Stunnel 健康检查端点(如 HTTP status endpoint
- 在 `entrypoint.sh` 中增加更详细的诊断日志
---
**案例编号**: CASE-2026-01-28-001
**创建时间**: 2026-01-28 23:19
**解决时间**: 2026-01-28 23:19
**总耗时**: ~20 分钟

View File

@ -15,6 +15,10 @@
- 应急处理
### 🔄 运维操作
- [Security Scrubbing Archive](./Security-Scrubbing-Archive-2026-02-06.md) - 历史敏感信息深度脱敏记录。
- [Fix Rotating UUID Sync Archive](./Fix-Rotating-UUID-Sync-Archive-2026-02-06.md) - P1 级别 Sandbox 旋转 UUID 同步故障修复记录。
- [Fix Agent 404 and UUID Change](./Fix-Agent-404-And-UUID-Change.md) - 解决代理 API 认证 404 及 UUID 变更问题。
- [Fix CloudRun Stunnel Startup Failure](./Fix-CloudRun-Stunnel-Startup-Failure.md) - 解决 CloudRun 环境下 Stunnel 启动失败问题。
- 日常维护
- 备份恢复
- 性能优化