Runbook: Moved into github-org-cloud-neutral-toolkit/docs/
Parent: 04ff61f952
Commit: df6f1dd92b
@ -1,667 +0,0 @@
# Fix Agent 404 Errors and Change a User UUID

**Date**: 2026-02-05
**Owner**: SRE Team
**Reviewer**: DevOps Lead
**Last updated**: 2026-02-05T15:28:00+08:00

## Problem Description

### 1. Agent communication 404 errors
- **Symptom**: The agent service receives 404 errors when reporting status to `accounts-svc-plus`
- **Scope**: No agent node can report heartbeats or sync configuration
- **Error log**:
```
Feb 05 07:24:23 hk-xhttp.svc.plus agent-svc-plus[107285]:
{"time":"2026-02-05T07:24:23.907002669Z","level":"ERROR","msg":"xray config sync failed",
"component":"agent-xray-sync","target":"tcp",
"err":"list clients: controller returned 404 Not Found: 404 page not found"}

POST 404 https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/status
GET 404 https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users
```
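
When scanning many journal lines for this failure, it can help to parse the JSON payload instead of grepping for text. The following is a minimal sketch: the `AgentLogEntry` shape and both function names are assumptions derived only from the sample line above, not from the agent's source.

```typescript
// Hypothetical shape, inferred from the sample log line only.
interface AgentLogEntry {
  level?: string
  msg?: string
  component?: string
  target?: string
  err?: string
}

// Extracts the JSON payload that follows the journald prefix.
// Returns null when the line has no JSON object or it fails to parse.
function parseAgentLogLine(line: string): AgentLogEntry | null {
  const start = line.indexOf('{')
  if (start < 0) return null
  try {
    const parsed: unknown = JSON.parse(line.slice(start))
    return typeof parsed === 'object' && parsed !== null ? (parsed as AgentLogEntry) : null
  } catch {
    return null
  }
}

// True when the entry is an ERROR whose message reports a controller 404.
function isController404(entry: AgentLogEntry | null): boolean {
  return (
    entry !== null &&
    entry.level === 'ERROR' &&
    typeof entry.err === 'string' &&
    entry.err.indexOf('controller returned 404') !== -1
  )
}
```

The sketch expects the whole entry on one line, as journald emits it; the sample above is wrapped only for readability.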

### 2. User UUID change request
- **User**: tester123@example.com
- **Old UUID**: `4b66928e-a81e-4981-bae0-289ddb92439c`
- **New UUID**: `18d270a9-533d-4b13-b3f1-e7f55540a9b2`
- **Reason**: Business requirement; the user ID must be changed to a specified value

### 3. Agent node data not displayed
- **Symptom**: The `/panel/agent` page stays on "Loading control center..."
- **Impact**: Users cannot see the status of running nodes

## Root Cause Analysis

### Root cause of the Agent 404 errors

1. **The code is already implemented correctly**:
   - Lines 1061-1070 of `accounts.svc.plus/cmd/accountsvc/main.go` register the `/api/agent-server/v1/*` routes
   - These include `GET /api/agent-server/v1/users` and `POST /api/agent-server/v1/status`

2. **Production is not running the latest code**:
   - The Cloud Run service `accounts-svc-plus` is running an older revision
   - The older revision does not contain the agent API routes
   - Confirmed by test: `curl https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users` returns `404 page not found`

3. **The agent configuration is correct**:
   - Agent config file: `/etc/agent/account-agent.yaml`
   - Controller URL: `https://accounts-svc-plus-266500572462.asia-northeast1.run.app`
   - API token: configured correctly (matches `INTERNAL_SERVICE_TOKEN`)

## Diagnosis Steps

### 1. Check the agent logs
```bash
# View the logs on the agent node
ssh root@hk-xhttp.svc.plus
journalctl -u agent-svc-plus -n 50 --no-pager

# Error found:
# "err":"list clients: controller returned 404 Not Found: 404 page not found"
```

### 2. Check the agent configuration
```bash
# View the agent configuration
ssh root@hk-xhttp.svc.plus "cat /etc/agent/account-agent.yaml"

# Confirm that the controller URL and token are configured correctly
# controllerUrl: "https://accounts-svc-plus-266500572462.asia-northeast1.run.app"
# apiToken: "uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I="
```

### 3. Test the API endpoint
```bash
# Test the /api/agent-server/v1/users endpoint
curl -s -H "Authorization: Bearer uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I=" \
  "https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users"

# Returns: 404 page not found
# Confirms that the route is missing in production
```
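
The status code of this probe already narrows down the cause: a 404 with a valid token points at a missing route, while a 401 or 403 would point at the token. A small sketch of that reasoning follows; the verdict names and the `classifyProbe` function are illustrative assumptions, and a handler could in principle also return 404 for a missing resource.

```typescript
type ProbeVerdict = 'ok' | 'route-missing' | 'auth-rejected' | 'server-error' | 'unexpected'

// Maps the HTTP status of an agent API probe to the most likely cause.
function classifyProbe(status: number): ProbeVerdict {
  if (status >= 200 && status < 300) return 'ok'
  // The route is probably not registered in the running revision.
  if (status === 404) return 'route-missing'
  // The route exists, but the bearer token was rejected.
  if (status === 401 || status === 403) return 'auth-rejected'
  if (status >= 500) return 'server-error'
  return 'unexpected'
}
```

In this incident the probe returned 404 although the token matched `INTERNAL_SERVICE_TOKEN`, which is why the investigation moved on to the deployed revision rather than the agent configuration.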

### 4. Check the code
```bash
# Check the route registration code
grep -n "registerAgentAPIRoutes" accounts.svc.plus/cmd/accountsvc/main.go

# Line 852: registerAgentAPIRoutes(r, agentRegistry, gormSource, logger)
# Line 1061: func registerAgentAPIRoutes(...)
# Confirms that the routes are implemented in the code
```

### 5. Check database constraints (UUID change)
```bash
# Connect to the PostgreSQL host
ssh -i ~/.ssh/id_rsa root@postgresql.svc.plus

# List the foreign keys that reference public.users
docker exec postgresql-svc-plus psql -U postgres -d account -c "
SELECT conname, conrelid::regclass
FROM pg_constraint
WHERE confrelid = 'public.users'::regclass;
"

# Result:
# - identities_user_uuid_fkey
# - sessions_user_uuid_fkey
# - subscriptions_user_uuid_fkey
```

## Fixes

### Fix 1: Deploy the latest code to Cloud Run ⚠️ **Critical fix**

**Problem**: The production Cloud Run service runs an older revision that lacks the agent API routes

**Solution**: Rebuild and redeploy the `accounts-svc-plus` service

```bash
# 1. Enter the project directory
cd /Users/shenlan/workspaces/cloud-neutral-toolkit/accounts.svc.plus

# 2. Set the GCP project
export GCP_PROJECT=xzerolab-480008

# 3. Build and push the Docker image
make cloudrun-build

# 4. Deploy to Cloud Run
make cloudrun-deploy

# Or deploy directly with gcloud
gcloud run deploy accounts-svc-plus \
  --source . \
  --project=xzerolab-480008 \
  --region=asia-northeast1 \
  --platform=managed \
  --allow-unauthenticated
```

**Expected result**:
- The Cloud Run service is updated to the latest revision
- The `/api/agent-server/v1/users` and `/api/agent-server/v1/status` endpoints are available
- Agents sync configuration successfully

### Fix 2: Add a frontend Agent Server proxy route

**File**: `console.svc.plus/src/app/api/agent-server/[...segments]/route.ts`

```typescript
export const dynamic = 'force-dynamic'

import type { NextRequest } from 'next/server'

import { createUpstreamProxyHandler } from '@lib/apiProxy'
import { getAccountServiceBaseUrl } from '@server/serviceConfig'

const AGENT_SERVER_PREFIX = '/api/agent-server'

// Builds a handler that forwards requests under the prefix to the account service.
function createHandler() {
  const upstreamBaseUrl = getAccountServiceBaseUrl()
  return createUpstreamProxyHandler({
    upstreamBaseUrl,
    upstreamPathPrefix: AGENT_SERVER_PREFIX,
  })
}

const handler = createHandler()

// Every HTTP method is delegated to the same proxy handler.
export function GET(request: NextRequest) {
  return handler(request)
}

export function POST(request: NextRequest) {
  return handler(request)
}

export function PUT(request: NextRequest) {
  return handler(request)
}

export function PATCH(request: NextRequest) {
  return handler(request)
}

export function DELETE(request: NextRequest) {
  return handler(request)
}

export function HEAD(request: NextRequest) {
  return handler(request)
}

export function OPTIONS(request: NextRequest) {
  return handler(request)
}
```

**Notes**:
- The proxy route forwards frontend `/api/agent-server/*` requests to `accounts-svc-plus`
- This route is mainly for frontend debugging; the agent service calls the Cloud Run URL directly
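
The internals of `createUpstreamProxyHandler` are not shown in this runbook. As a rough illustration of the path mapping such a catch-all proxy has to perform, here is a sketch under stated assumptions: `buildUpstreamUrl` is a hypothetical helper, and the real handler may build the URL differently.

```typescript
// Hypothetical helper: joins the upstream base URL, the fixed prefix, and the
// catch-all [...segments] values into the URL that the proxy would request.
function buildUpstreamUrl(
  baseUrl: string,
  prefix: string,
  segments: string[],
  search: string = '',
): string {
  // Drop trailing slashes so the prefix is not preceded by "//".
  const base = baseUrl.replace(/\/+$/, '')
  // Encode each segment individually so a "/" inside a segment cannot add path levels.
  const path = segments.map((segment) => encodeURIComponent(segment)).join('/')
  return `${base}${prefix}${path ? `/${path}` : ''}${search}`
}
```

For example, a frontend request to `/api/agent-server/v1/users` arrives with the segments `['v1', 'users']` and would map to `<upstream base>/api/agent-server/v1/users`.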

### Fix 3: Improve Registry persistence and logging

**File**: `accounts.svc.plus/internal/agentserver/registry.go`

**Changes**:
1. Add a `logger *slog.Logger` field to the `Registry` struct
2. Add a `SetLogger()` method
3. Log persistence errors in `RegisterAgent()` and `ReportStatus()`

**Key code**:
```go
// In ReportStatus: log a failed heartbeat write instead of dropping the error
if err := r.store.UpsertAgent(ctx, dbAgent); err != nil {
	r.logger.Error("failed to persist agent status heartbeat", "agent", a.ID, "err", err)
}

// In RegisterAgent: log a failed write of a dynamically registered agent
if err := r.store.UpsertAgent(ctx, dbAgent); err != nil {
	r.logger.Error("failed to persist dynamically registered agent", "agent", id, "err", err)
}
```

**File**: `accounts.svc.plus/cmd/accountsvc/main.go`

```go
if agentRegistry != nil {
	agentRegistry.SetStore(st)
	agentRegistry.SetLogger(logger.With("component", "agent-registry"))
	// ... remaining code
}
```

### Fix 4: Change the user UUID

**Connect to the database host**:
```bash
ssh -i ~/.ssh/id_rsa root@postgresql.svc.plus
```

**Run the SQL transaction**:
```sql
BEGIN;

-- 1. Rename the old user (avoids unique-constraint conflicts)
UPDATE users
SET username = username || '_old',
    email = email || '_old'
WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';

-- 2. Create the new user record (with the new UUID)
-- Note: REPLACE strips every occurrence of '_old', so this assumes the
-- original username and email do not themselves contain '_old'.
INSERT INTO users (
    uuid, username, password, email, role, level, groups, permissions,
    created_at, updated_at, version, origin_node, mfa_totp_secret,
    mfa_enabled, mfa_secret_issued_at, mfa_confirmed_at, email_verified_at
)
SELECT
    '18d270a9-533d-4b13-b3f1-e7f55540a9b2',
    REPLACE(username, '_old', ''),
    password,
    REPLACE(email, '_old', ''),
    role, level, groups, permissions,
    created_at, updated_at, version, origin_node, mfa_totp_secret,
    mfa_enabled, mfa_secret_issued_at, mfa_confirmed_at, email_verified_at
FROM users
WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';

-- 3. Update all foreign-key references
UPDATE identities
SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2'
WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';

UPDATE sessions
SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2'
WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';

UPDATE subscriptions
SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2'
WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';

-- 4. Delete the old user record
DELETE FROM users
WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';

COMMIT;
```

**Command to run**:
```bash
docker exec postgresql-svc-plus psql -U postgres -d account -c "
BEGIN;
UPDATE users SET username = username || '_old', email = email || '_old' WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
INSERT INTO users (uuid, username, password, email, role, level, groups, permissions, created_at, updated_at, version, origin_node, mfa_totp_secret, mfa_enabled, mfa_secret_issued_at, mfa_confirmed_at, email_verified_at)
SELECT '18d270a9-533d-4b13-b3f1-e7f55540a9b2', REPLACE(username, '_old', ''), password, REPLACE(email, '_old', ''), role, level, groups, permissions, created_at, updated_at, version, origin_node, mfa_totp_secret, mfa_enabled, mfa_secret_issued_at, mfa_confirmed_at, email_verified_at
FROM users WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
UPDATE identities SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2' WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
UPDATE sessions SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2' WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
UPDATE subscriptions SET user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2' WHERE user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
DELETE FROM users WHERE uuid = '4b66928e-a81e-4981-bae0-289ddb92439c';
COMMIT;
"
```

**Status**: ✅ Done
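
The three foreign-key updates in the transaction differ only in the table name, so they can be generated from the constraint query result rather than typed by hand. The sketch below is one possible way to do that; `buildFkUpdates` is a hypothetical helper, and it assumes every referencing table uses a column named `user_uuid`, as the three tables here do.

```typescript
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i
const IDENTIFIER_PATTERN = /^[a-z_][a-z0-9_]*$/

// Generates one UPDATE statement per referencing table.
// Inputs are validated because the values are interpolated into SQL text.
function buildFkUpdates(tables: string[], oldUuid: string, newUuid: string): string[] {
  if (!UUID_PATTERN.test(oldUuid) || !UUID_PATTERN.test(newUuid)) {
    throw new Error('both UUIDs must be in canonical 8-4-4-4-12 form')
  }
  return tables.map((table) => {
    if (!IDENTIFIER_PATTERN.test(table)) {
      throw new Error(`unsafe table name: ${table}`)
    }
    return `UPDATE ${table} SET user_uuid = '${newUuid}' WHERE user_uuid = '${oldUuid}';`
  })
}
```

Calling it with `['identities', 'sessions', 'subscriptions']` and the two UUIDs from this runbook yields the three statements of step 3; swapping the UUID arguments yields the statements needed for a rollback.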

### Fix 5: Improve frontend error handling

**File**: `console.svc.plus/src/modules/extensions/builtin/user-center/routes/agent.tsx`

**Changes**:
1. Improve error handling in the `fetcher` function
2. Display the error message in the UI

```typescript
async function fetcher(url: string): Promise<VlessNode[]> {
  const res = await fetch(url, { credentials: 'include', cache: 'no-store' })

  // A non-JSON body (for example a plain-text 404) becomes null instead of throwing.
  const payload = await res.json().catch(() => null)
  if (!res.ok) {
    const message =
      (payload && typeof payload.message === 'string' && payload.message) ||
      (payload && typeof payload.error === 'string' && payload.error) ||
      `Request failed (${res.status})`
    throw new Error(message)
  }

  // Accept either a bare array or an object with a "nodes" array.
  if (Array.isArray(payload)) {
    return payload as VlessNode[]
  }
  if (payload && Array.isArray((payload as { nodes?: unknown }).nodes)) {
    return (payload as { nodes: VlessNode[] }).nodes
  }

  return []
}

// Show the error in the UI
{error && (
  <div className="rounded-xl border border-[color:var(--color-danger-border)] bg-[var(--color-danger-muted)]/30 px-4 py-3 text-sm text-[var(--color-danger-foreground)]">
    {language === 'zh'
      ? `节点列表加载失败:${error.message}`
      : `Failed to load agent nodes: ${error.message}`}
  </div>
)}
```

**Status**: ✅ Done

## Verification

### 1. Verify the Cloud Run deployment ⚠️ **Critical check**

```bash
# Test the agent API endpoint
curl -s -H "Authorization: Bearer uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I=" \
  "https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users"

# Expected result: a JSON response containing the user list
# {
#   "clients": [...],
#   "total": N,
#   "generated_at": "2026-02-05T07:30:00Z"
# }

# If it still returns 404, the deployment did not succeed
```
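
A scripted check can assert the response shape instead of relying on a visual comparison. The type guard below is a minimal sketch: the three field names come from the expected output above, while the element type of `clients` is not documented here and is therefore left as `unknown`.

```typescript
interface ClientListResponse {
  clients: unknown[]
  total: number
  generated_at: string
}

// Returns true only when the value has the three documented fields with the expected types.
function isClientListResponse(value: unknown): value is ClientListResponse {
  if (typeof value !== 'object' || value === null) return false
  const record = value as Record<string, unknown>
  return (
    Array.isArray(record['clients']) &&
    typeof record['total'] === 'number' &&
    typeof record['generated_at'] === 'string'
  )
}
```

A plain-text `404 page not found` body fails `JSON.parse` before it reaches the guard, so a verification script would need to treat the parse error as a failed deployment as well.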

### 2. Verify agent sync

```bash
# Follow the logs on the agent node
ssh root@hk-xhttp.svc.plus
journalctl -u agent-svc-plus -f

# Expected:
# - "xray config synced successfully"
# - no 404 errors
```

### 3. Verify the UUID change

```bash
# Query the new UUID
docker exec postgresql-svc-plus psql -U postgres -d account -c "
SELECT uuid, username, email
FROM users
WHERE email = 'tester123@example.com';
"

# Expected result:
# uuid                                 | username  | email
# --------------------------------------+-----------+-----------------------
# 18d270a9-533d-4b13-b3f1-e7f55540a9b2 | tester123 | tester123@example.com
```

### 4. Verify related data

```bash
# Check that the subscriptions are linked to the new UUID
docker exec postgresql-svc-plus psql -U postgres -d account -c "
SELECT user_uuid, external_id, status
FROM subscriptions
WHERE user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';
"
```

### 5. Verify the frontend

```bash
# Open https://www.svc.plus/panel/agent
# Confirm that the page loads normally
# If a 401 error appears, check how the auth token is passed
```

## Deployment Steps

### Step 1: Deploy accounts-svc-plus to Cloud Run

```bash
# 1. Enter the project directory
cd /Users/shenlan/workspaces/cloud-neutral-toolkit/accounts.svc.plus

# 2. Make sure the code is committed
git status
git add .
git commit -m "feat: add agent API routes for /api/agent-server/v1"
git push

# 3. Set environment variables
export GCP_PROJECT=xzerolab-480008
export GCP_REGION=asia-northeast1

# 4. Build the image (if using the Makefile)
make cloudrun-build

# 5. Update service.yaml to use Secret Manager
# Make sure INTERNAL_SERVICE_TOKEN in service.yaml is set via valueFrom: secretKeyRef

# 6. Deploy the service
make cloudrun-deploy

# Or use gcloud
gcloud run deploy accounts-svc-plus \
  --source . \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --platform=managed \
  --allow-unauthenticated

# 7. Wait for the deployment to finish
# Expected output: Service [accounts-svc-plus] revision [accounts-svc-plus-xxxxx] has been deployed
```

### Step 2: Verify the deployment

```bash
# Test the API endpoint
curl -s -H "Authorization: Bearer uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I=" \
  "https://accounts-svc-plus-266500572462.asia-northeast1.run.app/api/agent-server/v1/users"

# Should return JSON instead of 404
```

### Step 3: Monitor the agent logs

```bash
# Follow the logs on the agent node
ssh root@hk-xhttp.svc.plus
journalctl -u agent-svc-plus -f

# Wait for the next sync cycle (5 minutes)
# Confirm that there are no 404 errors
```

## Rollback Plan

### If the Cloud Run deployment causes problems

```bash
# 1. List earlier revisions
gcloud run revisions list \
  --service=accounts-svc-plus \
  --project=xzerolab-480008 \
  --region=asia-northeast1

# 2. Route all traffic back to an earlier revision (replace PREVIOUS_REVISION)
gcloud run services update-traffic accounts-svc-plus \
  --to-revisions=PREVIOUS_REVISION=100 \
  --project=xzerolab-480008 \
  --region=asia-northeast1
```
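
Choosing the value for `PREVIOUS_REVISION` can be scripted from the revision list. The sketch below assumes revision names follow the auto-generated `<service>-<sequence>-<suffix>` pattern (for example `accounts-svc-plus-00049-gjv`); both function names are illustrative, and the sketch does not check whether the chosen revision was healthy.

```typescript
// Extracts the numeric sequence from a name such as "accounts-svc-plus-00049-gjv".
// Returns null when the name does not follow the assumed pattern.
function revisionSequence(name: string): number | null {
  const match = /-(\d+)-[a-z0-9]+$/.exec(name)
  return match ? Number(match[1]) : null
}

// Picks the revision with the highest sequence number below the current one.
function pickPreviousRevision(names: string[], current: string): string | null {
  const currentSeq = revisionSequence(current)
  if (currentSeq === null) return null
  let best: { name: string; seq: number } | null = null
  for (const name of names) {
    const seq = revisionSequence(name)
    if (seq === null || seq >= currentSeq) continue
    if (best === null || seq > best.seq) best = { name, seq }
  }
  return best ? best.name : null
}
```

Before shifting traffic, it is still worth confirming in the revision list that the selected revision was previously serving without errors.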

### If the UUID change causes problems

```sql
-- Reverse operation (requires a data backup taken beforehand)
BEGIN;

-- Rename the current user
UPDATE users
SET username = username || '_new',
    email = email || '_new'
WHERE uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';

-- Restore the old UUID
INSERT INTO users (uuid, username, password, email, ...)
SELECT '4b66928e-a81e-4981-bae0-289ddb92439c',
       REPLACE(username, '_new', ''),
       ...
FROM users
WHERE uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';

-- Update the foreign keys
UPDATE identities SET user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c'
WHERE user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';

UPDATE sessions SET user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c'
WHERE user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';

UPDATE subscriptions SET user_uuid = '4b66928e-a81e-4981-bae0-289ddb92439c'
WHERE user_uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';

-- Delete the new record
DELETE FROM users WHERE uuid = '18d270a9-533d-4b13-b3f1-e7f55540a9b2';

COMMIT;
```

## Known Issues

### 1. `/api/agent/nodes` returns 401
- **Symptom**: The frontend receives 401 Unauthorized when calling `/api/agent/nodes`
- **Cause**: The auth token is not passed correctly to this endpoint
- **Impact**: Users cannot view the node list
- **Status**: Not yet fixed
- **Workaround**: Call the backend API directly or use an admin account

### 2. Agent API routes not deployed to production ⚠️ **Blocking issue**
- **Symptom**: The Cloud Run service returns 404
- **Cause**: Production runs an older revision
- **Impact**: Agents cannot sync configuration
- **Status**: **Must be deployed immediately**
- **Fix**: Run `make cloudrun-deploy`

## Related Documents

- [Agent architecture](../docs/agent-architecture.md)
- [Database schema](../sql/schema.sql)
- [API route configuration](../api/api.go)
- [Cloud Run deployment guide](../deploy/gcp/cloud-run/README.md)

## Appendix

### Example agent configuration

**File**: `/etc/agent/account-agent.yaml`

```yaml
mode: "agent"

log:
  level: info

agent:
  id: "hk-proxy-server"
  controllerUrl: "https://accounts-svc-plus-266500572462.asia-northeast1.run.app"
  apiToken: "uTvryFvAbz6M5sRtmTaSTQY6otLZ95hneBsWqXu+35I="
  httpTimeout: 15s
  statusInterval: 1m
  syncInterval: 5m
  tls:
    insecureSkipVerify: false

xray:
  sync:
    enabled: true
    interval: 5m
    targets:
      - name: "xhttp"
        outputPath: "/usr/local/etc/xray/config.json"
        templatePath: "/usr/local/etc/xray/templates/xray.xhttp.template.json"
        restartCommand:
          - "systemctl"
          - "restart"
          - "xray.service"
      - name: "tcp"
        outputPath: "/usr/local/etc/xray/tcp-config.json"
        templatePath: "/usr/local/etc/xray/templates/xray.tcp.template.json"
        restartCommand:
          - "systemctl"
          - "restart"
          - "xray-tcp.service"
```

### Database connection details

```bash
# SSH connection
ssh -i ~/.ssh/id_rsa root@postgresql.svc.plus

# Docker container name
postgresql-svc-plus

# Database name
account

# Username
postgres

# Password
See the .env file
```

### Related services

- **accounts-svc-plus**: Cloud Run service that handles authentication and user management
  - URL: `https://accounts-svc-plus-266500572462.asia-northeast1.run.app`
  - Domain: `https://accounts.svc.plus`
- **console.svc.plus**: Frontend console
  - URL: `https://www.svc.plus`
- **agent.svc.plus**: Agent service nodes
  - Nodes: `hk-xhttp.svc.plus`, `jp-xhttp.svc.plus`, `us-xhttp.svc.plus`

### Monitoring and logs

```bash
# View Cloud Run logs
gcloud run services logs read accounts-svc-plus \
  --project=xzerolab-480008 \
  --region=asia-northeast1 \
  --limit=100

# View agent logs
ssh root@hk-xhttp.svc.plus "journalctl -u agent-svc-plus -n 100 --no-pager"

# View database logs
ssh -i ~/.ssh/id_rsa root@postgresql.svc.plus \
  "docker logs postgresql-svc-plus --tail=100"
```

### Key API endpoints

```bash
# Agent API endpoints (require a Bearer token)
GET  /api/agent-server/v1/users    # List users
POST /api/agent-server/v1/status   # Report agent status

# User API endpoints (require user authentication)
GET  /api/agent/nodes              # List agent nodes

# Auth endpoints
GET  /api/auth/session             # Get the current session
POST /api/auth/login               # User login
```

### Troubleshooting checklist

- [ ] Check that the Cloud Run service is running the latest revision
- [ ] Verify that the agent API endpoints return 200 instead of 404
- [ ] Check the controller URL and token in the agent config file
- [ ] Review the agent logs and confirm there are no 404 errors
- [ ] Verify that the UUID has been updated correctly in the database
- [ ] Check that all foreign-key references point to the new UUID
- [ ] Test that the frontend page loads normally
- [ ] Monitor the Cloud Run logs and confirm there are no new errors
@ -1,254 +0,0 @@
# Cloud Run Service Fails to Start Because the Stunnel Sidecar Fails

**Type**: Troubleshooting
**Severity**: P1 (Critical)
**Last updated**: 2026-01-28
**Owner**: SRE Team

---

## 📋 Problem Description

When deploying the `accounts-svc-plus` service to Cloud Run, the container fails to start with this error:

```
ERROR: (gcloud.run.services.update) The user-provided container failed to start
and listen on the port defined provided by the PORT=8080 environment variable
within the allocated timeout.
```

**Example error log link**:
```
https://console.cloud.google.com/logs/viewer?project=xzerolab-480008&resource=cloud_run_revision/service_name/accounts-svc-plus/revision_name/accounts-svc-plus-00049-gjv
```

---

## 🎯 Impact

- **Service**: `accounts-svc-plus` (account service)
- **Affected functionality**:
  - User login and registration
  - Account management API
  - All downstream systems that depend on the account service
- **Affected users**: All users
- **Duration**: Until the fix is complete

---

## 🔍 Root Cause Analysis

### Architecture background
The service is deployed with the **sidecar pattern**:
- **Main container**: `accounts-api` (Go application)
- **Sidecar container**: `stunnel-sidecar` (TLS tunnel used to reach the remote PostgreSQL)

### Failure chain
1. **Stunnel configuration problem**:
   - `stunnel.conf` sets the PID file path to `/var/run/stunnel/stunnel-account-db-client.pid`
   - In the sidecar container (`dweomer/stunnel`) that directory does not exist or is not writable
   - The Stunnel process fails to start

2. **Main container startup dependency**:
   - The `entrypoint.sh` script checks whether `DB_HOST:DB_PORT` (127.0.0.1:15432) is reachable
   - Stunnel is not running → nothing listens on port 15432
   - The main application fails to connect to the database → the process exits

3. **Cloud Run health check**:
   - `startupProbe` checks for a TCP connection on port 8080
   - The main application is not running → the health check fails
   - Cloud Run marks the container start as failed

### Configuration file locations
- **Stunnel config**: `deploy/gcp/cloud-run/stunnel.conf`
- **Secret management**: Google Secret Manager (`stunnel-config`)
- **Service YAML**: `deploy/gcp/cloud-run/service.yaml`

---

## 🛠️ Diagnosis Steps

### 1. View the Cloud Run logs
```bash
gcloud logging read "resource.type=cloud_run_revision \
  AND resource.labels.service_name=accounts-svc-plus" \
  --limit 50 --format json --project xzerolab-480008
```

**Key error messages**:
- `stunnel: Cannot create pid file`
- `nc: connect to 127.0.0.1 port 15432 (tcp) failed: Connection refused`
- `stunnel not ready after 30s`
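
Because these three messages correspond to successive stages of the failure chain, the earliest one present in the logs indicates where to start. A small sketch of that triage follows; the stage names and the `diagnoseStartup` function are assumptions made for illustration, and the matched substrings are taken from the messages above.

```typescript
type StartupStage = 'stunnel-pid-file' | 'tunnel-port-closed' | 'dependency-timeout' | 'unknown'

// Signatures are listed in causal order: the first entry is closest to the root cause.
const STARTUP_SIGNATURES: Array<{ needle: string; stage: StartupStage }> = [
  { needle: 'Cannot create pid file', stage: 'stunnel-pid-file' },
  { needle: 'port 15432 (tcp) failed: Connection refused', stage: 'tunnel-port-closed' },
  { needle: 'stunnel not ready after', stage: 'dependency-timeout' },
]

// Returns the earliest stage of the chain whose signature appears in the log lines.
function diagnoseStartup(logLines: string[]): StartupStage {
  for (const { needle, stage } of STARTUP_SIGNATURES) {
    if (logLines.some((line) => line.indexOf(needle) !== -1)) return stage
  }
  return 'unknown'
}
```

If only the later signatures appear, the sidecar may be failing for a different reason than the PID file, so its own log output deserves a closer look.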

### 2. Check the Stunnel configuration
```bash
# List the current Secret versions
gcloud secrets versions list stunnel-config --project xzerolab-480008

# View the configuration content
gcloud secrets versions access latest --secret=stunnel-config \
  --project xzerolab-480008
```

### 3. Reproduce locally (optional)
```bash
# Pull the sidecar image
docker pull dweomer/stunnel

# Test the configuration
docker run --rm -v "$(pwd)/deploy/gcp/cloud-run/stunnel.conf:/etc/stunnel/stunnel.conf" \
  dweomer/stunnel stunnel /etc/stunnel/stunnel.conf
```

---

## ✅ Fix

### Step 1: Change the Stunnel configuration

Edit `deploy/gcp/cloud-run/stunnel.conf`:

```diff
 ; Stunnel configuration for Cloud Run (client mode)
-pid = /var/run/stunnel/stunnel-account-db-client.pid
-output = /var/run/stunnel/stunnel-account-db-client.log
+pid = /tmp/stunnel.pid
+# output = /dev/stdout
 foreground = yes
```

**Explanation of the change**:
- The `/tmp` directory is writable in every container
- With `output` commented out, logs go to stdout/stderr by default (Cloud Run collects them automatically)
- `foreground = yes` keeps the process from daemonizing (required by Cloud Run)
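
The same three rules can be checked automatically before a new Secret version is uploaded. The following is a minimal sketch of such a check, not a full stunnel parser: `lintStunnelConf` and `parseGlobalOptions` are hypothetical names, only global options before the first `[service]` section are inspected, and the list of writable directories is an assumption passed in by the caller.

```typescript
// Parses global "key = value" options that appear before the first [service] section.
function parseGlobalOptions(conf: string): Record<string, string> {
  const options: Record<string, string> = {}
  for (const raw of conf.split('\n')) {
    const line = raw.trim()
    if (line.charAt(0) === '[') break
    // Skip blank lines and comment lines.
    if (line === '' || line.charAt(0) === ';' || line.charAt(0) === '#') continue
    const eq = line.indexOf('=')
    if (eq > 0) options[line.slice(0, eq).trim()] = line.slice(eq + 1).trim()
  }
  return options
}

// Returns a list of human-readable problems; an empty list means the checks passed.
function lintStunnelConf(conf: string, writableDirs: string[] = ['/tmp/']): string[] {
  const options = parseGlobalOptions(conf)
  const problems: string[] = []
  for (const key of ['pid', 'output']) {
    const path: string | undefined = options[key]
    if (path !== undefined && !writableDirs.some((dir) => path.indexOf(dir) === 0)) {
      problems.push(`${key} points outside a writable directory: ${path}`)
    }
  }
  if (options['foreground'] !== 'yes') problems.push('foreground should be "yes"')
  return problems
}
```

Run against the configuration before the change, the check reports both the `pid` and the `output` path; run against the changed configuration, it reports nothing.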

### Step 2: Update the Secret
```bash
cd /path/to/accounts.svc.plus

# Update the Secret (creates a new version)
make cloudrun-stunnel GCP_PROJECT=xzerolab-480008

# Or run manually
gcloud secrets versions add stunnel-config \
  --data-file deploy/gcp/cloud-run/stunnel.conf \
  --project xzerolab-480008
```

### Step 3: Redeploy the service
```bash
# Trigger a new deployment (picks up the latest Secret version)
make cloudrun-deploy GCP_PROJECT=xzerolab-480008

# Or run manually
gcloud run services replace deploy/gcp/cloud-run/service.yaml \
  --region asia-northeast1 \
  --project xzerolab-480008
```

---

## 🧪 Verification

### 1. Check the deployment status
```bash
gcloud run services describe accounts-svc-plus \
  --region asia-northeast1 \
  --project xzerolab-480008 \
  --format="value(status.conditions)"
```

**Expected output**: `Ready: True`

### 2. Test the health check
```bash
SERVICE_URL=$(gcloud run services describe accounts-svc-plus \
  --region asia-northeast1 \
  --project xzerolab-480008 \
  --format="value(status.url)")

curl -f "${SERVICE_URL}/healthz"
```

**Expected output**: `{"status":"ok"}`

### 3. Test the login API
```bash
curl -X POST "${SERVICE_URL}/api/auth/login" \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"test123"}'
```

**Expected output**: An application-level error (such as `user_not_found`) rather than a connection timeout or a 500 error

### 4. View recent logs
```bash
gcloud run services logs read accounts-svc-plus \
  --region asia-northeast1 \
  --project xzerolab-480008 \
  --limit 20
```

**Key success messages**:
- `Service [postgres-client] accepted connection from 127.0.0.1`
- `s_connect: connected <DB_IP>:443`
- `configured cors`
- `starting account service`

---

## 🔄 Rollback Plan

If the fix fails, roll back as follows:

### Option A: Roll back to the last stable revision
```bash
# List past revisions
gcloud run revisions list --service accounts-svc-plus \
  --region asia-northeast1 \
  --project xzerolab-480008

# Roll back to a specific revision (replace REVISION_NAME)
gcloud run services update-traffic accounts-svc-plus \
  --to-revisions REVISION_NAME=100 \
  --region asia-northeast1 \
  --project xzerolab-480008
```

### Option B: Temporarily disable Stunnel (test environments only)
Edit `deploy/gcp/cloud-run/service.yaml`, remove the `stunnel-sidecar` container, and switch the database connection to a direct connection (requires Cloud SQL Proxy or public network access).

---

## 📚 Related Documents

- [Cloud Run Troubleshooting Guide](https://cloud.google.com/run/docs/troubleshooting)
- [Stunnel Documentation](https://www.stunnel.org/docs.html)
- [Cloud Run Sidecar Pattern](https://cloud.google.com/run/docs/deploying#sidecars)
- [Google Secret Manager](https://cloud.google.com/secret-manager/docs)

---

## 📝 Lessons Learned

### Preventive measures
1. **Local testing**: Simulate the sidecar environment with Docker Compose before deploying
2. **Config validation**: Add a CI/CD step that validates the `stunnel.conf` syntax
3. **Monitoring and alerting**: Configure an Alerting Policy for Cloud Run startup failures

### Best practices
- Files written by a sidecar container should use paths that are writable inside the container (such as `/tmp`)
- Prefer stdout/stderr for log output so that Cloud Run can aggregate the logs
- The main container's startup script should use a reasonable timeout when waiting for dependencies (currently 30s)

### Suggested improvements
- Consider replacing Stunnel with Cloud SQL Proxy (the officially recommended approach)
- Add a Stunnel health-check endpoint (such as an HTTP status endpoint)
- Add more detailed diagnostic logging to `entrypoint.sh`

---

**Case ID**: CASE-2026-01-28-001
**Created**: 2026-01-28 23:19
**Resolved**: 2026-01-28 23:19
**Total time**: ~20 minutes
@ -1,48 +0,0 @@
# Runbook Directory

This directory contains the project's operations runbooks and troubleshooting documents.

## 📚 Document Categories

### 🔧 Deployment
- Deployment process
- Environment configuration
- Dependency management

### 🚨 Troubleshooting
- Common problems
- Error diagnosis
- Emergency response

### 🔄 Operations
- Routine maintenance
- Backup and restore
- Performance tuning

### 📊 Monitoring and Alerting
- Monitoring metrics
- Alert rules
- Log analysis

## 📝 Document Standard

Every runbook should include:

1. **Problem description**: A clear description of the symptoms
2. **Impact**: The affected functionality and users
3. **Diagnosis steps**: A detailed method for locating the problem
4. **Fix**: Concrete resolution steps
5. **Verification**: A checklist that confirms the problem is resolved
6. **Rollback plan**: The contingency plan if the fix fails
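
A review step could check mechanically that a runbook contains all six sections. The sketch below is one way to do it under stated assumptions: `missingSections` and `headingTexts` are hypothetical helpers, the required section names are supplied by the caller, and lines that start with `#` inside fenced code blocks are counted as headings too, which is a known limitation of this simple scan.

```typescript
// Collects the text of every ATX heading ("#", "##", ...) in a Markdown document.
function headingTexts(markdown: string): string[] {
  const texts: string[] = []
  for (const line of markdown.split('\n')) {
    const match = /^#{1,6}\s+(.*)$/.exec(line.trim())
    if (match) texts.push(match[1] || '')
  }
  return texts
}

// Returns the required section names that do not appear in any heading.
function missingSections(markdown: string, required: string[]): string[] {
  const headings = headingTexts(markdown)
  return required.filter((name) => !headings.some((heading) => heading.indexOf(name) !== -1))
}
```

Matching by substring lets headings carry a prefix such as an emoji or a number without failing the check.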

## 🎯 Naming Convention

- Use descriptive file names
- Format: `[Type]-[Short-Description].md`
- Examples: `Deploy-Database-Migration.md`, `Fix-API-Timeout.md`
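
The naming format can be enforced with a small check, for instance in a pre-commit hook. The regular expression below is an assumption inferred from the two example names (a capitalized type, then one or more hyphen-separated alphanumeric words, then `.md`); the convention itself does not spell out the allowed characters.

```typescript
// Assumed pattern: "Type-Word[-Word...].md", where the type starts with an uppercase letter.
const RUNBOOK_NAME_PATTERN = /^[A-Z][A-Za-z0-9]*(-[A-Za-z0-9]+)+\.md$/

// Returns true when the file name follows the assumed naming pattern.
function isValidRunbookName(fileName: string): boolean {
  return RUNBOOK_NAME_PATTERN.test(fileName)
}
```

Both example names from the convention pass this check, while a name without a type prefix, such as `notes.md`, does not.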

## 📅 Maintenance

- Update the documents regularly
- Record the date of the last update
- Name the owner and the reviewer