xworkspace-console/.github/workflows/deploy-ai-workspace-iac.yaml
Haitao Pan b2c8c5d875 ci+docs: on-host bootstrap deploy job + console serving/verification updates
- deploy-ai-workspace-iac.yaml: deploy job now ssh-es to each host and runs
  the official curl|bash bootstrap locally (host-side ansible -c local,
  offline-accelerated), instead of running all-in-one from the runner (which
  breaks on roles/agent_skills delegate_to: localhost). provision job kept as
  the batch-provision mode.
- docs/operations: record final console fix (local python static backend),
  caddy/public-access architecture, and debian13/ubuntu26.04/macOS verification.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 09:44:22 +08:00


name: Deploy AI Workspace (IaC + Ansible + Cloudflare)
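# =============================================================================
# Final deployment pipeline linking IaC ↔ Ansible dynamic inventory (matrix mode)
#
#   provision : Batch-provisioning mode (switches: terraform_action=apply / run_deploy).
#               Creates hosts from vultr-vps/envs/ai-workspace (Python + Jinja2 render
#               explicit HCL, no for_each), exports cmdb.json + inventory.ini, and
#               generates the downstream deploy matrix from them.
#   deploy    : Matrix job, one parallel leg per host: ssh to the host and run the
#               official bootstrap locally (curl|bash → ansible -c local inside the
#               host, automatically accelerated by the offline bundle). Same path as
#               a user's self-host; all-in-one is not run remotely from the runner
#               (it would hit agent_skills' delegate_to: localhost).
#   dns       : After deployment, syncs Cloudflare DNS from the inventory's
#               service_domains / IPs.
#
# Data contract: cmdb.json is produced by generate.py in ai-workspace-infra and
# runs through all three jobs.
#
# Secrets to configure under repository Settings → Secrets and variables → Actions:
#   VULTR_API_KEY         Vultr API key (→ TF_VAR_vultr_api_key)
#   INFRA_REPO_TOKEN      PAT with read access to ai-workspace-infra (required when the repo is private)
#   ANSIBLE_SSH_KEY       SSH private key paired with the public key in hosts.yaml (used to connect to hosts)
#   CLOUDFLARE_API_TOKEN  Cloudflare token with DNS edit permission
#   DEEPSEEK_API_KEY  \
#   NVIDIA_API_KEY     >  LLM provider keys (injected into the deploy targets)
#   OLLAMA_API_KEY    /
# Optional (remote TF state; S3-compatible / Vultr Object Storage):
#   TF_STATE_ENDPOINT  TF_STATE_BUCKET  TF_STATE_ACCESS_KEY  TF_STATE_SECRET_KEY  TF_STATE_REGION
# =============================================================================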
on:
  workflow_dispatch:
    inputs:
      infra_ref:
        description: "ai-workspace-infra git ref (iac_modules + playbooks)"
        required: false
        default: "main"
        type: string
      playbook:
        description: "Playbook used for deployment (relative to playbooks/)"
        required: false
        default: "setup-ai-workspace-all-in-one.yml"
        type: string
      terraform_action:
        description: "apply: create/update; destroy: tear down"
        required: false
        default: "apply"
        type: choice
        options: [apply, destroy]
      run_deploy:
        description: "Run the Ansible deployment after provision"
        required: false
        default: true
        type: boolean
      run_dns:
        description: "Sync Cloudflare DNS after deployment"
        required: false
        default: true
        type: boolean
permissions:
  contents: read
concurrency:
  group: deploy-ai-workspace-iac
  cancel-in-progress: false
env:
  INFRA_REPO: ${{ github.repository_owner }}/ai-workspace-infra
  # VPS_ROOT is the vultr-vps root (shared scripts/ templates/ config/);
  # ENV_DIR is the terraform working directory (workdir).
  VPS_ROOT: infra/iac_modules/terraform-hcl-standard/vultr-vps
  ENV_DIR: infra/iac_modules/terraform-hcl-standard/vultr-vps/envs/ai-workspace
  PLAYBOOKS_DIR: infra/playbooks
jobs:
  # ---------------------------------------------------------------------------
  provision:
    name: Provision (terraform + render CMDB)
    runs-on: ubuntu-latest
    env:
      HAS_BACKEND: ${{ secrets.TF_STATE_BUCKET != '' }}
    outputs:
      hosts: ${{ steps.matrix.outputs.hosts }}
      count: ${{ steps.matrix.outputs.count }}
    steps:
      - name: Checkout infra (iac_modules + playbooks)
        uses: actions/checkout@v4
        with:
          repository: ${{ env.INFRA_REPO }}
          ref: ${{ github.event.inputs.infra_ref || 'main' }}
          token: ${{ secrets.INFRA_REPO_TOKEN || github.token }}
          path: infra
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.9.8"
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install render deps
        run: pip install --quiet pyyaml jinja2
      - name: Configure remote backend (optional)
        if: ${{ env.HAS_BACKEND == 'true' }}
        working-directory: ${{ env.ENV_DIR }}
        run: |
          set -euo pipefail
          cat > backend.tf <<'EOF'
          terraform {
            backend "s3" {
              skip_credentials_validation = true
              skip_region_validation      = true
              skip_requesting_account_id  = true
              skip_metadata_api_check     = true
              force_path_style            = true
            }
          }
          EOF
      - name: generate.py render (YAML -> explicit HCL + tfvars)
        working-directory: ${{ env.VPS_ROOT }}
        run: python3 scripts/generate.py render
      - name: Terraform init
        working-directory: ${{ env.ENV_DIR }}
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.TF_STATE_ACCESS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.TF_STATE_SECRET_KEY }}
        run: |
          set -euo pipefail
          if [ -n "${{ secrets.TF_STATE_BUCKET }}" ]; then
            terraform init -input=false \
              -backend-config="endpoint=${{ secrets.TF_STATE_ENDPOINT }}" \
              -backend-config="bucket=${{ secrets.TF_STATE_BUCKET }}" \
              -backend-config="key=ai-workspace/terraform.tfstate" \
              -backend-config="region=${{ secrets.TF_STATE_REGION || 'us-east-1' }}"
          else
            echo "::warning::No remote state configured; using local state (only suitable for a one-off demo; destroy must happen in the same run)"
            terraform init -input=false
          fi
      - name: Terraform ${{ github.event.inputs.terraform_action || 'apply' }}
        working-directory: ${{ env.ENV_DIR }}
        env:
          TF_VAR_vultr_api_key: ${{ secrets.VULTR_API_KEY }}
        run: |
          set -euo pipefail
          terraform ${{ github.event.inputs.terraform_action || 'apply' }} -auto-approve -input=false
      - name: generate.py inventory (terraform output + YAML -> cmdb.json + inventory.ini)
        if: ${{ (github.event.inputs.terraform_action || 'apply') == 'apply' }}
        working-directory: ${{ env.VPS_ROOT }}
        run: python3 scripts/generate.py inventory
      - name: Build deploy matrix from cmdb.json
        id: matrix
        if: ${{ (github.event.inputs.terraform_action || 'apply') == 'apply' }}
        working-directory: ${{ env.ENV_DIR }}
        run: |
          set -euo pipefail
          hosts="$(jq -c 'keys' cmdb.json)"
          echo "hosts=${hosts}" >> "$GITHUB_OUTPUT"
          echo "count=$(jq 'length' cmdb.json)" >> "$GITHUB_OUTPUT"
          echo "matrix hosts: ${hosts}"
      - name: Upload CMDB + inventory artifact
        if: ${{ (github.event.inputs.terraform_action || 'apply') == 'apply' }}
        uses: actions/upload-artifact@v4
        with:
          name: ai-workspace-cmdb
          path: |
            ${{ env.ENV_DIR }}/cmdb.json
            ${{ env.ENV_DIR }}/inventory.ini
          if-no-files-found: error
  # ---------------------------------------------------------------------------
  deploy:
    name: Deploy ${{ matrix.host }} (on-host bootstrap)
    needs: provision
    # count is empty when provision skipped the matrix step (terraform_action=destroy);
    # without this guard, fromJSON('') below would fail the job.
    if: ${{ needs.provision.outputs.count != '' && needs.provision.outputs.count != '0' && (github.event.inputs.run_deploy == 'true' || github.event.inputs.run_deploy == null) }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        host: ${{ fromJSON(needs.provision.outputs.hosts) }}
    steps:
      # all-in-one follows an "execute locally on the target host" model (inside the
      # host, ansible-playbook -c local automatically uses the offline bundle for
      # acceleration). Running all-in-one remotely from the runner hits the
      # delegate_to: localhost in roles/agent_skills (which writes to the runner's
      # local /root), so deploy instead sshes to the host and runs the official
      # bootstrap script there: exactly the same curl|bash path as a user's self-host.
      - name: Download CMDB (host IP source)
        uses: actions/download-artifact@v4
        with:
          name: ai-workspace-cmdb
          path: cmdb
      - name: Configure SSH
        run: |
          set -euo pipefail
          mkdir -p ~/.ssh
          printf '%s\n' "${{ secrets.ANSIBLE_SSH_KEY }}" > ~/.ssh/id_ed25519
          chmod 600 ~/.ssh/id_ed25519
      - name: Wait for host SSH
        run: |
          set -euo pipefail
          ip="$(jq -r '.["${{ matrix.host }}"].ip' cmdb/cmdb.json)"
          echo "Waiting for ${{ matrix.host }} (${ip}:22) ..."
          for _ in $(seq 1 60); do
            if nc -z -w 5 "$ip" 22; then echo "SSH up"; exit 0; fi
            sleep 10
          done
          echo "::error::Timed out waiting for ${ip}:22"; exit 1
      - name: Run on-host bootstrap (curl | bash, local-mode install)
        env:
          DEEPSEEK_API_KEY: ${{ secrets.DEEPSEEK_API_KEY }}
          NVIDIA_API_KEY: ${{ secrets.NVIDIA_API_KEY }}
          OLLAMA_API_KEY: ${{ secrets.OLLAMA_API_KEY }}
        run: |
          set -euo pipefail
          ip="$(jq -r '.["${{ matrix.host }}"].ip' cmdb/cmdb.json)"
          user="$(jq -r '.["${{ matrix.host }}"].ansible_user // "root"' cmdb/cmdb.json)"
          echo "Bootstrapping ${{ matrix.host }} (${user}@${ip}) on-host ..."
          ssh -i ~/.ssh/id_ed25519 \
            -o StrictHostKeyChecking=accept-new \
            -o ServerAliveInterval=20 -o ServerAliveCountMax=15 \
            -o ConnectTimeout=20 \
            "${user}@${ip}" \
            "DEEPSEEK_API_KEY='${DEEPSEEK_API_KEY}' \
             NVIDIA_API_KEY='${NVIDIA_API_KEY}' \
             OLLAMA_API_KEY='${OLLAMA_API_KEY}' \
             bash -lc 'curl -sfL https://install.svc.plus/ai-workspace | bash -'"
  # ---------------------------------------------------------------------------
  dns:
    name: Sync Cloudflare DNS
    needs: [provision, deploy]
    if: ${{ needs.provision.outputs.count != '0' && (github.event.inputs.run_dns == 'true' || github.event.inputs.run_dns == null) }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout infra (playbooks)
        uses: actions/checkout@v4
        with:
          repository: ${{ env.INFRA_REPO }}
          ref: ${{ github.event.inputs.infra_ref || 'main' }}
          token: ${{ secrets.INFRA_REPO_TOKEN || github.token }}
          path: infra
      - name: Download CMDB + inventory
        uses: actions/download-artifact@v4
        with:
          name: ai-workspace-cmdb
          path: cmdb
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install Ansible
        run: pip install --quiet ansible
      - name: Reconcile Cloudflare DNS from inventory
        working-directory: ${{ env.PLAYBOOKS_DIR }}
        env:
          CLOUDFLARE_DNS_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
        run: |
          set -euo pipefail
          # Sync A records only for the hosts in the ai_workspace group created by
          # this run (names come from each host's service_domains hostvar, values
          # from its public IP); other static records are left untouched.
          ansible-playbook \
            -i "${GITHUB_WORKSPACE}/cmdb/inventory.ini" \
            update_cloudflare_dns.yml \
            -e '{"cloudflare_dns_source_hosts":["ai_workspace"],"cloudflare_dns_static_records":[]}'
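
The workflow only reveals part of the cmdb.json contract: a top-level object keyed by host name, with an `ip` field and an optional `ansible_user` per host. As a rough sketch under that assumption, the Python below mirrors the three jq expressions used by the provision and deploy jobs (`keys`, `length`, and `.ansible_user // "root"`). The sample host names and addresses are hypothetical, and the real file written by generate.py may carry more fields than shown here.

```python
import json

# Hypothetical sample; only the `ip` and `ansible_user` fields are taken from
# the workflow. Host names and addresses are made up for illustration.
SAMPLE_CMDB = {
    "ws-b": {"ip": "203.0.113.11", "ansible_user": "deploy"},
    "ws-a": {"ip": "203.0.113.10"},
}


def build_matrix(cmdb):
    """Mirror `jq -c 'keys'` and `jq 'length'`: sorted host names plus a count."""
    hosts = json.dumps(sorted(cmdb), separators=(",", ":"))
    return hosts, len(cmdb)


def ssh_target(cmdb, host):
    """Mirror `.ip` and `.ansible_user // "root"` for one matrix host."""
    entry = cmdb[host]
    user = entry.get("ansible_user")
    # jq's `//` falls back only on null or false, not on an empty string.
    if user is None or user is False:
        user = "root"
    return f"{user}@{entry['ip']}"


if __name__ == "__main__":
    hosts, count = build_matrix(SAMPLE_CMDB)
    print(f"hosts={hosts}")
    print(f"count={count}")
    for name in json.loads(hosts):
        print(name, ssh_target(SAMPLE_CMDB, name))
```

Because `keys` sorts host names, the matrix order is stable across runs regardless of the order in which generate.py writes the hosts.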