1436ee9092
2 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
a645d464e6
|
fix(docker): use system Node in componentized builders + retry apk add (#28888)
* fix(docker): use system Node in componentized builders + retry apk add

  Two failure modes in the componentized image builds (backend, migrations, gateway) on project-releaser, with the same root cause:

  1. The builder-stage `apk add` was missing `libatomic`. `prisma generate` triggers prisma-client-py's `nodeenv`, which downloads the latest stable Node.js at build time. Node 26.1.0 (last passing build on 2026-05-20) did not dynamically link `libatomic.so.1`. Node 26.2.0 (current latest) does, and the Wolfi builder doesn't ship libatomic, so `npm install prisma@…` fails with `node: error while loading shared libraries: libatomic.so.1` and exit 127. Retrying or pinning the Node version is a treadmill; the root issue is that nodeenv decides the Node version at build time.

     Fix: add `nodejs npm` to the builder-stage `apk add` so prisma-client-py uses Wolfi's own Node via its default `PRISMA_USE_GLOBAL_NODE=true`. The legacy `docker/Dockerfile.non_root` already does this; the componentized Dockerfiles regressed it. Setting `PRISMA_USE_GLOBAL_NODE=true` in ENV redundantly nails the intent so a future env override can't silently re-enable nodeenv's download.

  2. Transient `apk.cgr.dev` mirror flakes during the arm64 leg of multi-arch builds cause individual package fetches to fail mid-install (we saw `nss-db-2.43-r7: remote server returned error (try 'apk update')` and similar for libzstd1, libogg, binutils in this run). None of the componentized Dockerfiles wrap `apk add` in a retry loop.

     Fix: wrap every `apk add` (builder + runtime, all three files) in the same `for i in 1 2 3; do … && break || sleep 5; done` loop that the legacy `docker/Dockerfile.non_root` already uses.

  Affected files all have the same shape (backend, migrations, gateway) because they're three near-identical componentizations of the original monolithic proxy Dockerfile.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(docker): trim verbose comments on builder Node setup

  Same fix, leaner comments. The apk-add note is 3 lines now (was 8), and the PRISMA_USE_GLOBAL_NODE bullet matches the existing UV_* comment style.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docker): make apk-add retry loop fail loudly on exhaustion

  Greptile flagged that the retry pattern `apk add ... && break || sleep 5` exits 0 when all three attempts fail, because `sleep 5` is the last executed command. A persistent apk.cgr.dev outage would produce a silently "successful" RUN layer with no packages installed, followed by cryptic "command not found" errors in downstream RUN steps.

  Fix: explicitly fail on the third miss before sleeping. Same pattern in all six retry loops (3 files × builder + runtime).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Yassin Kortam <yassinkortam@Yassins-MBP.localdomain>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
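The exit-status problem described in the last fix can be reproduced outside Docker. The following is a minimal sketch, not the code from the Dockerfiles: it assumes a POSIX `sh` is on the PATH, uses `false`/`true` as stand-ins for a failing or succeeding `apk add`, and uses `sleep 0` instead of `sleep 5` to keep it fast. The `STRICT` loop is one possible way to "fail on the third miss"; the exact form used in the commit is not shown in the message.

```python
import subprocess

# Retry loop in the shape quoted in the commit message. CMD is a placeholder
# that stands in for `apk add ...` (hypothetical substitution for the demo).
NAIVE = 'for i in 1 2 3; do CMD && break || sleep 0; done'

# One possible loud-failure variant (an assumption, not the committed code):
# on the third failed attempt, exit non-zero instead of sleeping.
STRICT = 'for i in 1 2 3; do CMD && break || { [ "$i" -lt 3 ] || exit 1; sleep 0; }; done'


def run_loop(template: str, cmd: str) -> int:
    """Run the loop under sh with CMD replaced by `cmd`; return its exit status."""
    script = template.replace("CMD", cmd)
    return subprocess.run(["sh", "-c", script], check=False).returncode


if __name__ == "__main__":
    # All three attempts fail, yet the naive loop reports the status of the
    # last command it ran, which is the sleep.
    print("naive, always failing: ", run_loop(NAIVE, "false"))
    # The strict variant surfaces the failure to the caller.
    print("strict, always failing:", run_loop(STRICT, "false"))
    # A successful attempt still breaks out of the loop cleanly.
    print("strict, succeeding:    ", run_loop(STRICT, "true"))
```

In a Dockerfile `RUN` step, the non-zero status from the strict form is what stops the build at the layer that actually failed, rather than at a later step that depends on the missing packages.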
||
|
|
014cb8fa9d
|
feat: add componentized proxy deployment with gateway, backend, ui, and migrations (#27557)
Split the monolithic LiteLLM proxy into independently scalable Kubernetes components to allow separate horizontal scaling of the LLM data plane and management API surfaces.

- Add DatabaseURLSettings pydantic-settings model that assembles DATABASE_URL (and optional DATABASE_URL_READ_REPLICA) from discrete DATABASE_* env vars before Prisma initializes, supporting both IAM token auth (minting short-lived RDS tokens) and password auth; replaces the CLI-only path that componentized entrypoints bypass
- Add gateway component (port 4000) that trims the proxy route table to the LLM data-plane surface (chat, embeddings, completions, audio, realtime, provider passthroughs, health/metrics) via an allowlist applied inside the lifespan context so plugin-registered routes are captured
- Add backend component (port 4001) that exposes the management/admin surface (keys, users, teams, orgs, spend analytics, model management, SSO, audit logs) with a complementary allowlist
- Add ui component: Next.js static export served by nginx (port 3000) with RSC payload routing, asset prefix aliasing, and SPA fallback for dashboard routes
- Add migrations component with dedicated Dockerfile that runs prisma migrate deploy via a Helm pre-install/pre-upgrade Job, eliminating per-pod schema contention on the Prisma advisory lock
- Add Helm chart (helm/litellm) with separate Deployments, Services, HPAs, and ConfigMap for each component; shared _helpers.tpl emits DATABASE_*, IAM_TOKEN_DB_AUTH, REDIS_*, and DISABLE_SCHEMA_UPDATE env vars from chart values; ingress template routes traffic to the correct component by path prefix
- Add comprehensive tests for DatabaseURLSettings covering IAM auth, password auth, read replica fallbacks, operator-pinned URL preservation, and percent-encoding; add coverage test asserting gateway + backend allowlist union equals the full proxy route set
- Add pydantic-settings>=2.14.1 as a proxy extra dependency and update liccheck allowlist

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu> |
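To illustrate the idea behind assembling DATABASE_URL from discrete variables, here is a rough standard-library sketch of the password-auth path only. It is not the DatabaseURLSettings model from the commit: the function name `build_database_url` and the specific variable names (`DATABASE_HOST`, `DATABASE_PORT`, `DATABASE_USER`, `DATABASE_PASSWORD`, `DATABASE_NAME`) are assumptions chosen for the example, and IAM token minting and read-replica handling are left out. It shows two behaviours the commit's tests mention: an operator-pinned URL is preserved, and credentials are percent-encoded.

```python
from typing import Mapping, Optional
from urllib.parse import quote


def build_database_url(env: Mapping[str, str]) -> Optional[str]:
    """Assemble a PostgreSQL URL from discrete settings (hypothetical names)."""
    # If the operator already pinned a full URL, leave it untouched.
    pinned = env.get("DATABASE_URL")
    if pinned:
        return pinned

    host = env.get("DATABASE_HOST")
    if not host:
        # Not enough information to assemble anything.
        return None

    # safe="" forces characters such as '@', ':' and '/' to be percent-encoded
    # so they cannot be mistaken for URL delimiters.
    user = quote(env.get("DATABASE_USER", ""), safe="")
    password = quote(env.get("DATABASE_PASSWORD", ""), safe="")
    name = quote(env.get("DATABASE_NAME", ""), safe="")
    port = env.get("DATABASE_PORT", "5432")

    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


if __name__ == "__main__":
    example = {
        "DATABASE_HOST": "db.internal",
        "DATABASE_USER": "app",
        "DATABASE_PASSWORD": "p@ss/w:rd",
        "DATABASE_NAME": "litellm",
    }
    print(build_database_url(example))
```

Taking the environment as a mapping argument, rather than reading `os.environ` directly, keeps the function easy to test; a caller could pass `os.environ` at startup before the database client initializes.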