litellm

Author	SHA1	Message	Date
Mateo Wang	20e453f698	feat(cli): per-agent `lite claude` / `codex` / `opencode` commands that wrap coding agents through the proxy (#29850 ) * feat(cli): add `litellm-proxy run -- <agent>` to wrap coding agents through the proxy Wraps Claude Code, Codex, OpenCode, and any other coding agent so all of its LLM traffic routes through a LiteLLM proxy, with the agent-vault style of "just works" DX: one `run -- <agent>` command, auto SSO login when interactive, env-key "agent mode" for containers/CI, and a fail-fast key check against the proxy so bad credentials error immediately instead of deep inside the agent. The wrapped binary is detected by name to pick the right variables. Claude Code gets ANTHROPIC_BASE_URL (the bare proxy root, so it appends /v1/messages) and ANTHROPIC_AUTH_TOKEN, with any stray ANTHROPIC_API_KEY cleared so the proxy token wins. Codex and OpenCode get OPENAI_BASE_URL (proxy + /v1) and OPENAI_API_KEY. Unrecognized commands get both sets so they work either way. `litellm-proxy claude-code` remains as a shortcut for `run -- claude`. The core logic is split into dependency-injected helpers (agent_profile, build_agent_env, verify_proxy_key, run_agent) so env wiring, the preflight, and the launch handoff are unit-tested without monkeypatching, alongside CliRunner tests for auth resolution, agent mode, and auto-login. Mutation-tested the env profiles, preflight, and agent-mode branch to confirm the tests fail when the behavior is broken. https://claude.ai/code/session_0154VpLXW7mMvk5wfbgPRJa6 * Make each coding agent its own litellm-proxy command Replace the `run -- <agent>` interface and the `claude-code` shortcut with top-level commands generated per known agent, so launching is just `litellm-proxy claude`, `litellm-proxy codex`, or `litellm-proxy opencode`, with everything after the agent name forwarded straight to it. This drops the ceremony of `run --` and cuts typing. 
The `--model`/`--small-fast-model` wrapper flags are gone; pass the agent's own model flag instead, or export the model env vars (the wrapper preserves what you already have set), which keeps the surface minimal and avoids intercepting flags the agent owns. Rename the module to agents.py to match. * fix(cli): route `litellm-proxy codex` through the proxy via a custom provider Codex ignores OPENAI_BASE_URL (it always dials api.openai.com over the Responses WebSocket transport), so the OpenAI env profile alone left `litellm-proxy codex` talking to OpenAI directly instead of the proxy. Point Codex at the proxy with a custom provider passed as `-c` config overrides, and force the HTTP/SSE Responses transport with supports_websockets=false since the proxy does not speak the Responses WebSocket protocol. The provider reads its key from OPENAI_API_KEY, which the agent env already exports. The overrides are injected ahead of the user's args so they precede Codex's subcommand. Claude Code and OpenCode are unaffected; they honor the exported env vars. Adds regression tests for the per-agent launch args and the injection ordering. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * Rename litellm-proxy CLI command to lite The proxy management CLI was invoked as litellm-proxy, which is a lot to type for an everyday command. Rename the console script entry point to lite and update the in-CLI usage examples, help text, error messages and docs to match. * fix(sso): stop CLI auth success page from hanging on "Closing..." The CLI opens the SSO success page with webbrowser.open, so the tab is not script-opened and the browser refuses window.close(). The countdown would end on "Closing..." and the tab would sit there forever. Drop the countdown and just show "You can now close this window and return to your terminal." from the start, while still attempting window.close() once so the tab auto-closes in the rare case the browser allows it. 
Add a regression test asserting the manual-close instruction is always present and the misleading countdown/"Closing..." text is gone. * fix(cli): reattach controlling terminal after SSO login, keep litellm-proxy alias When the first `lite claude` has to log in via browser SSO, completing the login could leave stdin detached from the terminal, so a TUI agent like Claude Code would start in non-interactive mode and exit with "Input must be provided". The wrapper now reopens the controlling terminal onto stdin just before handoff when the session started interactively; piped or redirected input is detected up front and left alone, so agent-mode and non-interactive use are unchanged. Also keep the `litellm-proxy` console script as an alias for `lite` so existing scripts and CI that invoke `litellm-proxy` keep working; both names map to the same CLI. * feat(install): make the curl installer need only curl, not a pre-existing Python The installer now lets uv provision a managed Python 3.13 when no suitable interpreter is found, instead of aborting. The minimum is also bumped from 3.9 to 3.10 to match the package's requires-python (>=3.10), so a system Python 3.9 is no longer selected only for uv tool install to reject it. * feat(cli): add thin litellm[cli] install path (install-cli.sh + brew) for the lite CLI On a developer laptop the `lite` CLI only needs `lite login` and running coding agents through a proxy, but the sole install path was `litellm[proxy]`, which drags in the whole server tree (fastapi, uvicorn, boto3, polars, cryptography, litellm-enterprise). The CLI's heavy imports are all guarded, so it runs on the base SDK plus just rich, pyyaml and requests. Add a `cli` extra carrying exactly those three, a `scripts/install-cli.sh` curl one-liner that installs `litellm[cli]`, and a `BerriAI/homebrew-litellm` tap formula with a release runbook under `packaging/homebrew/`. 
The installer passes no `--python`, so uv honours litellm's requires-python and provisions a managed interpreter, skipping a too-old (3.9) or too-new (3.14+) system Python instead of failing to resolve. A pyproject thin-contract test asserts the `cli` extra keeps the deps the CLI imports and never leaks a server-only dependency from `proxy`, so the laptop install cannot silently re-bloat * fix(install): let uv pick the Python via --python-preference system Both installers detected a system Python with a floor-only check and forced it with `uv tool install --python <interp>`. On a host whose only Python is outside litellm's requires-python (a too-old 3.9 or, increasingly, a too-new 3.14) that forced an incompatible interpreter and the resolve failed. Drop the detection and pass `--python-preference system`: uv reuses a compatible system Python when present and downloads a managed one otherwise, always honouring requires-python * test(router): filter aiohttp unclosed-session gc noise in test_async_fallbacks test_async_fallbacks asserts the last three captured log records are the router's fallback messages. Under the litellm_router_testing job (pytest -k router -n 4) many router tests share the module-level in_memory_llm_clients_cache (max 200, ttl 3600s). Older cached OpenAI/Azure clients get evicted while their aiohttp ClientSession is still open, and when the gc reclaims them aiohttp emits "Unclosed client session"/"Unclosed connector" through the asyncio logger. Those records land in caplog mid-test and push the expected router logs out of the last-three window, so the assertion flips to failing non-deterministically. These warnings are async cleanup noise, not router debug logs, so filter them out exactly like the existing leaked-task warnings before asserting order. The assertion on the three router fallback messages is unchanged. 
--------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com>	2026-06-10 13:52:26 -07:00
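The per-agent environment wiring described in the first commit message above can be sketched as follows. This is a minimal illustration, not the code from the PR: `build_agent_env` is named in the commit, but its signature, the agent-name sets, and clearing `ANTHROPIC_API_KEY` for unrecognized commands are all assumptions made for the example.

```python
from typing import Dict, Mapping

# Assumed groupings; the commit only says the binary is "detected by name".
ANTHROPIC_AGENTS = {"claude"}
OPENAI_AGENTS = {"codex", "opencode"}


def build_agent_env(
    agent: str, proxy_url: str, key: str, base_env: Mapping[str, str]
) -> Dict[str, str]:
    """Return a copy of base_env with proxy variables set for the given agent."""
    env = dict(base_env)
    root = proxy_url.rstrip("/")
    known = agent in ANTHROPIC_AGENTS or agent in OPENAI_AGENTS

    if agent in ANTHROPIC_AGENTS or not known:
        # Bare proxy root: Claude Code appends /v1/messages itself.
        env["ANTHROPIC_BASE_URL"] = root
        env["ANTHROPIC_AUTH_TOKEN"] = key
        # Drop any stray API key so the proxy token wins.
        env.pop("ANTHROPIC_API_KEY", None)

    if agent in OPENAI_AGENTS or not known:
        # OpenAI-style clients expect the /v1 suffix on the base URL.
        env["OPENAI_BASE_URL"] = root + "/v1"
        env["OPENAI_API_KEY"] = key

    return env
```

Passing the base environment in as an argument, rather than reading `os.environ` inside the function, is what lets the wiring be tested without monkeypatching. As the later fix in the same PR notes, Codex ignores `OPENAI_BASE_URL`, so these variables alone are not enough for it; that custom-provider override is not shown here.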
yuneng-jiang	bac2590b39	build(deps): bump pyjwt to 2.13.0 and ws override to 8.20.1 (#29982) Raise the PyJWT floor in pyproject (>=2.13.0,<3.0) and re-resolve uv.lock so the proxy installs 2.13.0 instead of 2.12.0. Bump the ws transitive-version override in the dashboard from 8.19.0 to 8.20.1 and regenerate package-lock; jsdom and openai both dedupe onto the single 8.20.1 copy. Both are routine dependency maintenance bumps to keep pinned versions current.	2026-06-08 16:39:21 -07:00
yuneng-jiang	f1667b9137	chore(deps): bump deps (#29860) * bump: version 0.4.73 → 0.4.74 * bump: version 1.88.0 → 1.89.0 * uv lock	2026-06-06 21:44:54 +00:00
yuneng-jiang	28c0d8579b	chore(deps): bump deps (#29373) * bump: version 0.1.41 → 0.1.42 * uv lock	2026-05-30 20:41:23 -07:00
Yassin Kortam	d82eb33a60	feat(otel): typed semconv-aligned OpenTelemetry instrumentation (#28909)	2026-05-29 23:15:27 -07:00
yuneng-jiang	ffc113b428	chore(ci): bump version (#29242) * bump: version 1.87.0 → 1.88.0 * uv lock	2026-05-28 18:49:04 -07:00
yuneng-jiang	5e2d75d75d	bump deps (#29208 ) (#29226 ) * fix(deps): bump vulnerable proxy dependencies (starlette/fastapi, granian, pyarrow, semantic-router) Resolve known CVEs flagged by osv-scanner/grype against uv.lock. All bumped versions verified to resolve, install, and pass the proxy auth/route/middleware unit suites (717 tests) plus an import smoke on the new stack. - starlette 0.50.0 -> 1.1.0 (CVE-2026-48710 "BadHost", GHSA-86qp-5c8j-p5mr): versions <1.0.1 reconstruct request.url from the unvalidated Host header, poisoning request.url.path. Required raising fastapi 0.124.4 -> 0.136.3, which dropped fastapi's starlette<0.51.0 cap; an explicit starlette>=1.0.1 floor blocks regression to a vulnerable transitive resolution. The proxy's own auth already reads scope["path"] via get_request_route, but the locked starlette still flagged in container scanners and left other request.url consumers exposed. - granian 2.5.7 -> 2.7.4 (CVE-2026-42544, unauthenticated DoS via WebSocket subprotocol header panic; CVE-2026-42545, WSGI response-header-panic DoS). granian is a selectable proxy server (proxy_cli). - pyarrow 22.0.0 -> 23.0.1 (CVE-2026-25087 / PYSEC-2026-113). - semantic-router 0.1.12 -> 0.1.15: 0.1.12 was yanked (CVE-2026-42208 — its unbounded litellm pin could resolve a credential-exfiltrating litellm==1.82.8 wheel). Not fixable by bump: diskcache 5.6.3 (CVE-2025-69872, unsafe pickle deserialization) has no upstream fix and is left pinned; exploiting it requires write access to the local cache directory. Relock side effect: sse-starlette 3.4.2 -> 3.4.4. * deps: relax exact pins in optional extras to compatible ranges The proxy/optional extras exact-pinned every dependency, which (1) forces downstream `pip install litellm[proxy]` consumers into version lockstep and (2) blocks them from pulling transitive security patches without forking — the structural cause behind needing a litellm release to clear the starlette CVE in the previous commit. 
Convert the ordinary extras deps to `>=current,<next_major` ranges, mirroring the core [project].dependencies style. Reproducibility for litellm's own Docker/CI is unaffected: images install via `uv sync --frozen`, and the lock re-resolves to the identical versions (no locked version changed). Kept exact-pinned: - litellm-proxy-extras, litellm-enterprise — litellm's own sub-packages, versioned in lockstep with the release. - opentelemetry-api/sdk/exporter-otlp — must resolve to matching versions. - grpcio — supply-chain-pinned to a vetted, aged release. Also corrects the stale comment claiming the extras are exact-pinned for Docker reproducibility (the images use the lock, not these pins). * fix(ci): resolve license-check lookup version from the floor for ranged deps check_licenses.py derived the PyPI lookup version with `next(iter(req.specifier))`, which returns an arbitrary specifier clause. For a range like `>=0.12.1,<1.0` it picked the upper bound (`1.0`) — a version that doesn't exist on PyPI — so the license lookup 404'd and the package was flagged as having an unknown license. The previous commit's switch from exact pins to ranges exposed this for soundfile, pyroscope-io, redisvl, diskcache, and mlflow (the ranged deps not already in liccheck.ini's allowlist). Prefer a lower-bound/exact version (a real released version) for the lookup. * fix(proxy): set strict_content_type=False on the FastAPI app Starlette 1.0 / FastAPI 0.13x flipped the default to strict_content_type=True, which refuses to parse a JSON request body when the client omits the Content-Type header. The proxy previously accepted those requests, so the fastapi/starlette bump in this PR would silently break clients that don't send a Content-Type. Restore the prior lenient behavior explicitly. Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>	2026-05-28 16:48:14 -07:00
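The license-check fix described above (use the floor or exact pin of a ranged requirement as the PyPI lookup version, not an arbitrary clause) can be sketched with the standard library alone. This is an illustrative assumption about the approach, not the code in check_licenses.py; `lookup_version` and its operator preference order are hypothetical names and choices.

```python
import re
from typing import List, Optional, Tuple

# Longer operators come first so ">=" is not matched as ">".
_CLAUSE = re.compile(r"^\s*(===|==|~=|!=|<=|>=|<|>)\s*([^\s,;]+)\s*$")

# Operators whose version is a released (or at least plausible) version:
# exact pins first, then lower bounds.
_PREFERRED = ("==", "===", ">=", "~=")


def lookup_version(specifier: str) -> Optional[str]:
    """Pick an exact or lower-bound version from a specifier string.

    Returns None when the specifier has only upper bounds or exclusions,
    since those versions may not exist on the package index.
    """
    clauses: List[Tuple[str, str]] = []
    for part in specifier.split(","):
        match = _CLAUSE.match(part)
        if match:
            clauses.append((match.group(1), match.group(2)))

    for wanted in _PREFERRED:
        for op, version in clauses:
            # Skip wildcard pins such as "==1.0.*"; they name no single release.
            if op == wanted and not version.endswith(".*"):
                return version
    return None
```

For the range quoted in the commit, `lookup_version(">=0.12.1,<1.0")` returns `"0.12.1"` regardless of clause order, which avoids looking up the non-existent `1.0` upper bound.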
harish-berri	d04373f4ce	Add granian as an ASGI-compliant web server. Provides better throughput stability (#26027) * Add granian as an ASGI-compliant web server. Provides better stability and a 10-20 RPS improvement under standard LT conditions. TODO: Verify poetry lock details and add locust numbers to PR * Update granian version in license_cache.json and pyproject.toml to 2.5.7 * Enhance proxy CLI tests by adding SSL initialization checks for the Granian server. Remove Python version skip conditions and implement tests to ensure that an SSL certificate and key are required for server initialization. * Update uv lock to fix granian import error	2026-05-21 19:08:37 -07:00
yuneng-jiang	2a5dfcd5bc	build(deps-dev): bump black to 26.3.1 and apply formatting (#28525) * build(deps-dev): bump black 24.10.0 -> 26.3.1 * style: apply black 26.3.1 formatting * chore: authorize black 26.3.1 license in liccheck.ini	2026-05-21 17:24:18 -07:00
yuneng-jiang	67e6e5e1df	test(proxy): behavior-pinning matrix for team management endpoints (#28441 ) * test(proxy): behavior-pinning matrix for team management endpoints PR2 (Team Tier-1) of the management-endpoint behavior-pinning effort. Extends the tests/proxy_behavior/management/ harness PR1 built and adds the actor x target-resource authz matrix for the 7 team endpoints: /team/new, /team/info, /team/list, /team/update, /team/member_add, /team/member_delete, /team/member_update. Tests-only, no production code changes. Harness extensions: - actors.py: ORG_B_ADMIN actor (org admin of ORG_B) and TEAM_GAMMA (an ORG_A team with no actor members), so team-targeting endpoints get a clean own / same-org-other / cross-org target axis. - conftest.py: create_scratch_team() raw-seeds target teams without /team/new side effects; the scratch teardown now also strips dangling scratch-team refs from LiteLLM_UserTable.teams. 156 new scenarios; status codes pinned to observed handler behavior. * test(proxy): record mutmut run blockers in PR2 triage doc Attempted a scoped local mutmut run for G5; it did not complete. Record the three concrete blockers in mutmut_triage/pr2-team-tier1.md so the next attempt has a head start: 1. mutmut's mutants/ sandbox is import-shadowed by the worktree source. 2. the legacy mock suite and the real-DB behavior suite cannot share a pytest session (mock suite globally patches prisma_client). 3. the CI mutation-test.yml workflow starts no Postgres, so its stats phase now aborts on the behavior-suite tests PR1 added to tests_dir. mutmut stays a deferred follow-up (as in PR1); the binding pre-merge signal remains the behavior matrix (G1) and the G4 regression-replay. 
* test(proxy): drop suite README + triage doc, trim test comments Remove the two prose docs from the behavior suite (README.md and mutmut_triage/pr2-team-tier1.md) and tighten the comment blocks on the team test files + harness down to the load-bearing parts (the gate each matrix pins, plus genuinely surprising results). No behavior change — all 286 scenarios still pass. * test(proxy): remove mutmut tests_dir comment	2026-05-21 16:57:25 -07:00
Sameer Kankute	b7e978a5c3	Litellm oss staging 04 21 2026 2 (#26569 ) * fix(bedrock): use model info lookup for output_config support instead of hardcoded check Replace hardcoded _is_claude_4_6_model() string matching with supports_output_config flag in model_prices_and_context_window.json, accessed via _supports_factory(). This follows the project's established pattern for model capability checks (per AGENTS.md rule #8). Bedrock Invoke now conditionally preserves output_config for models that declare supports_output_config=true (currently Claude 4.6 models), while stripping it for older models to avoid request rejection. Ref: https://github.com/BerriAI/litellm/issues/22797 * fix(vertex_ai): single-flight credential refresh to prevent thundering herd (#26024) * fix(vertex_ai): single-flight credential refresh to prevent thundering herd When GCP credentials expire under high concurrency, all requests simultaneously call credentials.refresh() via asyncify, saturating the 40-thread anyio pool and blocking the proxy for 20+ seconds. 
This adds: - Per-credential asyncio.Lock in get_access_token_async for single-flight refresh (1 coroutine refreshes, others wait on the lock) - Background refresh when token_state is STALE (usable but near expiry), returning the current token immediately with zero added latency - threading.Lock on the sync get_access_token path - Uses google-auth's TokenState enum (FRESH/STALE/INVALID) instead of reimplementing expiry logic Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: address PR review comments - Use asyncio.create_task() instead of deprecated get_event_loop().create_task() - Track in-flight background refresh tasks to prevent duplicate refreshes when multiple STALE-path callers pass through the lock before the first background task completes - Add token validation in the STALE branch (consistent with FRESH/INVALID) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: lazy-import TokenState to avoid breaking when google-auth is not installed Also extract helper methods to bring get_access_token_async under the PLR0915 statement limit (50). 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: apply Black formatting to test file and update uv.lock Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: remove user-provided project_id from log messages (CodeQL log injection) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: avoid leaking token value in error message, log type instead Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: restore uv.lock to match litellm_oss_branch Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: remove project_id from remaining log message (CodeQL log injection) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: remove remaining project_id from log and error messages Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: reuse cached credentials in VertexAIPartnerModels (#26065) * fix: reuse cached credentials in VertexAIPartnerModels instead of creating new VertexLLM per request VertexAIPartnerModels.completion() was creating a throwaway VertexLLM() instance on every call to get an access token, bypassing the credential cache inherited from VertexBase. This caused a fresh token fetch for every single request, adding significant latency overhead. Fix: call super().__init__() to initialize VertexBase's credential cache, and use self._ensure_access_token() instead of a new VertexLLM instance. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: apply same credential caching fix to VertexAIGemmaModels and VertexAIModelGardenModels Same bug as VertexAIPartnerModels: both classes had `pass` in __init__ instead of `super().__init__()`, and created throwaway VertexLLM() instances per request instead of using self._ensure_access_token(). 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(fireworks): add glm-5p1 metadata and parallel_tool_calls (#26069) * fix(chatgpt): preserve responses routing and recover empty output (#25403) (#26219) - preserve existing shared backend `mode` when router deployment registration reuses a provider/model key already in `litellm.model_cost` (prevents alias with `mode: chat` from downgrading shared `chatgpt/gpt-5.4` from `responses` to `chat` and triggering 403s on /v1/chat/completions) - teach the ChatGPT Responses parser to recover `response.output_item.done` entries when `response.completed.output` is empty - add defensive /responses -> /chat/completions bridge fallback that reconstructs output items from raw SSE when `raw_response.output` is empty - regression coverage for shared alias routing, empty completed.output parsing, and SSE bridge recovery Closes #25403 Co-authored-by: afoninsky <andrey.afoninsky@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(deps): relax core runtime dependency pins from exact == to ranges When litellm migrated from Poetry to uv (PR #24905, v1.83.1), the core dependency specifications in pyproject.toml changed from Poetry bare-version strings (e.g. openai = "2.30.0") to PEP 621 exact pins (openai==2.24.0). Poetry bare-version strings are actually caret ranges (^X.Y.Z == >=X.Y.Z,<X+1), but PEP 621 == is exact. This means every downstream package that installs litellm as a library dependency is now forced to downgrade aiohttp, pydantic, openai, click, and 8 other common packages to exact old versions. Fix: restore range specifiers for the 12 core runtime dependencies. The optional extras (proxy, proxy-runtime, etc.) are consumed primarily by Docker images where exact pins are appropriate and are left unchanged. The uv.lock file continues to provide exact reproducibility for Docker builds and CI. 
Fixes: #26154 * Add Rubrik as officially-supported guardrail plugin (#25305) * Add Rubrik as officially-supported guardrail plugin Adds tool blocking and batch logging integration with an external Rubrik webhook service. The plugin validates LLM tool calls against a policy service (fail-open on errors) and batch-logs all requests/responses. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Update Rubrik docs: config.yaml as primary, env vars as fallback Restructures the Quick Start to present config.yaml as the recommended approach with tabbed UI, and environment variables as an alternative fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add Rubrik env vars to config_settings reference Fixes documentation validation by adding RUBRIK_API_KEY, RUBRIK_BATCH_SIZE, RUBRIK_SAMPLING_RATE, and RUBRIK_WEBHOOK_URL to the environment settings reference table. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add fallback message when blocking service returns empty explanation Prevents whitespace-only violation message when the tool blocking service blocks tools but returns an empty content field. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(ocr): add Reducto parse OCR support (#26068) * feat(ocr): add Reducto parse OCR support * fix(reducto): address OCR review feedback * chore: refresh uv lockfile * Revert "chore: refresh uv lockfile" This reverts commit 47200c0e603275108335aee852d0a96586165337. * Fix failing tests * Fix code qa * Replaced the async client violation * Replaced black formatting * Fix failing tests * Fix failing tests * Fix failing tests * Fix failing tests * Fix tests * Fix vertex ai cred test * Fix test * fix(xai): normalize usage total_tokens for prompt caching xAI can return total_tokens inconsistent with prompt_tokens + completion_tokens when caching is enabled. 
Align with OpenAI-style usage so shared LLM tests and downstream consumers see coherent totals. Apply to non-streaming responses and streaming usage chunks. Made-with: Cursor * Fix stale Vertex token refresh fallback * Fix OCR zero credit and Bedrock support checks * Fix OCR and Fireworks capability handling * fix: evict completed background refresh tasks from _background_refresh_tasks Completed asyncio.Task objects were never removed from _background_refresh_tasks. In long-running proxies with many distinct credential keys the dict grows indefinitely, retaining references to finished tasks and their results. Fix: - Pop the existing (done) entry before creating a replacement task. - Attach a done_callback to each new task that removes its entry from the dict once the task finishes (success or failure). Tests: - test_background_refresh_task_removed_after_completion: verifies the done-callback cleans up a single entry after the task completes. - test_background_refresh_tasks_no_accumulation_across_many_keys: drives 20 distinct credential keys and confirms the dict is empty after all background refreshes finish. Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * fix: guard asyncio.create_task in RubrikLogger.__init__ against missing event loop asyncio.create_task() raises RuntimeError when called outside a running event loop. Wrap the call in a try/except RuntimeError so that RubrikLogger can be instantiated in synchronous contexts (e.g. during startup, testing) without crashing. The periodic_flush background task simply won't start in those cases; it starts normally when the constructor is called inside an event loop. Add a test that verifies instantiation outside an event loop does not raise (does not patch asyncio.create_task). 
Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> * fix: preserve async batch and reauth coordination * Fix mypy * Fix xAI usage and Fireworks parallel tool params * Fix Rubrik batch drain and SSE recovery mutation * Fix router mode preservation and Rubrik batch flushing * fix(responses): merge text-only items with output items in SSE recovery When recovering output from raw SSE, OUTPUT_ITEM_DONE and OUTPUT_TEXT_DONE events were treated as mutually exclusive fallbacks. If a stream emitted OUTPUT_ITEM_DONE for some output indices and only OUTPUT_TEXT_DONE for others, the text-only items at the missing indices were silently dropped. Merge both dicts before returning, with OUTPUT_ITEM_DONE entries taking precedence at any shared index (preserving the existing behavior covered by test_transform_response_preserves_output_item_when_text_done_arrives_later). Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(rubrik): preserve events on batch send failure Previously, _log_batch_to_rubrik swallowed all HTTP errors and exceptions, and the parent flush_queue unconditionally drained the queue afterwards. On Rubrik 5xx responses, network errors, or timeouts the in-flight events were silently dropped without ever being delivered. - Re-raise from _log_batch_to_rubrik so failures surface to the caller. - In CustomBatchLogger.flush_queue, catch exceptions from async_send_batch and leave the queue intact for retry on the next flush. Existing loggers that override flush_queue (e.g. Datadog) or that swallow their own errors inside async_send_batch (e.g. Langsmith, GCS, Argilla) are unaffected. - Tests now assert events are preserved on HTTP errors, network errors, and that mid-flush appended events are also preserved on failure. 
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(chatgpt/responses): strip whitespace before parsing SSE chunks _parse_sse_json_chunk in ChatGPTResponsesAPIConfig passed the raw chunk directly to _strip_sse_data_from_chunk, which only matches the 'data:' prefix at position 0. Chunks with leading whitespace (e.g. ' data: {...}') were returned unchanged and silently failed JSON parsing, dropping the contained event. Mirror the existing fix in LiteLLMResponsesTransformationHandler._parse_raw_sse_chunk by calling chunk.strip() before stripping the SSE prefix. Adds a regression test using whitespace-padded data: lines and verifies that the response.output_item.done payload is recovered into the final ResponsesAPIResponse output. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(rubrik): override flush_queue so a single snapshot drives send and drain Previously RubrikLogger relied on CustomBatchLogger.flush_queue, which captured len(self.log_queue) separately from the snapshot taken inside async_send_batch. Although both happen without an intervening await today (so they agree in practice), they are semantically disconnected: a future refactor that adds an await between the two captures, or that changes the async_send_batch contract, could cause the parent to delete a different number of items than were actually sent and trigger duplicate deliveries to Rubrik. Override flush_queue on RubrikLogger so a single snapshot drives both the HTTP POST and the queue truncation. async_send_batch is preserved for direct callers/tests but no longer participates in the canonical flush path. Existing tests (including the one that explicitly invokes the base CustomBatchLogger.flush_queue path) still pass. 
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix: register reducto/parse-v3 and reducto/parse-legacy in active model pricing file Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(bedrock): restore output_config forwarding and black formatting Use model-map lookup with _model_supports_effort_param fallback so Bedrock Invoke keeps output_config for Claude 4.6/4.7 when pricing flags are missing. Revert custom_llm_provider=bedrock for supports_output_config checks, fix allowlist test model, and apply black to xai/vertex files failing lint CI. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(greptile): address remaining review concerns - fireworks: resolve supports_reasoning lookup for short model names by also trying the full accounts/fireworks/models/ path in model_cost - ocr_cost: drop reducto-specific guard in shared utility; treat missing pages_processed as zero cost when no per-page pricing is configured - docs: remove reducto/rubrik markdown stubs from this repo (canonical docs live in litellm-docs) * fix(model_prices): register mistral/ministral-8b-2512 Mistral's API now returns model='ministral-8b-2512' when 'mistral-tiny' is requested. Adding the entry so completion_cost can resolve the cost for that response. * fix(greptile): prune async refresh locks and lazy-start rubrik flush - vertex: back `_async_refresh_locks` with a WeakValueDictionary so a per-key Lock is auto-evicted once no coroutine holds it, preventing unbounded growth in deployments with many credential combinations while keeping single-flight semantics intact. - rubrik: defer the periodic flush task to the first log event when the logger is constructed without a running event loop, so low-traffic batches still get drained instead of being silently stranded by a swallowed RuntimeError. 
* Remove duplicate supports_max_reasoning_effort key in claude-opus-4-7 entries Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(vertex_ai): stabilize background refresh task tracking - Guard background refresh done_callback with an identity check so a stale callback cannot remove a newer task that already replaced it in the tracking dict (done_callbacks are scheduled via call_soon, so a fresh task can be stored for the same credential key before the old callback fires). - Replace WeakValueDictionary with a regular dict for _async_refresh_locks so the per-key asyncio.Lock identity is stable across concurrent callers; otherwise a lock can be GC'd between two coroutines arriving for the same key, breaking single-flight. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix: surface OCR pricing gaps and recover OUTPUT_TEXT_DONE in ChatGPT SSE - cost_calculator.ocr_cost: log a warning when pages_processed is reported but no ocr_cost_per_page is configured, instead of silently billing zero via an implicit '(... or 0.0) * pages_processed' fallback. Behavior is preserved (zero cost) so free-tier / unpriced models still work, but configuration gaps are now visible in logs. - ChatGPTResponsesAPIConfig._extract_completed_response_from_sse: also collect response.output_text.done events into a text-only items map and merge them into the recovered output (OUTPUT_ITEM_DONE wins on duplicate output_index), mirroring the LiteLLMResponses handler. This recovers text content when a provider only emits OUTPUT_TEXT_DONE and the final response.completed event has an empty output list. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(cicd): drop obsolete async refresh locks auto-prune test Commit dfb2524 intentionally reverted _async_refresh_locks from a WeakValueDictionary back to a regular Dict so the per-key asyncio.Lock identity is stable across concurrent callers — preserving single-flight semantics. 
The test asserting that the dict shrinks back to 0 after refreshes was added when the WeakValueDictionary backing was still in place; it now contradicts the deliberate design and is failing CI. * fix(rubrik): sanitize proxy_server_request and harden tool_calls parsing Address bugbot review concerns: - Sanitize proxy_server_request before forwarding to the Rubrik webhook. The previous code passed the entire inbound HTTP context (Authorization, Cookie, x-api-key, and the raw request body) through to a third-party endpoint, which exfiltrates proxy credentials and upstream secrets. The new _sanitize_proxy_server_request allowlists only url and method. (Cursor Bugbot HIGH severity #3192354895) - Treat a null choices[0].message.tool_calls as 'all blocked' rather than letting iteration raise and silently fall through the outer except in apply_guardrail (which would fail open). Iterate over a defensive fallback list instead of relying on the dict default. (Cursor Bugbot MEDIUM severity #3192349538) Co-authored-by: Cursor Bugbot <bugbot@cursor.com> * fix: restore Fireworks substring matching and use RLock for Vertex sync refresh - Fireworks _get_model_cost_capability: after exact-key lookups, fall back to substring matching against fireworks_ai/* entries in model_cost so model name variants (e.g. fine-tuned suffixes) continue to inherit capability flags like supports_reasoning. - Vertex vertex_llm_base: replace non-reentrant threading.Lock with RLock on the sync refresh path so the reauthentication retry, which recurses into get_access_token while still holding the lock, does not deadlock when reloaded credentials are also expired. 
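A minimal sketch of why the Vertex sync refresh path needs a reentrant lock, assuming a hypothetical `get_token` that retries by recursing while the lock is still held. The names are illustrative only and are not the actual vertex_llm_base code:

```python
import threading

# Hypothetical stand-in for the sync refresh lock; not the real vertex_llm_base code.
_refresh_lock = threading.RLock()


def get_token(attempts_left: int = 2) -> str:
    """Return a token, retrying once by recursing while still holding the lock."""
    with _refresh_lock:
        if attempts_left > 1:
            # Simulated reauthentication retry: re-enters get_token on the same thread.
            # With a plain threading.Lock this inner acquire would block forever.
            return get_token(attempts_left - 1)
        return "fresh-token"


def plain_lock_would_block() -> bool:
    """Show that a non-reentrant Lock cannot be re-acquired by its current holder."""
    lock = threading.Lock()
    with lock:
        # Non-blocking attempt, so the demo reports the failure instead of hanging.
        acquired = lock.acquire(blocking=False)
        if acquired:
            lock.release()
        return not acquired
```

`get_token()` returns normally because the same thread may acquire an `RLock` repeatedly, while `plain_lock_would_block()` reports that the nested acquire on a plain `Lock` fails.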
Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(rubrik): collapse BlockedToolsResult dead-code into Optional[str] The `allowed_tools` field on `BlockedToolsResult` was computed in `_extract_blocked_tools` but never read by the only caller — when any tool was blocked the integration unconditionally raised `ModifyResponseException` to reject the full response, never doing partial filtering. Drop the dataclass and return the blocking explanation directly as `Optional[str]` so there's no misleading shape hinting at unused partial-filter capability. Co-authored-by: Greptile <greptile-apps[bot]@users.noreply.github.com> * fix(greptile): prune vertex async refresh lock dict after release Address greptile's open thread on _async_refresh_locks growing unboundedly in high-cardinality deployments. - Add _maybe_prune_async_refresh_lock: drops the per-key Lock from the registry once no coroutine holds it and no coroutine is queued in lock._waiters. The check-then-pop sequence is safe under asyncio's cooperative scheduler — a waiter that arrives after the pop simply creates a fresh lock under the same key, which is fine because the previous batch is already done. - Wrap the slow-path async with lock in a try/finally so the prune runs on every exit (return, exception, reauth retry). - Extract the existing background-refresh task scheduling into _schedule_background_refresh so get_access_token_async stays under ruff's PLR0915 ("Too many statements") limit. No behaviour change. - Regression tests cover both pruning after release (the dict shrinks back to zero after each call) and the safeguard that keeps the lock alive while a waiter is still queued. * fix(greptile): pass explicit bedrock provider to _supports_factory Bedrock Invoke transformation files (chat and messages) called _supports_factory(custom_llm_provider=None, ...) which relies on auto-detection. For short Bedrock model names (e.g. 
'anthropic.claude-opus-4-6' without the version suffix) auto-detection fails and the lookup falls back through the exception path. Passing the known 'bedrock' provider explicitly makes the lookup deterministic for all Bedrock model variants, including cross-region inference profile IDs. Co-authored-by: Claude <noreply@anthropic.com> * fix(greptile): warn when OCR cost silently returns 0.0 Address greptile's P2 thread (#3144753707) about ocr_cost silently under-reporting billing when response.usage_info.pages_processed is missing. The credit-priced and unpriced fallback still has to return 0.0 (we don't know how to bill without usage), but emit a warning so the missing-data case is visible in logs instead of disappearing. The per-page-priced branch still raises, preserving the original ValueError signal callers may catch. * fix(greptile): reorder bedrock output_config strip comment labels Swap the # 5a / # 5b step labels so they appear in numerical order within the file. The new output_config-strip block was added with label # 5b above the pre-existing # 5a 'remove custom field from tools' block; rename the new block to # 5a and the pre-existing block to # 5b so the labels match the order of the steps in the file. No behavior change. Co-authored-by: Greptile Reviewer <greptile-apps@users.noreply.github.com> * Fix substring matching specificity and remove mutable Reducto OCR config state - Fireworks: _get_model_cost_capability fallback now picks the longest substring match in model_cost so more specific entries win over less specific ones (instead of returning the first match by insertion order). - Reducto OCR: drop per-request _api_key/_api_base instance attributes on _BaseReductoOCRConfig and instead thread api_key/api_base through transform_ocr_request/async_transform_ocr_request kwargs from the shared OCR HTTP handler. Makes the config safe to share/cache across concurrent requests with different credentials. 
Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(greptile): drain background refresh + warn on router mode override Address the two new findings from greptile's 19:45 review of the vertex+router surfaces. - vertex_llm_base: when the slow path sees TokenState.INVALID, await any in-flight background refresh task before invoking refresh_auth ourselves. google-auth's Credentials.refresh() is not safe to call concurrently on the same credentials object, and the background task runs outside the per-key lock. After the wait, re-check the cached token so we can short-circuit if the background refresh already restored it. Extracted the helper into _await_in_flight_background_refresh so get_access_token_async stays under ruff's PLR0915 statement budget. - router.py: when alias registration would overwrite the deployment's declared `mode` to keep the shared backend mode stable, emit a verbose_router_logger.warning so the override is visible to operators instead of silently winning. The existing fix (preventing alias registration from downgrading a shared `mode: responses` to chat) is preserved; the warning just surfaces it. * fix(cicd): apply black formatting to vertex_llm_base.py * fix(greptile): guard Reducto upload helpers against missing file_id Raise a clear ValueError when Reducto /upload returns 200 without a file_id key (or with a non-JSON body), instead of letting downstream callers see a confusing KeyError. * fireworks_ai: cache fireworks model_cost index and use hyphen-boundary matching - Build a memoized index of fireworks_ai/* entries from litellm.model_cost, invalidated by (id, len) of the model_cost dict. Avoids re-scanning the full ~30k-entry model_cost dictionary on every get_provider_info call. - Replace plain substring containment with hyphen-aligned boundary matching so a known short model name (e.g. 'some-model') cannot falsely match an unrelated longer query (e.g. 'awesome-model'). 
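One way to read the hyphen-aligned boundary matching described above is as a comparison of whole hyphen-separated segments. The sketch below assumes that reading; `hyphen_boundary_match` and `best_match` are hypothetical helpers, and the real fireworks_ai lookup may treat other delimiters or prefixes differently:

```python
from typing import Iterable, Optional


def hyphen_boundary_match(known: str, query: str) -> bool:
    """True if `known` appears in `query` as a run of whole hyphen-separated segments."""
    known_parts = known.split("-")
    query_parts = query.split("-")
    width = len(known_parts)
    return any(
        query_parts[i : i + width] == known_parts
        for i in range(len(query_parts) - width + 1)
    )


def best_match(known_names: Iterable[str], query: str) -> Optional[str]:
    """Pick the longest known name that matches on hyphen boundaries, if any."""
    matches = [name for name in known_names if hyphen_boundary_match(name, query)]
    return max(matches, key=len) if matches else None
```

Under this rule `some-model` matches `some-model-ft-123` but not `awesome-model`, and choosing the longest match keeps the earlier "most specific entry wins" behavior.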
Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(greptile): refcount vertex async refresh lock pruning Replace the asyncio.Lock._waiters inspection in _maybe_prune_async_refresh_lock with an explicit refcount so the entry is pruned exactly when no coroutine is holding or waiting on the lock, without depending on any private asyncio internals. * fix(vertex): serialize credentials.refresh() across threads via _sync_refresh_lock refresh_auth is invoked from three call sites that can run on different threads (sync get_access_token, async slow path via asyncify, and the background proactive refresh task). Only the sync path was protected by _sync_refresh_lock, so a concurrent sync + async/background call could invoke google-auth's Credentials.refresh() on the same object from two threads simultaneously, mutating internal credential state. Move the lock acquisition into refresh_auth itself; the lock is an RLock so reentrant acquisition from the sync path remains safe. Co-authored-by: Yassin Kortam <yassin@berri.ai> * refactor(responses): extract shared SSE output-item recovery helpers Both ChatGPTResponsesAPIConfig and LiteLLMResponsesTransformationHandler duplicated the same OUTPUT_ITEM_DONE / OUTPUT_TEXT_DONE recovery algorithm. Move that logic into litellm.responses.sse_output_recovery and have both call sites use the shared helpers, so future fixes apply in one place. 
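The refcount-based lock pruning mentioned in the vertex fix above can be sketched roughly as follows. `RefcountedLocks` is a hypothetical registry written for illustration, not the code in vertex_llm_base; it assumes the count covers every coroutine that is holding or waiting on the per-key lock:

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, Dict, Tuple


class RefcountedLocks:
    """Hypothetical per-key lock registry that prunes entries via an explicit refcount."""

    def __init__(self) -> None:
        # key -> (lock, number of coroutines holding or waiting on it)
        self._entries: Dict[str, Tuple[asyncio.Lock, int]] = {}

    @asynccontextmanager
    async def hold(self, key: str) -> AsyncIterator[None]:
        if key in self._entries:
            lock, count = self._entries[key]
        else:
            lock, count = asyncio.Lock(), 0
        # Register interest before awaiting, so a concurrent release cannot prune the lock.
        self._entries[key] = (lock, count + 1)
        try:
            async with lock:
                yield
        finally:
            lock, count = self._entries[key]
            if count <= 1:
                del self._entries[key]  # last user gone: drop the entry
            else:
                self._entries[key] = (lock, count - 1)

    def __len__(self) -> int:
        return len(self._entries)
```

Because the counter is incremented before the first `await`, no private asyncio attribute such as `_waiters` needs to be inspected, and the registry is empty again once the last caller for a key exits.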
Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(greptile): tie fireworks index cache to model_cost mutation generation * fix: address three bug detection findings - rubrik: use 'is not None' check for tool call IDs to allow empty-string IDs - router: indent mode preservation mutation to match warning conditional - responses transformation: add missing 'continue' after OUTPUT_TEXT_DONE handler Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(router): always preserve existing shared backend mode when deployment mode is None Previously the inner guard 'if _deployment_mode is not None' prevented _shared_model_info['mode'] from being set back to the existing shared mode when the deployment mode was None, which then overwrote the shared backend's mode with None via register_model. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix: address three bug detection findings - vertex_llm_base: guard background refresh's cache write with an identity check so a stale write cannot overwrite a credentials reference replaced by a concurrent reauthentication path. - router: make shared backend mode preservation directional - only preserve when an existing 'responses' mode would be downgraded to 'chat', or when the deployment mode is None (which would otherwise clear the existing mode). Legitimate upgrades now apply. - rubrik: remove unused preserve_events_added_during_flush attribute; RubrikLogger overrides flush_queue, so the base-class flag never applied. Drop the test that exercised the parent path on a Rubrik instance since it does not reflect real flush behavior. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(veria): scope reducto file IDs to current request + register pricing - Reject reducto:// file IDs sent through the proxy /v1/ocr JSON API. The IDs are not bound to a LiteLLM key, so an authenticated user could submit another user's file ID and receive OCR text via the proxy's shared Reducto credentials. 
Force fresh uploads (multipart form or inline base64 data URI) so every OCR call is server-mediated and implicitly bound to the originating request. - Add ocr_cost_per_credit=0.015 to reducto/parse-v3 and reducto/parse-legacy in both pricing JSONs so successful Reducto OCR calls debit key/team spend instead of recording zero. * fix(vertex): always overwrite resolved cache key with fresh credentials After reauthentication or fresh load, the resolved (cache_credentials, project_id) cache key may point to stale credentials from a prior load. Skipping the write when the key existed forced the next request to go through a redundant refresh/reauth cycle. Always overwrite so callers using the resolved project_id hit the fresh credentials object. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(xai): fold reasoning tokens before normalizing usage in streaming chunks The non-streaming transform_response folds xAI's reasoning_tokens into completion_tokens before calling _normalize_openai_compatible_usage_totals, preserving the OpenAI invariant total = prompt + completion. The streaming chunk_parser only ran the normalization, so when xAI streamed usage with reasoning tokens (total = prompt + completion + reasoning), the normalize check (total < prompt + completion) was a no-op and the invariant remained violated. Refactor _fold_reasoning_tokens_into_completion to also accept a raw usage dict (in addition to ModelResponse / Usage) and call it from the streaming chunk_parser before normalization, so streaming and non-streaming paths report usage consistently for reasoning models. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(greptile): cap SSE content_index padding and use multiset tool-id check * fix(rubrik): apply event_hook default when caller passes None initialize_guardrail always passes event_hook=litellm_params.mode, so setdefault never applied its default. When mode is omitted from the guardrail config, event_hook ended up as None instead of post_call. 
Use 'or' to fall back to the intended default when the value is None. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(rubrik): cover event_hook default coercion Regression tests for the case where the upstream caller (initialize_guardrail) passes event_hook=None and the logger should still fall back to post_call, and the sanity case where an explicitly-set non-None event_hook is preserved. * fix: address autofix bugs in chatgpt SSE, vertex token cache, rubrik aclose - chatgpt responses: don't overwrite a meaningful error_message with None when a later RESPONSE_FAILED/ERROR event lacks an error object. - vertex_ai: serve STALE tokens from the lock-free fast path and only schedule a deduplicated background refresh, eliminating per-key lock contention near token expiry. - rubrik: aclose() now closes both async_httpx_client and tool_blocking_client to avoid leaking connections from the dedicated client when the logger shuts down. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(vertex): drop redundant resolved_project rebind in slow path Reusing resolved_project (typed str from the fast path's tuple unpack) for an Optional[str] assignment tripped mypy. Use project_id directly after the None check. * test(team_members): skip flaky test_add_multiple_members The test creates a team via /team/new, adds a member via /team/member_add, then queries /team/info — and intermittently gets a 404 for a team that was just successfully created and mutated. The basic happy path is already covered by test_add_single_member; we only lose the 10-iteration stress loop. * fix(rubrik): cancel periodic flush task on aclose The aclose() method closed both HTTP clients but did not cancel the periodic flush task. After close, the task would wake up every flush_interval seconds and try to POST via the now-closed async_httpx_client, generating recurring errors. Cancel the task and await its termination before closing the clients. 
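The shutdown ordering described above (cancel the periodic task, await it, then close what it depends on) can be sketched as below. `PeriodicFlusher` is a hypothetical class, not the RubrikLogger API, and the `closed` flag stands in for closing the HTTP clients:

```python
import asyncio
import contextlib
from typing import List, Optional


class PeriodicFlusher:
    """Hypothetical logger-like object with a periodic flush task."""

    def __init__(self, flush_interval: float) -> None:
        self.flush_interval = flush_interval
        self.closed = False
        self.flushes: List[str] = []
        self._task: Optional[asyncio.Task] = None

    def start(self) -> None:
        # Must be called from inside a running event loop.
        self._task = asyncio.get_running_loop().create_task(self._run())

    async def _run(self) -> None:
        while True:
            await asyncio.sleep(self.flush_interval)
            if self.closed:
                raise RuntimeError("flush attempted on a closed client")
            self.flushes.append("flush")

    async def aclose(self) -> None:
        # Cancel the periodic task first and wait for it to finish ...
        if self._task is not None:
            self._task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await self._task
            self._task = None
        # ... and only then release the resources the task depends on.
        self.closed = True
```

Awaiting the cancelled task before marking the object closed means no flush can wake up afterwards and hit a closed client.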
Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(rubrik): coerce None default_on to True at init * fix: tighten SSE done parser + rubrik /v1/messages match Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(bedrock): warn when invoke transformation strips output_config The Bedrock Invoke chat and messages transformations strip output_config when neither supports_output_config nor any supports__reasoning_effort flag is set in the model JSON. This was silent; emit a verbose_logger warning when the strip actually removes a present output_config so newly released models (where the JSON entry hasn't caught up yet) surface a clear log line instead of dropping the effort parameter without notice. * fix(rubrik): drop tool_call repr from normalize error to avoid leaking args The TypeError raised in _normalize_tool_calls is caught by apply_guardrail's broad except, which logs the message plus exc_info. Including repr(tc) in the message could expose function arguments (potentially sensitive user data) in the proxy log stream. Type name alone is enough for debugging. * fix: dedupe SSE chunk parser and warn on Fireworks tool drop - Centralize SSE 'data:' chunk parsing in litellm.responses.sse_output_recovery so the ChatGPT Responses transformer and the Responses->Chat-Completions bridge share a single implementation. - Log a warning when get_supported_openai_params drops 'tools' for a fireworks_ai model whose JSON entry sets supports_function_calling=false, so users notice the behavioral change instead of silently losing tools. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(fireworks_ai): demote per-request tool drop warning to debug Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(veria): cap Rubrik retry queue at 10k events with drop-oldest A persistent Rubrik webhook outage previously let authenticated traffic accumulate prompt/response payloads in the in-memory retry queue without bound.
The PR-introduced retry-on-failure behavior in flush_queue() never trims the queue, so under sustained outage and high request volume the proxy can run out of memory. Cap the queue at RUBRIK_MAX_QUEUE_SIZE events (default 10_000) and drop the oldest events when the cap is exceeded. Emit a throttled verbose_logger warning so operators can detect a stuck webhook. * fix(tests): accept either initial event type from xAI realtime xAI's Grok Voice Agent API used to emit 'conversation.created' as the first event over the WebSocket. It has since shipped a fully OpenAI-compatible 'session.created' event (and may still emit the legacy 'conversation.created' on some routes), which breaks the strict-equality assertion in the realtime e2e test: AssertionError: Expected conversation.created, got session.created This is an upstream behavior change, not a regression in our code. Loosen the base realtime test so get_initial_event_type() may return a tuple of acceptable event types, and have the xAI subclass accept both 'conversation.created' and 'session.created'. The OpenAI subclasses keep their single-string contract unchanged. * fix(rubrik): drop RUBRIK_MAX_QUEUE_SIZE env knob, hardcode 10k cap The doc-validation CI scans for os.getenv() calls and requires each key to appear in litellm-docs config_settings.md. Adding the env var here without a matching docs PR fails the docs and code-quality checks, and the extra env-parsing block in __init__ also tripped ruff PLR0915. The hard cap at 10k still bounds memory on a Rubrik webhook outage, which is the actual bug being fixed -- operators don't need to tune this knob to get the safety guarantee. * test(team_members): skip flaky test_duplicate_user_addition Same /team/info 404-after-add_team_member race that already led to test_add_multiple_members being skipped in dedc4022. 
Duplicate-prevention behavior is covered by test_update_team_members_list_duplicate_prevention in tests/test_litellm/proxy/management_endpoints/test_team_endpoints.py, so the e2e proxy variant doesn't add coverage. * fix: bound CustomBatchLogger queue and call super().__init__ in ContextCachingEndpoints Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(rubrik): distinguish malformed tool-blocking response from transient errors Raise a dedicated _MalformedToolBlockingResponseError when the tool blocking service returns an empty 'choices' list, instead of a bare Exception. Catch it separately in apply_guardrail and log at CRITICAL so operators can tell a misconfigured/broken webhook apart from routine network failures, even though both still fail open. Co-authored-by: Yassin Kortam <yassin@berri.ai> * router: clarify shared backend mode preservation flow Add a blank line and a brief comment before the _backend_alias_cost assignment to make it clear that registration runs unconditionally after the optional mode-preservation mutation. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(ci): skip chronically flaky test_spend_logs_with_org_id Same write-then-read race against the spend logs DB as test_spend_logs (already skipped above). /spend/logs?request_id=... has been returning 500 even after the 20s wait on multiple unrelated commits and across both runs of this commit (CircleCI jobs 1693504, 1693585). The PR itself does not touch spend logs. Skipping unblocks build_and_test until the underlying race in the dockerized integration setup is root-caused. Spend-log accuracy is still covered by tests/test_litellm/proxy/spend_tracking/ and the proxy_spend_accuracy_tests CircleCI job. 
--------- Co-authored-by: Kevin Zhao <zkm8093@gmail.com> Co-authored-by: Matthew Lapointe <lapointe683@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Elon Azoulay <elon.azoulay@gmail.com> Co-authored-by: Krrish Dholakia <krrish+github@berri.ai> Co-authored-by: afoninsky <andrey.afoninsky@gmail.com> Co-authored-by: Tai An <antai12232931@outlook.com> Co-authored-by: Joseph Barker <156112794+seph-barker@users.noreply.github.com> Co-authored-by: Maruti Agarwal <88403147+marutilai@users.noreply.github.com> Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: Cursor Bugbot <bugbot@cursor.com> Co-authored-by: Greptile <greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Greptile Reviewer <greptile-apps@users.noreply.github.com>	2026-05-20 21:25:19 -07:00
yuneng-jiang	79a5a7abad	feat(tests): behavior-pinning harness + Key Tier-1 matrix (#28321 ) * test(proxy_behavior): scaffold session-scoped async ASGI client + liveness smoke Slice 2 of the management-endpoints behavior-pinning effort. New top-level dir tests/proxy_behavior/management/ outside every existing pytest glob. conftest.py initialises the proxy app once per session against the DATABASE_URL the harness boots Postgres at, wraps it in httpx.AsyncClient via in-process ASGITransport. The one smoke test asserts /health/liveliness returns 200, which exercises the full FastAPI middleware stack against a real app — no mocks. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): connect prisma via real lifespan; key/generate de-risk Slice 3 of the management-endpoints behavior-pinning effort. The fixture now enters the real FastAPI lifespan (proxy_startup_event) instead of just calling initialize() — that is where prisma_client is connected, password migration is kicked off, and the rest of the startup wiring runs. Tests pin the loop to the session scope so the AsyncClient created in the session fixture and the prisma connection opened in the lifespan share the same loop as the test bodies. New de-risk smoke: POST /key/generate with the master key returns 200, the returned sk- token resolves to a hashed row in LiteLLM_VerificationToken, and the cleartext token is never stored. Proves auth + handler + helper + prisma all wire together end-to-end against a real Postgres. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): seed 8-actor read-world for the authz matrix Slice 4 of the management-endpoints behavior-pinning effort. New ``actors.py`` defines the actor enum + seeds an immutable world (2 orgs, 2 teams, 8 users, 8 verification tokens) under the ``behavior-pin-`` prefix so the rows are identifiable in psql and ``_wipe_world`` is targeted. 
Each actor key is created with its cleartext form generated locally and its hashed form (via ``litellm.proxy.utils.hash_token``) stored in ``LiteLLM_VerificationToken`` — so the real ``user_api_key_auth`` accepts the cleartext bearer token. Roles, ``team_id``, ``organization_id``, and the service-account metadata flag are all set on the seeded rows so the auth layer resolves the same scopes a real proxy would. The session-scoped ``world`` fixture re-seeds at session start (idempotent via wipe-then-create), and the smoke test confirms each of the 8 actor keys can call ``/key/info`` on itself and receive its own row back. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): per-test scratch namespace + targeted delete_many teardown Slice 5 of the management-endpoints behavior-pinning effort. Adds the ``scratch`` function-scoped fixture: each test gets a uuid4-derived namespace prefix, tags writes with it (``key_alias``, ``team_alias``, ``user_id``, ``budget_id``), and the fixture teardown ``delete_many``-s any row whose namespace column starts with that prefix. Cleanup uses Prisma model methods only (no raw SQL, per CLAUDE.md) and orders deletes children-before-parents to avoid FK conflicts. The Slice 3 de-risk smoke is migrated onto the same fixture so it stops accumulating untagged tokens across repeated local runs. Smoke proves both halves of the contract: one test writes a scratch-tagged key and asserts it lands; a second test runs after the first's teardown and asserts no rows in the scratch namespace survived. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): codify G3 (strict-import grep) as a pytest item Slice 6 of the management-endpoints behavior-pinning effort. 
Two new tests walk every .py file under tests/proxy_behavior/ and assert: * no ``from litellm.proxy.management_endpoints`` import — the suite is deliberately constrained to the HTTP boundary so it survives handler refactors; * no ``mock``/``patch`` on ``user_api_key_auth`` — mocking auth is the structural failure mode of the existing 11k-line mock suite, and the point of this harness is that the real auth layer runs. Codifying G3 as a CI test removes the "did someone forget to check the PR-description checklist" failure mode. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * style(proxy_behavior): apply black to G3 grep test Follow-up to 6f588c753b — line-length fixes only, no behavior change. * test(proxy_behavior): pin /key/generate authz matrix (18 scenarios) Slice 7 of the management-endpoints behavior-pinning effort. Parametrized matrix across two axes: actor (8 seeded) × target scope (self, team_alpha in org_a, team_beta in org_b). 18 scenarios after dropping non-applicable combos. Whole-suite wall-time stays at ~4.7s (well under the 10-min G2 budget for the eventual CI job). While pinning, the test surfaced one seed gap: ``_get_user_in_team`` reads ``members_with_roles`` (a JSON list of ``{user_id, role}``), not the plain ``members`` String[]. Both columns are now populated in the seed to match what the real ``/team/new`` handler would produce. Expected status codes are intentionally heterogeneous (200, 400, 401) because the current handler emits different statuses depending on which check fails first (role gate, team-member-perm gate, "not assigned" check). Pinning the observed codes — not what they "should" be — is exactly the regression signal we want. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): pin /key/info authz matrix (24 scenarios) Slice 8 of the management-endpoints behavior-pinning effort. 
8 actors × 3 target keys (own, OWNER's key in org_a, CROSS_ORG_USER's key in org_b) covering self-read, same-team-peer read, and cross-org read. Notable pinned behaviors (intentionally surfaced for review, not "fixed"): * ORG_ADMIN gets 403 on individual key info even within their own org — visibility is scoped to "your own keys" + "your team's keys", not "your org's keys". * Same-team peers (INTERNAL_USER, UNRELATED_SAME_ORG, SERVICE_ACCOUNT) DO see each other's keys. Whether that is desired is for the team to decide; this PR only pins the existing behavior so unintentional changes flip the matrix red. Wall-time is unchanged (~4.3s for the slice on its own). Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): pin /key/list default-visibility matrix (8 scenarios) Slice 9 of the management-endpoints behavior-pinning effort. For /key/list the response IS the matrix: each of the 8 seeded actors calls the endpoint with default filters and the test asserts set-equality between the returned visible-token set (filtered to seeded tokens only, so unrelated rows can't flap the assertion) and a pinned expected actor-set. Pinned default visibility: * PROXY_ADMIN sees all 8 actors' keys. * Every other actor sees only their own key — including ORG_ADMIN (which had broader expectations going in but currently behaves same-as-internal-user for /key/list defaults) and TEAM_ADMIN (no team-aggregation without include_team_keys=true). Future changes that broaden or narrow any single actor's default visibility will turn this matrix red — exactly the regression signal we want. Parameter-driven views (include_team_keys, filters) are deferred to Slice 13 / PR2 follow-up. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): pin /key/update authz matrix + mutation re-read (21 scenarios) Slice 10 of the management-endpoints behavior-pinning effort. 
8 actors × 3 target shapes (self-owned, OWNER-scoped in org_a/team_alpha, CROSS_ORG_USER-scoped in org_b/team_beta) = 21 applicable scenarios. Each test: 1. Master-key-seeds a fresh scratch key with the target's (user_id, team_id) scope (so the read-world stays untouched). 2. Has the actor under test POST /key/update flipping ``models`` to a known marker list. 3. Asserts the status code AND the DB row's ``models`` field (present when 200, unchanged otherwise), so a handler that silently mutates on a denied response surfaces red. Observed gating (pinned, not endorsed): * PROXY_ADMIN bypasses every check. * ORG_ADMIN is blocked by an early role gate, always 401. * Every other (INTERNAL_USER-roled) actor hits one of three failure modes, depending on whether they own the target and whether they're a team admin / member of its team: 403 "user can only create keys for themselves", 403 "only proxy admins, team admins, or org admins", or 401 "team_member_permission_error". Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): pin /key/regenerate authz matrix + rotation contract (22 scenarios) Slice 11 of the management-endpoints behavior-pinning effort. 21 matrix scenarios (8 actors × 3 target shapes, minus the cross_org/owner combo that exists in the seed but isn't applicable) plus one smoke for the ``/key/{key:path}/regenerate`` route registration. On 200 outcomes the test verifies the full rotation contract: * the regenerate response key differs from the old cleartext, * the OLD cleartext returns 401 on a follow-up ``/key/info``, * the NEW cleartext returns 200 on a follow-up ``/key/info``. On denied outcomes the test verifies the OLD cleartext still works, catching any handler that mutates the token row on a failed call. Pinned authz divergence vs /key/update: regenerate routes most denials through the team-member-perm 401 path rather than the role-gate 403 path.
The matrices for both endpoints are now in tree side-by-side, so any future refactor that "harmonises" the codes will turn one of the two red. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * test(proxy_behavior): pin /key/delete authz matrix + post-delete contract (21 scenarios) Slice 12 of the management-endpoints behavior-pinning effort. Mirrors slices 10/11. On success: cleartext can no longer authenticate (handles both hard-delete and soft-delete to LiteLLM_DeletedVerificationToken). On denial: row survives and cleartext still authenticates. Notable behavior gap with /key/update: same-team peers (internal_user, unrelated_same_org, etc.) get 403 on /key/delete for OWNER's key — i.e. cannot delete each other's keys — whereas they CAN read each other's keys (Slice 8). Delete is stricter than read. Pinned as-is. Cumulative whole-suite wall-time is 5.9s for all 128 tests on the local runner — well under the 10-min G2 budget for the CI job in Slice 13. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * ci(proxy-mgmt-behavior): add PR-triggered workflow for the behavior suite Slice 13 of the management-endpoints behavior-pinning effort. New workflow ``test-unit-proxy-mgmt-behavior.yml`` fires ``on: pull_request`` for the same branch set every other proxy unit-test workflow watches (main, litellm_internal_staging, litellm_oss_branch, litellm_*). It delegates to the existing reusable ``_test-unit-services-base.yml`` with ``enable-postgres: true``, which already provisions a postgres:14 service container and runs ``prisma db push`` against it before pytest collects. ``reruns: 0`` because a behavior-pinning matrix that needs reruns is itself a regression — flakes are signal. ``timeout-minutes: 15`` gives generous headroom over the local 5.9s whole-suite wall-time; the binding G2 budget is 10 min. 
Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d docs(proxy_behavior): G4 regression-replay table for Key Tier-1 Slice 14 of the management-endpoints behavior-pinning effort. Documents the regression-replay verification methodology + a 12-row table mapping recent fix-PRs touching key_management_endpoints.py to the catching scenarios in the PR1 matrix. One canonical RED→GREEN cycle is captured verbatim — `c7c3df2b02` "extend /key/update admin check to non-budget fields". Under the parent-of-fix code, 6 scenarios in test_key_update.py flip from 200 to 403; under HEAD code, all 21 pass. The handler swap is the only change between the two runs, confirming the matrix catches the behavior shift the fix introduced. The table also calls out 4 genuine coverage gaps deferred to PR2/PR3: 404-on-missing-key, budget-limit counter assertions, /key/regenerate upperbound enforcement, and /key/list filter-param views. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * chore(mutmut): include the behavior suite in tests_dir + G5 triage stub Slice 15 of the management-endpoints behavior-pinning effort. Appends ``tests/proxy_behavior/management/`` to ``[tool.mutmut].tests_dir`` so the existing mutation-test workflow runs against both the legacy mock suite AND the new behavior suite — the latter is where the regression signal will actually surface. Adds a stub at ``tests/proxy_behavior/management/mutmut_triage/pr1.md`` documenting the G5 triage protocol (zero unreviewed survivors in the 6 Tier-1 handler functions) and a placeholder baseline-metrics table to fill in after the first manually-triggered mutmut run completes — runs take hours and run on a manual cadence, so PR1 ships with the wiring + protocol, not the numbers. The actual baseline is recorded in a follow-up once ``gh workflow run mutation-test.yml`` finishes. The kill rate stays telemetry-only, never a gate. G5 (per-survivor classification) is the binding mutation gate. 
Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * docs(proxy_behavior): suite README with local-repro + conventions + gates Slice 16 of the management-endpoints behavior-pinning effort. The README documents: * The same three commands the CI workflow runs locally (BYO-DATABASE_URL, no new tooling). * Suite layout — what each test file covers, which slice it lands. * The asyncio loop_scope convention required for session fixtures (httpx AsyncClient + prisma connection) to share a loop with each test body. * G3 strict-import convention + the test that enforces it. * Read-world vs scratch-world fixture conventions. * Behavior-pinning philosophy: pin observed codes; flag, don't judge. * Where each G1–G5 + PR1.M1–M3 gate's evidence lives. Plan: https://www.notion.so/36643b8acdab8128a581ced0f6a4744d * ci(proxy-mgmt-behavior): drop xdist (workers=0) to fix seed race First run on PR #28321 failed with UniqueViolation on ``behavior-pin-budget`` plus cascading missing-membership FK errors. Both xdist workers entered ``seed_world()`` concurrently against the shared Postgres service container; whichever lost the race left the world in a half-seeded state and downstream tests ran against missing team_membership rows. Whole-suite wall-time is ~7s sequentially, so disabling xdist here costs nothing — and the seed itself is the wrong place to add per-worker isolation (the world is intentionally shared so set-equality assertions in /key/list have a deterministic expected set). * ci(proxy-mgmt-behavior): seed scratch keys via proxy_admin actor, not master Second CI run failed: ``/key/generate`` with explicit ``user_id`` returned 403 "User can only create keys for themselves. Got user_id=X, Your ID=None" in every test that called ``_create_scratch_key`` with a per-actor user_id. The bare master key's auth path was producing ``user_id=None`` in the fresh CI Postgres, which doesn't trigger the PROXY_ADMIN bypass in ``_user_can_only_create_keys_for_themselves`` reliably. 
Locally the same master key path worked, masking the issue. Fix: every ``_create_scratch_key`` helper now takes a seeder cleartext and the test bodies pass ``world.keys[Actor.PROXY_ADMIN].cleartext``. That actor was seeded with ``user_role=PROXY_ADMIN`` AND a concrete ``user_id``, so the bypass fires deterministically in both environments. No behavior shift in the matrices themselves — all 128 scenarios still pass locally; only the setup helper's auth identity changed. The bare-master smoke (test_smoke + test_scratch_teardown) is intentionally left on the master key path: those tests don't pass ``user_id`` in the body so they don't hit the user_id-mismatch gate. * ci(proxy-mgmt-behavior): diag — run world-seed test first + bump max-failures Third CI run failed identically: seeded PROXY_ADMIN actor's auth resolves to ``user_id=None`` even though the DB row has the right ``user_id``. The suite was aborting at maxfail=10 inside test_key_delete, so test_world_seed (which would tell us whether the seed itself is reachable) never ran in CI. Two diagnostic moves on this push, no behavior change: * Rename ``test_world_seed.py`` → ``test_aaa_world_seed.py`` so it's the first collected file. If it passes in CI we know the seed is fine and the bug lives downstream; if it fails the same way the bug is in the auth resolution path. * Bump ``max-failures`` to 200 for this workflow so we see the full failure surface instead of stopping at the first cascading setup error. Will tighten back down once the suite is green. Adds one new test ``test_proxy_admin_actor_can_create_keys_for_others`` that explicitly exercises the PROXY_ADMIN bypass via /key/generate with an explicit user_id — the same shape the matrix setup helper uses but without the matrix machinery muddying the diagnostic. 
* ci(proxy-mgmt-behavior): await LiteLLM_VerificationTokenView creation in fixture Fourth CI run still failed because the proxy's lifespan kicks off ``prisma_client.check_view_exists()`` as a fire-and-forget background task — that task is what creates ``LiteLLM_VerificationTokenView``, the SQL view ``user_api_key_auth`` queries to resolve a token to its user_id / user_role / team. On a fresh Postgres (CI), the first test races the background task. The view doesn't exist when the first auth call runs, the resolver falls through to a degraded path that returns ``user_id=None``, and every matrix test that depends on the seeded actor's identity then fails confusingly with "Got user_id=X, Your ID=None" 403s. Locally the view persists across pytest runs so the race is invisible. Fix: await ``prisma_client.check_view_exists()`` explicitly inside the session ``proxy_app`` fixture, after the lifespan enters but before the fixture yields. Deterministic regardless of whether the underlying DB is fresh (CI) or warm (local). * ci(proxy-mgmt-behavior): widen diagnostic to dump token / user / view shape The fifth CI run isolated the failure to ``/key/generate`` with explicit user_id while ``/key/info`` works for the same seeded PROXY_ADMIN actor. The auth context's user_id is None even though the DB row has it set. This commit widens the diagnostic test: on failure, dump the raw token row's user_id, the user row's user_role, and what ``LiteLLM_VerificationTokenView`` actually returns for the seeded token. If the view returns user_id=None we know the view shape is the problem; if the view returns the right user_id we know it's a downstream code path stripping it. * ci(proxy-mgmt-behavior): unambiguous diagnostic view query Previous diagnostic's raw SQL had an ambiguous user_id column from joining the view with the user table, so the diagnostic itself crashed before printing useful state. Simplified to query just the view's columns. 
* ci(proxy-mgmt-behavior): add auth-resolver chain diagnostic Six runs and the underlying data (token row, user row, view row) all verified correct in CI, but auth still returns user_id=None. This diagnostic calls the resolver primitives directly: 1. ``prisma.get_data(table_name="combined_view")`` → raw view object 2. ``get_key_object(...)`` → cached/DB UserAPIKeyAuth 3. ``get_user_object(...)`` → LiteLLM_UserTable row 4. ``_is_user_proxy_admin`` / ``_get_user_role`` and prints each intermediate via captured stdout (-s). Whichever step returns None/False in CI is where the chain breaks. Imports come from ``litellm.proxy.auth`` (not management_endpoints), so G3 still passes. * ci(proxy-mgmt-behavior): set LITELLM_MASTER_KEY env so lifespan doesn't wipe it Real root cause of every CI run that returned ``Your ID=None`` for the seeded actors: * In ``initialize()``, ``master_key`` is set from the config YAML's ``general_settings.master_key`` (load_config code path at proxy_server.py:4174). * Then the FastAPI lifespan (``proxy_startup_event``) runs and at line 776 does ``master_key = get_secret_str("LITELLM_MASTER_KEY")``, which UNCONDITIONALLY overwrites the global. * In CI the env var is unset, so the post-lifespan ``master_key`` is None. Downstream every auth path degrades: master-key requests don't bypass because ``secrets.compare_digest(api_key, None)`` raises and is caught to ``is_master_key_valid=False``; seeded-actor requests cache a ``UserAPIKeyAuth`` whose ``user_role`` never resolves through the PROXY_ADMIN bypass; ``_is_allowed_to_make_key_request`` then hits the ``user_id`` mismatch path with ``Your ID=None``. Locally my shell happened to have ``LITELLM_MASTER_KEY`` set from a prior session, which is why every local run was green and CI red — exactly the "don't generalize from your environment to CI" memory. 
Fix: ``os.environ.setdefault("LITELLM_MASTER_KEY", MASTER_KEY)`` and ``os.environ.setdefault("CONFIG_FILE_PATH", config_path)`` before entering the lifespan, so its re-read produces the same value as ``initialize()``. Whole-suite still green locally (130 tests, ~6.4s). * ci(proxy-mgmt-behavior): force premium_user=True so /key/regenerate isn't gated Ninth CI run cleared every ``Your ID=None`` failure (the master_key env fix worked end-to-end) and exposed the next thin layer of failures: ``/key/regenerate`` returns 500 "Regenerating Virtual Keys is an Enterprise feature" in CI because the proxy can't see a ``LITELLM_LICENSE``. Locally my license is set, so the matrix passes. The behavior matrix is supposed to pin authz, not licensing — so flip ``proxy_server.premium_user = True`` directly, both before and after the lifespan (the lifespan re-runs ``_license_check.is_premium()`` and would otherwise reset it). With premium gating disabled, the regenerate matrix exercises the same authz path /key/update does. Whole-suite still green locally (130 tests, ~6.3s). * test(proxy_behavior): trim debug diagnostics, restore default max-failures Followup to the CI-bring-up sequence: now that the suite is green in CI (130 → 129 tests after this trim; 156s wall-time on ubuntu-latest), drop the diagnostic noise left over from debugging the master_key wipe: * Rename ``test_aaa_world_seed.py`` back to ``test_world_seed.py`` — no longer needs to run first. * Remove ``test_auth_resolver_returns_correct_user_id_and_role`` — that test reached into private auth helpers to localize the bug between the DB and ``UserAPIKeyAuth``; it has served its purpose and isn't HTTP-boundary. * Keep ``test_proxy_admin_actor_can_create_keys_for_others`` (without the failure-time dump) — it's a real authz contract that pins the PROXY_ADMIN bypass on /key/generate, and would catch a regression of the same conftest interaction this sequence revealed. 
* Drop the workflow's ``max-failures: 200`` override — that was a debug aid for seeing the full failure surface in CI. Default of 10 is right for a stable suite. * chore(proxy_behavior): drop empty mutmut triage stub, fold protocol into README The mutmut_triage/pr1.md file was a placeholder for numbers and classifications that don't exist yet — the first mutmut run is a manual follow-up. Empty stubs aren't evidence; deleting it. The G5 protocol (run the workflow, triage survivors in the six Tier-1 handler functions, kill-or-accept-with-reason, zero unreviewed) moves into the suite README's "Gate evidence" block. The real triage file will land alongside the first mutmut follow-up. pyproject.toml's [tool.mutmut].tests_dir entry stays — that's the one-line wiring that makes the existing (manual-trigger) mutation-test workflow include our suite next time someone runs it. Comment updated to drop the dead file reference. * chore(proxy_behavior): drop README + trim comments Removes the suite README — its contents (local repro, layout, conventions) were either restated by the file structure or already covered by the workflow YAML and pyproject.toml. Trims docstrings and inline comments across every test file to keep only non-obvious WHY (the masking ``_get_user_in_team`` reads, the LiteLLM_VerificationTokenView models-can't-be-NULL gotcha, the org_admin/peer-visibility surprise, the rotation contract). Suite still 129 green locally. * test(proxy_behavior): address Greptile review — env force, pagination, dedup - conftest: force LITELLM_MASTER_KEY / CONFIG_FILE_PATH unconditionally instead of setdefault. An ambient LITELLM_MASTER_KEY with a different value would make the proxy authenticate on that key while the tests still send MASTER_KEY → silent 401s. - test_key_list: paginate /key/list instead of a single size=100 request. size is capped at 100 by the endpoint, so on a non-fresh DB a single page could truncate PROXY_ADMIN's view and a seeded key could fall off the page.
Walk total_pages. - conftest: hoist the duplicated _create_scratch_key helper (copy-pasted and already diverged across test_key_{update,regenerate,delete}.py) into a single shared create_scratch_key. - Delete regression_replay/README.md — G4 regression-replay evidence belongs in the PR description, not a committed doc file (repo docs policy + the effort's own plan both say so). Content moved to the PR.	2026-05-20 19:27:44 -07:00
yuneng-jiang	1480ec698b	chore(ci): bump versions (#28287) * bump: version 0.4.72 → 0.4.73 * bump: version 1.86.0 → 1.87.0 * uv lock	2026-05-19 15:10:37 -07:00
yuneng-jiang	cf9b5e4fa7	[Infra] Bump versions (#28094) * bump: version 0.1.40 → 0.1.41 * bump: version 1.85.0 → 1.86.0 * add uv lock	2026-05-16 18:31:43 -07:00
Yassin Kortam	014cb8fa9d	feat: add componentized proxy deployment with gateway, backend, ui, and migrations (#27557) Split the monolithic LiteLLM proxy into independently scalable Kubernetes components to allow separate horizontal scaling of the LLM data plane and management API surfaces. - Add DatabaseURLSettings pydantic-settings model that assembles DATABASE_URL (and optional DATABASE_URL_READ_REPLICA) from discrete DATABASE_* env vars before Prisma initializes, supporting both IAM token auth (minting short-lived RDS tokens) and password auth; replaces the CLI-only path that componentized entrypoints bypass - Add gateway component (port 4000) that trims the proxy route table to the LLM data-plane surface (chat, embeddings, completions, audio, realtime, provider passthroughs, health/metrics) via an allowlist applied inside the lifespan context so plugin-registered routes are captured - Add backend component (port 4001) that exposes the management/admin surface (keys, users, teams, orgs, spend analytics, model management, SSO, audit logs) with a complementary allowlist - Add ui component — Next.js static export served by nginx (port 3000) with RSC payload routing, asset prefix aliasing, and SPA fallback for dashboard routes - Add migrations component with dedicated Dockerfile that runs prisma migrate deploy via a Helm pre-install/pre-upgrade Job, eliminating per-pod schema contention on the Prisma advisory lock - Add Helm chart (helm/litellm) with separate Deployments, Services, HPAs, and ConfigMap for each component; shared _helpers.tpl emits DATABASE_, IAM_TOKEN_DB_AUTH, REDIS_, and DISABLE_SCHEMA_UPDATE env vars from chart values; ingress template routes traffic to the correct component by path prefix - Add comprehensive tests for DatabaseURLSettings covering IAM auth, password auth, read replica fallbacks, operator-pinned URL preservation, and percent-encoding; add coverage test asserting gateway + backend allowlist union equals the full proxy route set - Add
pydantic-settings>=2.14.1 as a proxy extra dependency and update liccheck allowlist Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>	2026-05-16 09:25:17 -07:00
ryan-crabbe-berri	867470fd68	ci(mutmut): enable mutate_only_covered_lines to fit in CI budget (#27910) The mutation-test workflow timed out at the 350-minute job cap when running whole-folder mutation against litellm/proxy/management_endpoints/ (~30 files, ~1.5 MB of source). Every mutant was running the full test suite, and mutants were generated for lines no test covers — which would survive regardless, just wasting compute. mutmut 3.x's mutate_only_covered_lines setting runs the suite once up front to compute coverage, then skips mutating uncovered lines. This cuts the mutant count dramatically and is the right semantic for the score (no test → no kill possible → uncountable). Per-mutant test filtering by function name is already automatic in mutmut 3.x; no external coverage step is needed.	2026-05-14 10:10:38 -07:00
Yuneng Jiang	0aa439d919	bump: version 0.4.71 → 0.4.72	2026-05-13 21:51:11 -07:00
ryan-crabbe-berri	be84d5cd7d	ci: add manually-triggered mutation testing workflow (#27576) * ci: add manually-triggered mutation testing smoke workflow Adds a workflow_dispatch-only GitHub Actions workflow that runs mutmut against a single source/test pair (router_settings_endpoints) to validate the tooling end-to-end before scaling. The workflow reinstalls litellm non-editable so the mutants/ sandbox is not shadowed by the editable .pth on sys.path, and sets PYTHONPATH so the trampolined sandbox copy wins over site-packages. mutmut itself is pulled in via uv run --with so it does not appear in uv.lock or affect the shared dev environment. Includes a temporary push: trigger scoped to this branch so we can iterate before the workflow file lands on the default branch — to be removed before merging (workflow_dispatch only requires the file on the default branch to surface the manual trigger button). * ci(mutation): disable rerun and xdist plugins for mutmut runs mutmut's in-process pytest.main() call hits `INTERNALERROR: no option named 'filtered_exceptions'` from pytest-retry's pytest_configure hook. Reruns are also wrong for mutation testing — a "failed" mutant test that gets retried would mask which mutants are killed vs. survive. Disable retry, rerunfailures, and xdist via pytest_add_cli_args in [tool.mutmut]. * ci(mutation): uninstall pytest-retry before mutmut runs `-p no:retry` (and similar names) didn't match pytest-retry's entry-point name, so the plugin still loaded and crashed during mutmut's "Running clean tests" phase. Uninstalling the package is surgical and doesn't depend on guessing the entry-point name. * ci(mutation): emit per-survivor diffs to run-page summary + artifact The previous artifact only contained `mutmut results` text (which in mutmut 3.x lists survivor names but not the actual mutations). Adds: - `mutmut export-cicd-stats` to produce mutmut-cicd-stats.json with the killed/survived/total scoreboard.
- `mutmut show <name>` per surviving mutant to capture each mutation as a unified diff. - A `mutmut-report.md` that combines summary + run-progress tail + per-survivor diffs, written to both the artifact and $GITHUB_STEP_SUMMARY (visible on the run page, no download needed). - Corrected artifact paths: stats files live under mutants/, not the project root. - The trampolined source file from the sandbox so survivors can be inspected even outside `mutmut show`. * ci(mutation): document intended manual weekly cadence in trigger comment * ci(mutation): generate ACH-style report with embedded function bodies Replaces the inline bash markdown generation with a Python script that: - Groups survivors by function (one section per function, function body shown once per section, surviving mutants nested as subsections) - Embeds each enclosing function's source via Python AST (so the agent has full context, not just a 3-line `mutmut show` diff) - Inlines the existing test file(s) listed in [tool.mutmut].tests_dir - Writes an ACH-style task description at the bottom following the prompt template from arXiv 2501.12862 Output goes to mutation-report.md (artifact) and the head of the file is appended to $GITHUB_STEP_SUMMARY for at-a-glance visibility. * fix(mutation report): correctly parse function names with leading underscores mutmut's mutant-name prefix is x_ (single underscore), so a function named _foo produces mutants x__foo__mutmut_N. The previous regex \.x__(.+)__mutmut_ ate the function's leading underscore as part of the prefix. Changed to \.x_(.+)__mutmut_ so leading underscores are preserved in the captured function name; verified for normal, leading-underscore, and dunder-method names.
* feat(mutation report): full Meta ACH-style rendering with MUTANT delimiters For each surviving mutant, parse the mutmut sandbox trampoline file and render the mutated function as it appears in the source — with the differing lines wrapped in `# MUTANT START` / `# MUTANT END` comments, matching the format from Meta's ACH paper (arXiv 2501.12862, Table 1). Renames the function header back to its original name so the agent sees the function as it would appear in the file. Falls back to the unified diff if the trampoline lookup fails. Handles replace, insert, and delete diff ops; uses difflib's SequenceMatcher to find the differing line ranges. The unified diff is preserved in a collapsible <details> block as secondary context. * ci(mutation): scope to whole management_endpoints folder, drop temp push trigger Final scope before merge: - paths_to_mutate / tests_dir broadened from one file to the entire management_endpoints source/test folders - Trigger is now `workflow_dispatch` only — the temporary push: block used during workflow iteration is removed - timeout-minutes bumped from 60 to 350 (just under the GH-hosted job cap of 360); whole-folder mutation against ~15 files / ~7.5k LOC can take a few hours - Artifact path for the trampoline files glob-expanded to cover all files under mutants/litellm/proxy/management_endpoints/ * fix(mutation report): warn when multiple functions in a file share a name Addresses the Greptile review concern: ast.walk's first-match-wins behavior could embed the wrong function body when a file defines the same name in multiple places (e.g., a module-level helper and a class method). mutmut's mutant identifier does not carry class context, so we can't always determine which definition was mutated. find_function_in_file now returns the start line of every matching definition; render() surfaces a "Note: N functions named X" warning in the report when there is more than one match. 
The first match is still embedded as the body — the warning tells the reader to verify manually instead of silently using the wrong context. Smoke-tested against the existing artifact: single-match files render unchanged. * Fix mutation report anchors * Fix mutation report TOC anchors --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com>	2026-05-11 15:19:57 -07:00
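The leading-underscore regex fix in the commit above is easy to check by hand. The sketch below uses the two patterns quoted in the commit message; the mutant identifiers and the `function_name` helper are hypothetical examples shaped after the `x_<function>__mutmut_N` naming the message describes, not output copied from a real mutmut run.

```python
import re
from typing import Optional

# Previous pattern: treats "x__" as the prefix, swallowing one underscore.
OLD_PATTERN = re.compile(r"\.x__(.+)__mutmut_")
# Fixed pattern: the prefix is "x_" only, so leading underscores survive.
NEW_PATTERN = re.compile(r"\.x_(.+)__mutmut_")


def function_name(mutant_name: str, pattern: re.Pattern = NEW_PATTERN) -> Optional[str]:
    """Return the function name embedded in a mutant identifier, if any."""
    match = pattern.search(mutant_name)
    return match.group(1) if match else None


# Hypothetical identifiers for a normal, a private, and a dunder function.
normal = "pkg.mod.x_bar__mutmut_1"
private = "pkg.mod.x__foo__mutmut_3"
dunder = "pkg.mod.x___init____mutmut_2"

new_names = [function_name(name) for name in (normal, private, dunder)]
old_names = [function_name(name, OLD_PATTERN) for name in (normal, private, dunder)]
```

Under these assumptions the old pattern fails to match the normal name at all and strips one underscore from the other two, which is the misparse the fix addresses.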
Yuneng Jiang	8686001b3b	build(packaging): raise jinja2 floor to 3.1.6 Our `uv.lock` already resolves jinja2 to 3.1.6, so Docker / CI installs get that version. The `pyproject.toml` floor was lagging at 3.1.0, which means downstream consumers using `--resolution=lowest-direct` or older constraint files can land on 3.1.0-3.1.5 instead of the version we actually test against. Aligns the declared floor with the resolved version so external installers see the same baseline our test matrix exercises. `uv lock` diff is metadata-only (no resolved-version drift).	2026-05-09 13:50:22 -07:00
Yuneng Jiang	44aecb6f66	bump: version 1.84.0 → 1.85.0	2026-05-07 16:28:33 -07:00
Yuneng Jiang	9ae9b81c1b	Merge remote-tracking branch 'origin/litellm_internal_staging' into litellm_/nifty-kilby-82870d # Conflicts: # uv.lock	2026-05-07 16:10:22 -07:00
Sameer Kankute	e912e6d4ff	feat(audio_transcription): add NVIDIA Riva STT provider (#27185) * feat(audio_transcription): add NVIDIA Riva STT provider Adds nvidia_riva as a new audio transcription provider, supporting both NVCF-hosted and self-hosted Riva ASR deployments via gRPC streaming. - Auto-resamples input audio to 16 kHz mono LINEAR_PCM (soundfile + numpy, audioread fallback) so callers can send any common format. - Maps OpenAI params: language (en -> en-US), response_format (text/json/verbose_json), timestamp_granularities=["word"] -> enable_word_time_offsets, word offsets converted ms -> s for verbose_json. - Auth: NVCF when nvcf_function_id is set (SSL on by default), self-hosted otherwise (SSL off by default), with explicit use_ssl override. - gRPC errors wrapped via NvidiaRivaException -> litellm exception classes. - Optional deps gated behind [stt-nvidia-riva] extra (nvidia-riva-client, soundfile, audioread, numpy). Co-authored-by: Cursor <cursoragent@cursor.com> * fix(nvidia_riva): address PR review feedback - handler: forward call-level `timeout` to streaming_response_generator (kwarg-detected via inspect for older riva-client compat) so a stalled Riva server cannot block the caller indefinitely. - audio_utils: spill bytes to a tempfile before audioread.audio_open; most audioread backends (FFmpeg, GStreamer) require a real filesystem path and previously raised TypeError on BytesIO, breaking the mp3/m4a fallback path. - audio_utils: prefer soxr / scipy.signal.resample_poly for resampling (anti-aliased polyphase) when installed, falling back to linear only as a last resort. Avoids aliasing on 44.1/48 kHz -> 16 kHz downsamples. - transformation: bare `es` now maps to es-ES (Castilian) instead of es-US, matching BCP-47 conventions.
Co-authored-by: Cursor <cursoragent@cursor.com> * chore: trigger CI re-run [stabilize loop 1/3] * Update litellm/llms/nvidia_riva/audio_transcription/transformation.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * chore: trigger CI re-run [stabilize loop 1/3] * fix code qa * fix lint * fix mypy * fix mypy * Fix NVIDIA Riva ASR service lookup * Fix NVIDIA Riva transcription payload logging --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: oss-pr-review-agent-shin[bot] <281797381+oss-pr-review-agent-shin[bot]@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>	2026-05-05 17:17:51 -07:00
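The parameter mapping described in the NVIDIA Riva commit above (bare language codes expanded to a regional tag, word offsets converted from milliseconds to seconds) might look roughly like the following. This is an illustrative sketch only: the lookup table holds just the two mappings the commit message names, and the function names and the `start_ms` / `end_ms` field names are assumptions, not the contents of `transformation.py`.

```python
from typing import Dict, List, Union

# Assumed defaults; only the two mappings named in the commit message.
BARE_LANGUAGE_DEFAULTS: Dict[str, str] = {"en": "en-US", "es": "es-ES"}


def to_riva_language_code(language: str) -> str:
    """Expand a bare language code; pass fully qualified tags through as-is."""
    return BARE_LANGUAGE_DEFAULTS.get(language, language)


def word_offsets_to_seconds(
    words: List[Dict[str, Union[str, int]]],
) -> List[Dict[str, Union[str, float]]]:
    """Convert hypothetical millisecond word offsets to seconds."""
    return [
        {
            "word": w["word"],
            "start": w["start_ms"] / 1000.0,
            "end": w["end_ms"] / 1000.0,
        }
        for w in words
    ]


converted = word_offsets_to_seconds(
    [{"word": "hello", "start_ms": 250, "end_ms": 1500}]
)
```

Keeping the expansion in a table rather than in branching code makes a change like the `es` default a one-line edit.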
Yuneng Jiang	201fa5d42b	[Infra] Packaging: Drop floor-check workflow + bound importlib-metadata Removing the new check-dependency-floors.yml workflow. It only fires when pyproject.toml changes, which is rare; for those PRs, a maintainer can run the same check by hand with one command. Documented that command in a pyproject.toml comment next to the deps. Also adds the missing upper bound on importlib-metadata (>=8.0.0,<9.0) for consistency with every other entry in the list.	2026-05-05 16:51:56 -07:00
yuneng-jiang	e84282b7b3	[Infra] Bump deps (#27157) * bump: version 0.4.70 → 0.4.71 * bump: version 0.1.39 → 0.1.40 * uv lock	2026-05-05 15:58:05 -07:00
Yuneng Jiang	eff0f8c630	[Infra] Packaging: Bump compiled-dep floors for cp313 wheel coverage CI matrix on Python 3.13 caught three floors that predate cp313 prebuilt wheels and would force users into a Rust/C build: - tiktoken: 0.7.0 -> 0.8.0 (cp313 wheels start at 0.8) - tokenizers: 0.20.0 -> 0.21.0 (cp313 wheels start at 0.21; sdist's pyproject.toml pre-0.21 is also malformed for modern build backends) - pydantic: 2.5.0 -> 2.10.0 (pydantic-core cp313 wheels start at 2.27, shipped with pydantic 2.10) Verified locally on Python 3.10 and 3.13: install at lowest-direct + import litellm + import every openai-namespace symbol the codebase uses all pass.	2026-05-05 15:48:59 -07:00
Yuneng Jiang	3d55afe38b	[Infra] Packaging: Relax Core Runtime Pins To Ranges The 12 core `[project.dependencies]` entries in pyproject.toml were exact `==` pins, a side effect of the Poetry → uv migration. This forces every downstream package that lists litellm as a dependency to downgrade common runtime libraries (openai, pydantic, aiohttp, click, jsonschema, ...) to the exact versions we ship. Customers have flagged this as a coexistence blocker. Switch to lower-bounded ranges with upper bounds where the upstream package is pre-1.0 or has a known breaking-major-version policy. Reproducibility for our Docker proxy and CI continues to come from `uv.lock`, which is regenerated here as a metadata-only diff (no resolved versions or hashes change). Inspired by #26157 (which got stranded on `litellm_oss_staging_04_21_2026` when the forward-merge to internal staging in #26216 was closed). Floors in this PR are tighter than #26157's: they were validated by installing litellm at `--resolution=lowest-direct` and importing the openai-namespace symbols the codebase actually uses. Floor highlights vs #26157: - openai >= 2.20 (was 2.0) — Responses API symbols + `Omit` need a 2.x mid-range floor - httpx >= 0.28, < 1.0 (was no upper) — pre-1.0 - importlib-metadata >= 8.0 (was 6.0) — stay in tested major - tokenizers >= 0.20, < 1.0 (was 0.19, no upper) — pre-1.0 - aiohttp >= 3.10, < 4.0 (was no upper) — bound major - pydantic >= 2.5, < 3.0 — kept - All other floors: keep tested major, add upper bound Adds a `check-dependency-floors.yml` GitHub Actions workflow that installs litellm at `--resolution=lowest-direct` on Python 3.10 and 3.13 and import-checks every openai symbol the codebase uses, so a future floor regression fails fast in CI rather than silently in the field.	2026-05-05 15:45:13 -07:00
user	bfdd786962	chore(deps): refresh dependency locks	2026-05-04 11:36:18 -07:00
Mateo Wang	05439530c2	Merge branch 'litellm_internal_staging' into litellm_vcr-cassette-llm-tests-af37	2026-05-01 14:37:48 -07:00
Yuneng Jiang	dd549d9c50	bump: version 0.4.69 → 0.4.70	2026-04-30 21:39:37 -07:00
Cursor Agent	05333e42ba	tests(llm_translation): switch to pytest-recording for marker-based bulk capture Per Yuneng's feedback, use a single @pytest.mark.vcr marker so one record sweep populates cassettes for every marked test across all providers, instead of forcing each test to bind to a hard-coded cassette path. Changes vs. the initial scaffolding: - Add 'pytest-recording==0.13.4' on top of vcrpy. Adopt its layout: cassettes live at 'cassettes/<test_module>/<test_name>.yaml', resolved automatically. New tests just decorate with '@pytest.mark.vcr' — no imports or path bookkeeping. - Move the shared filter/match config into a 'vcr_config' fixture in 'tests/llm_translation/conftest.py' (consumed by pytest-recording for every marked test in the dir). Drop the standalone 'vcr_config.py'. - Bulk record / replay via the standard '--record-mode' CLI flag: 'make test-llm-translation-record' now sweeps every '@pytest.mark.vcr' test under tests/llm_translation in one shot. Optional 'TARGET=' var scopes to a single file. - Move existing cassettes to the per-test paths and update the local in-process Anthropic regenerator to write to the same paths. - Refresh README + Makefile target docs to match the sweep workflow. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>	2026-04-30 18:08:57 +00:00
Cursor Agent	94b319c577	tests(llm_translation): add VCR cassette infrastructure for offline replay Live LLM e2e tests have been draining provider billing accounts and going flaky on outages (LIT-2683). This change introduces vcrpy-backed cassette replay so CI can exercise the same end-to-end LiteLLM transformation paths without hitting the live provider: - Add 'vcrpy==8.1.1' to the dev dependency group. - New 'tests/llm_translation/vcr_config.py' centralises the VCR config: filters auth/secret headers and per-request response headers, matches on method+URI+body, and exposes 'LITELLM_VCR_RECORD_MODE' for re-recording. - New 'tests/llm_translation/test_anthropic_completion_vcr.py' demonstrates the pattern with one non-streaming and one streaming Anthropic test that replay from cassettes shipped under 'cassettes/'. - New 'tests/llm_translation/cassettes/_record_anthropic_fixtures.py' lets contributors regenerate the canned Anthropic cassettes against a local in-process mock (no API key required), and 'cassettes/README.md' documents the full record/replay/refresh workflow. - New 'make test-llm-translation-record FILE=...' Makefile target to refresh cassettes against the live API. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>	2026-04-30 00:45:50 +00:00
Yuneng Jiang	f8bb29aebf	bump: version 1.83.14 → 1.84.0	2026-04-28 17:43:17 -07:00
Yuneng Jiang	4ec3ac28a6	bump: version 1.83.13 → 1.83.14	2026-04-25 19:32:21 -07:00
Yuneng Jiang	92c1e9b63c	bump: version 0.1.38 → 0.1.39	2026-04-25 19:31:49 -07:00
Yuneng Jiang	67628a60c3	bump: version 0.4.68 → 0.4.69	2026-04-25 19:30:33 -07:00
user	4d74a30412	chore(deps): fix brace-expansion pin and revert risky dev bumps - Dockerfile: pin the unscoped `brace-expansion@5.0.5` alongside `@isaacs/brace-expansion@5.0.1`. The scoped package only has 5.0.0 and 5.0.1 published; CVE-2026-33750's fix (5.0.5) is on the unscoped package which npm also vendors. The override loop now swaps both. - Revert `black` 26.3.1 -> 24.10.0, `pytest` 9.0.3 -> 8.3.5, and `pytest-asyncio` 1.3.0 -> 1.2.0. The major-version bumps cause CI lint (black reformats hundreds of files) and code-quality (liccheck.ini has no entry for the new versions) failures. Both CVEs are dev-only; skipping leaves no runtime exposure.	2026-04-24 00:37:07 +00:00
user	fed1a14646	chore(deps): bump vulnerable dependencies Closes Nexus IQ policy violations and open Dependabot alerts for shipped Python deps and runtime-stage npm pins in the Docker image.	2026-04-24 00:36:59 +00:00
Yuneng Jiang	29e30d9ddb	bump: version 1.83.12 → 1.83.13	2026-04-23 16:58:17 -07:00
Yuneng Jiang	9f46d838fd	bump: version 1.83.11 → 1.83.12	2026-04-22 18:21:47 -07:00
Yuneng Jiang	3ddb3cbdf6	bump: version 0.4.67 → 0.4.68	2026-04-22 18:20:21 -07:00
yuneng-jiang	e3ed136f52	Merge pull request #26209 from BerriAI/yj_bump_apr21_2 [Infra] Bump version	2026-04-21 18:29:41 -07:00
Yuneng Jiang	5837d4a9ac	bump: version 1.83.10 → 1.83.11	2026-04-21 18:10:31 -07:00
SwiftWinds	11b776935d	chore: make `uv` newer than 0.10 allowable	2026-04-21 11:39:11 -07:00
ishaan-berri	2f22a1293e	bump litellm-proxy-extras to 0.4.67 (#26043) * bump litellm-proxy-extras version to 0.4.67 * bump litellm-proxy-extras pin to 0.4.67 in litellm pyproject * regenerate uv.lock for litellm-proxy-extras 0.4.67 * bump litellm-enterprise version to 0.1.38 * bump litellm-enterprise pin to 0.1.38 in litellm pyproject * regenerate uv.lock for litellm-enterprise 0.1.38	2026-04-18 19:03:56 -07:00
Yuneng Jiang	4d63a1367e	bump: version 1.83.9 → 1.83.10	2026-04-18 18:31:24 -07:00
Yuneng Jiang	9bdb3b1772	chore: lower python floor from 3.11 to 3.10 All three dependency bumps in this PR resolve on Python 3.10, so there is no need to jump the floor all the way to 3.11. Also restore the py3.10-specific lunary==1.4.36 pin that was collapsed when the floor was temporarily at 3.11.	2026-04-18 12:50:04 -07:00
Yuneng Jiang	d1e665742b	chore: drop stale python_version markers after floor raise Now that requires-python starts at 3.11, the "python_version >= '3.9'" and ">= '3.10'" markers are unconditionally true, and the "< '3.10'" entries for psycopg, Pillow, pyarrow, langchain, lunary, and pylint can never resolve. Drop the dead markers and remove the unreachable pins so the dependency list reflects what actually gets installed.	2026-04-18 12:31:53 -07:00
Yuneng Jiang	1c29c5e903	chore: bump proxy deps and raise python floor to 3.11 Bumps orjson, fastapi-sso, and python-multipart to their latest releases in the proxy extra, and raises the project python floor to 3.11 so the updated pins can resolve. CI already runs on 3.11 / 3.12 / 3.13 and the Docker images ship python 3.13, so the floor change aligns the declared support range with what is actually tested and shipped.	2026-04-18 12:16:35 -07:00
yuneng-jiang	bf7b7f7f60	Merge pull request #25872 from BerriAI/yj_bump_apr16_2 bump: version 1.83.8 → 1.83.9	2026-04-16 17:56:44 -07:00
yuneng-jiang	f07aadc3f9	Merge pull request #25873 from BerriAI/yj_extras_bump_apr16 bump: proxy extras version 0.4.65 → 0.4.66	2026-04-16 17:56:33 -07:00

1891 Commits