litellm

Author	SHA1	Message	Date
yuneng-jiang	7899463c6a	fix(callbacks): forward callback_settings to callback initializers and guard consumers against non-dict values (#30161 ) * fix(datadog): pass callback_specific_params so DatadogCostManagementLogger receives cost_tag_keys (#29590) * fix(datadog): pass callback_specific_params so DatadogCostManagementLogger receives cost_tag_keys * test(proxy): regression test that load_config forwards callback_specific_params * fix(proxy): guard lakera_prompt_injection callback_specific_params against non-dict Addresses review feedback: forwarding callback_settings as callback_specific_params (so DatadogCostManagementLogger receives cost_tag_keys) exposed the lakera_prompt_injection branch, which did lakeraAI_Moderation(callback_specific_params ["lakera_prompt_injection"]) with no type guard. A config like `callback_settings: {lakera_prompt_injection: "any-string"}` then hit `"any-string"` -> TypeError: argument after ** must be a mapping, not str. Guard the lakera branch with isinstance(dict), matching the existing presidio and datadog_cost_management branches (non-dict values fall back to {}). Add a regression test asserting initialize_callbacks_on_proxy ignores a non-dict value instead of crashing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: inject fake lakera_ai module to avoid importing the real one CI fix for the lakera regression test: it stubbed litellm.proxy.proxy_server with a SimpleNamespace and then monkeypatch.setattr'd the real lakera_ai module, which forces importing it — and lakera_ai does `from litellm.proxy.proxy_server import LiteLLM_TeamTable`, absent on the stub -> ImportError under proxy-infra tests. Inject a fake lakera_ai module into sys.modules instead, so the callbacks branch's `from ...lakera_ai import lakeraAI_Moderation` resolves to the stub without loading the real module. The guard under test (isinstance(dict) in the lakera branch) is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(callbacks): guard compression/websearch interceptors against non-dict callback_settings (#30153) #29590 forwards the full callback_settings dict into initialize_callbacks_on_proxy, which activates the compression_interception and websearch_interception consumers. Their initialize_from_proxy_config read the callback_settings subkey without an isinstance(dict) guard, so a non-dict value such as `compression_interception: true` reached from_config_yaml(...).get(...) and aborted proxy startup with AttributeError. #29590 added that guard for lakera_prompt_injection but not for these two Mirror the isinstance(dict) guard already used by the lakera, presidio, and datadog branches so a non-dict value is ignored and the callback initializes with defaults. A parametrized test feeds every callback_settings consumer a non-dict value through initialize_callbacks_on_proxy to catch a future consumer that forgets the guard * fix(callbacks): normalize non-dict callback_specific_params to empty dict A blank callback_settings: key in YAML loads as None, and config.get('callback_settings', {}) returns None because dict.get only falls back to the default when the key is absent. Forwarding that value verbatim to initialize_callbacks_on_proxy made the first '<name>' in callback_specific_params membership test raise TypeError: argument of type 'NoneType' is not iterable, aborting proxy startup. Same failure for any non-dict root such as callback_settings: true. Normalize the value at the function boundary so both callsites (and any future ones) initialize callbacks with their defaults instead of crashing. --------- Co-authored-by: Hedi Daoud <150018939+hdaoud23@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-10 15:22:00 -07:00
Mateo Wang	20e453f698	feat(cli): per-agent `lite claude` / `codex` / `opencode` commands that wrap coding agents through the proxy (#29850 ) * feat(cli): add `litellm-proxy run -- <agent>` to wrap coding agents through the proxy Wraps Claude Code, Codex, OpenCode, and any other coding agent so all of its LLM traffic routes through a LiteLLM proxy, with the agent-vault style of "just works" DX: one `run -- <agent>` command, auto SSO login when interactive, env-key "agent mode" for containers/CI, and a fail-fast key check against the proxy so bad credentials error immediately instead of deep inside the agent. The wrapped binary is detected by name to pick the right variables. Claude Code gets ANTHROPIC_BASE_URL (the bare proxy root, so it appends /v1/messages) and ANTHROPIC_AUTH_TOKEN, with any stray ANTHROPIC_API_KEY cleared so the proxy token wins. Codex and OpenCode get OPENAI_BASE_URL (proxy + /v1) and OPENAI_API_KEY. Unrecognized commands get both sets so they work either way. `litellm-proxy claude-code` remains as a shortcut for `run -- claude`. The core logic is split into dependency-injected helpers (agent_profile, build_agent_env, verify_proxy_key, run_agent) so env wiring, the preflight, and the launch handoff are unit-tested without monkeypatching, alongside CliRunner tests for auth resolution, agent mode, and auto-login. Mutation-tested the env profiles, preflight, and agent-mode branch to confirm the tests fail when the behavior is broken. https://claude.ai/code/session_0154VpLXW7mMvk5wfbgPRJa6 * Make each coding agent its own litellm-proxy command Replace the `run -- <agent>` interface and the `claude-code` shortcut with top-level commands generated per known agent, so launching is just `litellm-proxy claude`, `litellm-proxy codex`, or `litellm-proxy opencode`, with everything after the agent name forwarded straight to it. This drops the ceremony of `run --` and cuts typing. The `--model`/`--small-fast-model` wrapper flags are gone; pass the agent's own model flag instead, or export the model env vars (the wrapper preserves what you already have set), which keeps the surface minimal and avoids intercepting flags the agent owns. Rename the module to agents.py to match. * fix(cli): route `litellm-proxy codex` through the proxy via a custom provider Codex ignores OPENAI_BASE_URL (it always dials api.openai.com over the Responses WebSocket transport), so the OpenAI env profile alone left `litellm-proxy codex` talking to OpenAI directly instead of the proxy. Point Codex at the proxy with a custom provider passed as `-c` config overrides, and force the HTTP/SSE Responses transport with supports_websockets=false since the proxy does not speak the Responses WebSocket protocol. The provider reads its key from OPENAI_API_KEY, which the agent env already exports. The overrides are injected ahead of the user's args so they precede Codex's subcommand. Claude Code and OpenCode are unaffected; they honor the exported env vars. Adds regression tests for the per-agent launch args and the injection ordering. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * Rename litellm-proxy CLI command to lite The proxy management CLI was invoked as litellm-proxy, which is a lot to type for an everyday command. Rename the console script entry point to lite and update the in-CLI usage examples, help text, error messages and docs to match. * fix(sso): stop CLI auth success page from hanging on "Closing..." The CLI opens the SSO success page with webbrowser.open, so the tab is not script-opened and the browser refuses window.close(). The countdown would end on "Closing..." and the tab would sit there forever. Drop the countdown and just show "You can now close this window and return to your terminal." from the start, while still attempting window.close() once so the tab auto-closes in the rare case the browser allows it. Add a regression test asserting the manual-close instruction is always present and the misleading countdown/"Closing..." text is gone. * fix(cli): reattach controlling terminal after SSO login, keep litellm-proxy alias When the first `lite claude` has to log in via browser SSO, completing the login could leave stdin detached from the terminal, so a TUI agent like Claude Code would start in non-interactive mode and exit with "Input must be provided". The wrapper now reopens the controlling terminal onto stdin just before handoff when the session started interactively; piped or redirected input is detected up front and left alone, so agent-mode and non-interactive use are unchanged. Also keep the `litellm-proxy` console script as an alias for `lite` so existing scripts and CI that invoke `litellm-proxy` keep working; both names map to the same CLI. * feat(install): make the curl installer need only curl, not a pre-existing Python The installer now lets uv provision a managed Python 3.13 when no suitable interpreter is found, instead of aborting. The minimum is also bumped from 3.9 to 3.10 to match the package's requires-python (>=3.10), so a system Python 3.9 is no longer selected only for uv tool install to reject it. * feat(cli): add thin litellm[cli] install path (install-cli.sh + brew) for the lite CLI On a developer laptop the `lite` CLI only needs `lite login` and running coding agents through a proxy, but the sole install path was `litellm[proxy]`, which drags in the whole server tree (fastapi, uvicorn, boto3, polars, cryptography, litellm-enterprise). The CLI's heavy imports are all guarded, so it runs on the base SDK plus just rich, pyyaml and requests. Add a `cli` extra carrying exactly those three, a `scripts/install-cli.sh` curl one-liner that installs `litellm[cli]`, and a `BerriAI/homebrew-litellm` tap formula with a release runbook under `packaging/homebrew/`. The installer passes no `--python`, so uv honours litellm's requires-python and provisions a managed interpreter, skipping a too-old (3.9) or too-new (3.14+) system Python instead of failing to resolve. A pyproject thin-contract test asserts the `cli` extra keeps the deps the CLI imports and never leaks a server-only dependency from `proxy`, so the laptop install cannot silently re-bloat * fix(install): let uv pick the Python via --python-preference system Both installers detected a system Python with a floor-only check and forced it with `uv tool install --python <interp>`. On a host whose only Python is outside litellm's requires-python (a too-old 3.9 or, increasingly, a too-new 3.14) that forced an incompatible interpreter and the resolve failed. Drop the detection and pass `--python-preference system`: uv reuses a compatible system Python when present and downloads a managed one otherwise, always honouring requires-python * test(router): filter aiohttp unclosed-session gc noise in test_async_fallbacks test_async_fallbacks asserts the last three captured log records are the router's fallback messages. Under the litellm_router_testing job (pytest -k router -n 4) many router tests share the module-level in_memory_llm_clients_cache (max 200, ttl 3600s). Older cached OpenAI/Azure clients get evicted while their aiohttp ClientSession is still open, and when the gc reclaims them aiohttp emits "Unclosed client session"/"Unclosed connector" through the asyncio logger. Those records land in caplog mid-test and push the expected router logs out of the last-three window, so the assertion flips to failing non-deterministically. These warnings are async cleanup noise, not router debug logs, so filter them out exactly like the existing leaked-task warnings before asserting order. The assertion on the three router fallback messages is unchanged. --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com>	2026-06-10 13:52:26 -07:00
Mateo Wang	a4a3348801	[internal copy of #28007 ] Fix/gcp model garden streaming (#28363 ) * fix(vertex): stream Model Garden Gemma/Qwen responses correctly through /v1/messages * test(vertex): cover _CombinedChunkSplitter defensive branches * test(databricks): rename test file to avoid duplicate basename collision * fix(databricks,anthropic): defensive token defaults; document single-mode splitter Address greptile P2 concerns: - databricks: default usage token fields to 0 when constructing ChatCompletionUsageBlock from a partially populated usage block — matches the defensive pattern used in ollama/vertex_ai/cohere/bedrock. - _CombinedChunkSplitter: clarify in the docstring that an instance is single-mode (sync or async, not both), since the two iteration paths hold independent upstream iterator references. Co-authored-by: Claude <claude@anthropic.com> --------- Co-authored-by: Steven Kessler <9701252+stvnksslr@users.noreply.github.com> Co-authored-by: Claude <claude@anthropic.com>	2026-06-10 12:31:00 -07:00
Yassin Kortam	410b892f77	fix(register_model): preserve built-in cache pricing when registering custom overrides under unmapped keys (#30044 ) * fix(spend-tracking): fall back to direct spend-counter increment when reservation reconcile fails When the reservation-reconcile path in `_reconcile_budget_reservation_for_counter_update` hits a Redis error, it now correctly returns an empty set so that `increment_spend_counters` re-runs the direct increment for the affected counters. Previously, the function logged the failure, invalidated the reserved counters, and still returned the reserved counter keys, which caused the caller to skip the direct increment. With the increment skipped and the counter deleted, the next request reseeded the counter from `LiteLLM_VerificationToken.spend`, a column the batched flusher only updates every few seconds, so the enforced cross-pod spend value collapsed to a stale snapshot and budget gating stopped firing for affected keys. Adds a regression test that exercises the failure path with a flaky redis backend and asserts the actual response cost lands in the shared counter. * fix(register_model): preserve built-in cache pricing when registering custom overrides under unmapped keys When a custom-priced model is registered under a key shape that get_model_info cannot resolve (e.g. litellm_params.model set to bedrock/bedrock/us.anthropic.claude-sonnet-4-6 or another non-canonical alias), register_model previously fell back to an empty existing_model. The merged entry then carried only the fields the user set explicitly (input/output cost, provider) and dropped cache pricing. Downstream the cost calculator defaulted cache_creation_input_token_cost and cache_read_input_token_cost to 0, silently dropping the bulk of the bill for cache-heavy Anthropic traffic. register_model now attempts to resolve a canonical built-in entry by stripping provider prefixes, region prefixes, and provider-specific suffixes before giving up. When a variant resolves, its defaults (notably cache pricing) are inherited while the user's explicit overrides still win. When nothing resolves and the user supplied no cache pricing, it logs a warning instead of silently under-billing. * fix(router): inherit built-in cache pricing on deployments with partial custom pricing A deployment configured with only input_cost_per_token and output_cost_per_token under model_info was being registered under its model_info.id with no cache cost fields. The cost calculator then defaulted cache_creation_input_token_cost and cache_read_input_token_cost to 0, silently billing cache_read and cache_creation tokens at zero. For cache-heavy Anthropic traffic this drops the bulk of the bill. When the deployment's litellm_params.model resolves to a built-in cost-map entry, pull the cache pricing fields from there before registering. User-specified cache fields still win on merge; only missing fields are inherited. Pairs with the register_model fallback added earlier in this branch: that handles unmapped key shapes like bedrock/bedrock/x, this handles deploy-id keys whose backend model is mapped. * fix(register_model): inherit only cache pricing on unmapped-key fallback, not provider The unmapped-key fallback in register_model copied the entire resolved built-in entry, so registering openai/command-r-plus inherited the cohere built-in's litellm_provider and get_model_info(custom_llm_provider=openai) could no longer resolve it. Restrict the fallback to the cache-pricing fields, matching the router-side _inherit_builtin_cache_pricing, so the cache-cost dropout stays fixed without clobbering the registered provider. Add a direct unit test for Router._inherit_builtin_cache_pricing so the router coverage check sees it, and pin the fixed spend-counter contract: when reservation reconcile fails the counter must hold the directly incremented cost rather than being left at None.	2026-06-10 12:11:03 -07:00
michelligabriele	f9293d40c4	fix(proxy): self-heal startup/reload prisma reads on engine disconnect (#28803 )	2026-06-10 20:16:58 +02:00
Sameer Kankute	3b40ac987f	Litellm oss 090626 (#30021 ) * fix(mcp): report scoped server name during initialize (#29865) * fix mcp scoped server name * Update litellm/proxy/_experimental/mcp_server/mcp_context.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * test(mcp): cover scoped server name in the SSE initialize handler --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * fix(ui): show all session logs in the drawer, not just the first 50 (#29795) * fix(ui): show newest session logs first * test(ui): keep session log pagination coverage * fix(ui): show all session logs in the drawer, not just the first page The session detail drawer fetched session logs via sessionSpendLogsCall without page/page_size, so it only ever received the backend default of one page (50 rows). Sessions with more than 50 calls had the rest unreachable in the UI (#29153). sessionSpendLogsCall now takes page/page_size, and the drawer fetches the first page, reads total_pages, then fetches the remaining pages and accumulates them before the existing client-side sort. This keeps the single continuous list (and the selected-log lookup and keyboard navigation, which all assume the full session) correct. Fetching is bounded by a page cap, and the sidebar shows a "showing most recent N" note if a session exceeds it. The rows are lightweight metadata (the endpoint excludes messages/response), so the full set is small; request/response bodies are still loaded per log on demand. * fix(ui): default session drawer to most recent log, newest first Open a session with its most recent log selected, and order the sidebar newest-first to match the all-sessions logs overview. MCP calls stay grouped last. The latest log by time is computed explicitly, since the MCP grouping means it is not always the first row. * Apply fetching pages in batches suggestion from @greptile-apps[bot] Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * fix(ui): derive session total from accumulated rows when backend omits it Compute the session total after all pages are fetched, falling back to the accumulated row count rather than the first page's. Guards the truncation note against a backend response that omits total but spans multiple pages. --------- Co-authored-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * fix(proxy): handle Mistral multipart passthrough (#29927) * fix(proxy): handle Mistral multipart passthrough * chore: satisfy passthrough ci formatting * test(proxy): cover Mistral passthrough in CI shard * fix(vertex_ai): use REP host for context caching on eu/us multi-region endpoints (#29573) Context caching built the cachedContents URL as https://{location}-aiplatform.googleapis.com, which is an invalid host for the eu/us multi-region endpoints and returns 404. The inference path already resolves these to the REP host (https://aiplatform.{geo}.rep.googleapis.com) via get_vertex_base_url(); reuse that helper in _get_token_and_url_context_caching so caching uses the same host as inference. Adds tests covering the eu/us multi-region cachedContents URLs (v1 and v1beta1). Fixes #29571 * Support per-model encrypted content affinity config (#29760) Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * fix: propagate upstream status code in proxy API exception handler (#29402) * fix: propagate upstream status code in proxy API exception handler When Google GenAI / Vertex returns a 404 for deprecated or missing models via streamGenerateContent, the exception was falling through to a generic handler that defaulted to 500. Now provider exceptions carrying a valid HTTP status_code correctly propagate it through to the ProxyException. * fix: apply black formatting to common_request_processing.py * fix: tighten status code range to 400-599 and deduplicate ProxyException raise * fix(tests): use valid vertex_location in context caching tests Replace "test_location" (contains underscore) with "us-central1" so tests pass the regex validation added in get_vertex_base_url(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(sdk): add xAI OAuth provider (#29866) * Add xAI OAuth provider * Update oauth.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Fix xAI OAuth CI failures * Add xAI OAuth coverage tests * Move xAI OAuth coverage tests to core utils * Address xAI OAuth review comments * Prevent xAI OAuth api_base token exfiltration * Treat blank xAI OAuth api keys as absent * Wrap invalid xAI OAuth JSON responses * Use xAI OAuth behind explicit flag --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * fix(proxy) #27734 allow clearing budget_duration and team_member fields by sending null on /key/update and /team/update (#27751) * fix(proxy): allow clearing budget_duration and team_member fields by sending null on /key/update and /team/update Fixes #27734 Sending null for budget_duration, team_member_budget, team_member_budget_duration, team_member_rpm_limit, or team_member_tpm_limit via /key/update or /team/update returned 200 OK but silently ignored the null value. The fields remained unchanged in the database. Root causes: - /key/update: prepare_key_update_data() popped budget_duration from the update dict but never re-added it (or budget_reset_at) when the value was None. - /team/update: _set_budget_reset_at() only acted when budget_duration was non-None, leaving a stale budget_reset_at in the DB. - /team/update: team_member_* null values bypassed the budget table update entirely because should_create_budget() requires at least one non-None field. * test(proxy): cover no-budget-row path in clear_team_member_budget_fields * fix(presidio): unmask PII tokens in Anthropic native SSE streaming bytes (#30028) * fix(presidio): unmask PII tokens in Anthropic native SSE streaming bytes When output_parse_pii=true on the Anthropic native path (anthropic/claude-), response chunks arrive as raw bytes in SSE format. _stream_pii_unmasking was yielding those bytes unchanged, so <PERSON_1> tokens were never replaced with the original values before reaching the caller. Add _unmask_sse_bytes_chunk to parse each data: line, find content_block_delta / text_delta events, and apply _unmask_pii_text before re-encoding. Wire it into _stream_pii_unmasking so bytes chunks are unmasked when pii_tokens exist. fix(presidio): handle CRLF line endings and non-ASCII PII in SSE unmask Strip trailing \r before the [DONE] guard so CRLF-terminated SSE chunks don't bypass it and silently swallow a JSONDecodeError. Add ensure_ascii=False to json.dumps so non-ASCII replacement values like accented names are preserved as UTF-8 on the wire rather than being \uXXXX-escaped. Add regression tests for both cases. * feat(bedrock_mantle): path-aware Responses routing (/v1/responses vs /openai/v1/responses) (#29925) * feat(bedrock_mantle): path-aware Responses routing (/v1/responses vs /openai/v1/responses) Bedrock Mantle serves the Responses API on two upstream paths: - gpt frontier models (gpt-5.5 / gpt-5.4) on /openai/v1/responses - every other Responses-capable model (e.g. gpt-oss) on the standard /v1/responses BedrockMantleResponsesAPIConfig gains a `use_openai_path` flag; the provider gate in utils.py picks the path per model: openai.gpt-* (non gpt-oss) -> /openai/v1/responses; any model declared mode=responses (price-map entry or user model_info) -> /v1/responses; everything else returns None and keeps the existing chat-completions emulation. Adds gpt-5.5 / gpt-5.4 price-map entries, registry wiring, and the routing-matrix tests. * feat(bedrock_mantle): data-driven frontier routing via use_openai_responses_path Addresses the Greptile review point that frontier detection should be a price-map field rather than a hardcoded name match. The gate now routes a model to /openai/v1/responses when its price-map entry declares use_openai_responses_path, so a frontier model whose name does not follow the openai.gpt- convention can be onboarded by JSON alone. The name-convention check is kept as a fallback that needs no price-map entry, which preserves zero-change routing for a future gpt-6 before its entry loads. gpt-5.5 / gpt-5.4 get the flag in both price maps. Adds tests for the data-driven flag path and for the flag presence on the gpt-5.x entries; both branches are mutation-tested. * test(model_prices): allow use_openai_responses_path in price-map schema The model_prices_and_context_window.json schema validator (test_aaamodel_prices_and_context_window_json_is_valid) enforces additionalProperties: false, so the new use_openai_responses_path flag on the gpt-5.5 / gpt-5.4 entries failed validation. Add it to the schema as a boolean, alongside the other supports_* / capability flags. * Add Tensormesh serverless models to the model cost map (#30037) * Add Tensormesh serverless models to the model cost map * Flag reasoning support on the Tensormesh models that expose thinking mode * fix(proxy): invalidate stale key spend counter after budget reset or manual spend update (#30001) * fix(proxy): reconcile stale key spend counter after budget reset * fix(proxy): invalidate stale key spend counter after budget reset or manual spend update * fix(proxy): remove read-time stale counter reconciliation to prevent budget bypass * revert: undo unrelated formatting changes in enterprise directory * test(proxy): add unit test for key spend update invalidating counter * test(proxy): fix mocked update_data and hash token expectations in unit test * fix(proxy): use Responses-API transformer in pass-through cost tracking (#29728) The `elif is_responses:` branch of `openai_passthrough_handler` was calling the chat-completions `transform_response` on a Responses API payload. The chat-completions transformer expects `choices: [...]` in the raw response; the Responses API uses `output: [...]` and `usage.input_tokens` / `usage.output_tokens` (not `prompt_tokens` / `completion_tokens`). The result was a KeyError 'choices' deep inside `convert_to_model_response_object`, swallowed by the surrounding `except Exception` in the handler, and the SpendLogs row was written by the fallback path with zeroed-out tokens, spend, and model. This bug silently undercounts cost for every successful pass-through call to either OpenAI's `/v1/responses` or Azure's `/openai/v1/responses` (deployments configured for the Responses API). Reproduced 2026-06-04 against a real Azure OpenAI Responses API deployment proxied through LiteLLM v1.88.0. Fix: use the dedicated `OpenAIResponsesAPIConfig.transform_response_api_response` for the Responses branch. This transformer already exists in LiteLLM (`litellm/llms/openai/responses/transformation.py`) and knows the Responses-API on-the-wire shape. `litellm.completion_cost` already handles `ResponsesAPIResponse` natively with `call_type="responses"`, so no downstream changes are needed. Tests: test_responses_api_uses_responses_transformer_not_chat_completions NEW. Real regression test — exercises the openai_passthrough_handler with a real-shaped Responses payload (no `choices`, has `output` and Responses-API `usage` keys) and NO mocked `get_provider_config`. Pre-fix: raises KeyError 'choices' inside the chat-completions transformer (the bug). Post-fix: returns a ResponsesAPIResponse, completion_cost is called with call_type="responses" and a ResponsesAPIResponse instance (asserted). Verified to fail on un-fixed handler + pass on fixed handler before commit. test_responses_api_cost_tracking UPDATED. Old test mocked `get_provider_config` (no longer called in the responses branch post-fix). Now mocks the Responses transformer directly (`OpenAIResponsesAPIConfig.transform_response_api_response`) to test the downstream cost-calc contract. Out of scope for this PR (separate followup): - Recognizing .cognitiveservices.azure.com (the newer Azure OpenAI hostname) in the is_openai__route checks. Separate PR. Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * fix(skills): execute DB skills by matching the litellm_skill_ tool name prefix (#30116) Skill IDs are generated as litellm_skill_<uuid> and the model-facing tool name is the sanitized skill ID, but the post-call execution gates in SkillsInjectionHook only ran tools whose name starts with "skill_", so DB skills were silently returned to the client as raw tool calls. Fixes #28122. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(anthropic): synthesize content_block_start when Responses stream omits output_item.added (#30115) * fix(team): reserve team budget raises for proxy admins on /team/update (#30030) The caller's PERSONAL max_budget was the wrong yardstick for /team/update: a team's spend ceiling has nothing to do with the admin's own key budget. That comparison was an unintended side effect of reusing _check_user_team_limits() (which exists for the /team/new path) and broke the UI, which re-sends the unchanged budget on every save. New behavior on /team/update for standalone teams: - A team admin (already authorized via _verify_team_access) may freely KEEP or LOWER the team budget, and change models/tpm/rpm, without being gated by their personal limits. - GROWING a team's spend ceiling is a budget-authority action reserved for proxy admins -> 403 for team admins. "Growing" covers both raising max_budget above the team's current finite value and removing the cap entirely (max_budget=null, detected via model_fields_set so an explicit null is distinguished from an omitted field). For a team that currently has no cap, setting a finite value is a restriction and is allowed. - Org-scoped teams remain governed by _check_org_team_limits() (capped by the org budget). Also reverts the #29525 existing_team_max_budget workaround in _check_user_team_limits() back to the create-only form; /team/new still enforces the creator's personal caps. docs(access_control): resolve the contradiction in the team-admin section — team admins can keep/lower the budget and manage rate limits/models, but cannot raise the team budget (proxy-admin only). tests: unit + behavior coverage for raise-blocked, cap-removal-blocked (team admin), raise/removal allowed (proxy admin), uncapped-team restriction allowed, keep/lower/resend allowed, and unchanged create-path guards. Co-authored-by: Cursor <cursoragent@cursor.com> * test(ui): data-driven App Router migration E2E smoke (default + server-root-path) (#29974) * test(ui): add a data-driven App Router migration E2E smoke Add a growing Playwright smoke for migrated pages: for each segment it deep-links to the path route, asserts the URL and that the dashboard shell rendered, then clicks off to a legacy page and asserts navigation still works. Driven by e2e_tests/fixtures/migratedPages.ts, so adding a page is one line. Runs in two situations against the same proxy: the default mount (npm run e2e:migration) and a non-root SERVER_ROOT_PATH mount (npm run e2e:migration:root). globalSetup now logs in at `${SERVER_ROOT_PATH}/ui/login` so the admin storage state is valid under a prefix. Seeded with api-reference; append the rest as their migrations merge. * test(ui): support headed slow-motion + watch pauses in the migration smoke Honor SLOWMO in the server-root-path config (the default config already did), and add an env-gated E2E_WATCH_MS pause so a headed run lingers on each state. Both are no-ops by default, so CI behavior is unchanged. * test(ui): make the migration smoke a sidebar-click user journey Rework the smoke from deep-linking to a real navigation journey: start at the landing page, click the migrated page in the sidebar (expanding submenus for nested items), assert the path route rendered, reload it (the check a wrong server_root_path breaks), bounce to a legacy page and back, and — once two pages are migrated — navigate directly between two migrated pages. Verifies via URL + shell render, driven by the same fixture list. * test(ui): address review on the migration smoke Escape ROOT and segment before interpolating them into RegExp URL matchers so a future segment containing regex metacharacters can't silently widen the match. Make the server-root-path config fail fast when SERVER_ROOT_PATH is unset instead of silently re-running the default mount and passing without exercising the prefix. * test(ui): drop unused watch helper and fix stale smoke README * test(ui): run the migration smoke under a server root path in CI * test(ui): harden + instrument the server-root-path proxy reboot in CI * test(ui): run the server-root-path migration smoke as its own CI job Replace the in-place proxy reboot in e2e_ui_testing with a dedicated e2e_ui_testing_server_root_path job that boots the proxy once with SERVER_ROOT_PATH=/litellm, matching how every other proxy variant in the config gets its own job rather than killing and relaunching the live proxy. The reboot was failing deterministically: after pkill -9 and relaunch the prefixed proxy never came back up on :4000 (connection refused), so the smoke never ran. The readiness step that was supposed to surface the cause could never reach its boot-log tail because CircleCI runs steps under bash -eo pipefail and the preceding `curl -sv ... \| tail` aborted the step with curl's exit 7. Booting the proxy as the job's own background step lets any boot crash land in that step's log instead of being swallowed. The default e2e_ui_testing job is unchanged aside from dropping the reboot, prefixed-readiness, and prefixed-smoke steps; the migration smoke still runs at the root mount there via the default Playwright config. * fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through (#24232) * fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through * test: mock post_call_response_headers_hook in audio speech route tests * chore(ui): remove dead App Router route stubs under (dashboard) (#30045) models-and-endpoints, organizations, and virtual-keys each had a page.tsx route under (dashboard)/ that is not in MIGRATED_PAGES, so the sidebar and deep links never resolve to it and the route is unreachable. Each was a thin wrapper that handed the shared view empty or no-op props (empty modelData with a no-op setModelData, hardcoded empty organizations, no-op setUserRole/setUserEmail), so reaching one would render a degraded page in any case. The real wrapper belongs in the PR that flips each page into MIGRATED_PAGES, written with eyes on it and a test This continues the dead-scaffolding cleanup from #28891. The shared components these wrappers rendered (ModelsAndEndpointsView, OrganizationFilters) stay, since the legacy ?page= switch in app/page.tsx and src/components still import them * fix(ui/mcp): reset OAuth state on create-server modal close so a prior server's token no longer leaks into the next add-server session (#30000) * fix(ui/mcp): reset OAuth hook state on modal close so a prior server's token no longer leaks into the next add-server session * fix(ui/mcp): clear in-flight OAuth guard on reset and reset form/tools on modal close so nothing leaks on a parent-driven dismiss * fix(mcp): allow team access-group grants in OAuth authorize/token access check (#30041) * fix(mcp): honor team access-group grants in OAuth authorize/token access check * test(mcp): mock build_effective_auth_contexts in non-admin authorize tests for isolation * docs(security): require a reproduction video for vulnerability reports (#30048) (#30063) With AI models capable of automated vulnerability discovery now publicly available, we expect a large increase in report volume, much of it unverified. Requiring a video of the exploit running against a live instance raises the bar for submissions and keeps triage focused on reproducible issues. Reports without a video will be closed and reopened if one is added later. Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com> * feat(ui): add admin flag to disable in-product UI nudges for everyone (#29796) * feat(ui): add admin flag to disable in-product UI nudges for everyone Admins can now suppress the survey and Claude Code feedback popups for all users via a single disable_ui_nudges UI setting, instead of relying on each user dismissing them individually. * fix(ui): suppress nudges while ui settings are loading Gate nudgesDisabled on the ui-settings loading state so an admin with disable_ui_nudges on doesn't see the survey prompt flash, and the getInProductNudgesCall fetch doesn't fire, on a cold page load before the flag resolves. Falls back to showing nudges if the fetch errors. * test(ui): wrap CreateKeyPage test in QueryClientProvider page.tsx now calls useUISettings (react-query), which needs a QueryClient that layout.tsx supplies in production but the test did not. Add the provider and mock getUiSettings so the query resolves. * chore(ui): remove dead dashboard files and unused dependencies (#30047) * chore(ui): remove dead dashboard files and unused dependencies knip flagged seven orphaned source/config files with no importers and five declared dependencies that nothing in the tree uses. Removing them shrinks the dashboard bundle's source surface and keeps the manifest honest; vite stays installed transitively via vitest, so test tooling is unaffected. * fix(ci): restore serverRootPath.config.ts referenced by SERVER_ROOT_PATH workflow The dead-code sweep removed e2e_tests/serverRootPath.config.ts, but its spec (tests/login/serverRootPathRedirect.spec.ts) and the test_server_root_path.yml workflow step still depend on it, so the redirect e2e job failed to load a config that no longer existed. * fix(proxy): authorize batch files using upload target_model_names (LIT-3593) (#30009) * fix(proxy): authorize batch files using upload target_model_names (LIT-3593) After replace_model_in_jsonl, body.model is a stripped provider id. Reverse-mapping it via resolve_model_name_from_model_id is first-match on model_list and caused false 403s when multiple deployments share the same stripped name. Use target_model_names from the unified file id instead. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593) Restores the reverse-lookup for the JSONL body.model fallback path so that legacy/pre-target_model_names managed files still map stripped provider IDs back to proxy aliases before auth. Also cleans up redundant `or None`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Revert "fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593)" This reverts commit 30d2e96f77ef521ccaaf2193fe554980380eb669. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Add Claude Fable 5 across Anthropic, Bedrock, Vertex AI, and Azure AI (#30064) * Add Claude Fable 5 across Anthropic, Bedrock, Vertex AI, and Azure AI Adds cost map entries for claude-fable-5 ($10/$50 per MTok, 1M context, 128K output, adaptive thinking only) on the Anthropic API, Bedrock converse (base, global, and us/eu geo inference profiles at the 10% regional premium), Vertex AI, and Azure AI (Microsoft Foundry, which serves Fable 5 with the full 1M context window unlike Opus 4.8). Registers anthropic.claude-fable-5 in BEDROCK_CONVERSE_MODELS, lists the model in the setup wizard, and extends the reasoning effort e2e grid. The Bedrock, Vertex, and Azure grid cells carry fail_reason markers until the CI accounts are provisioned: Bedrock needs the provider data sharing opt-in Fable 5 requires, and the Foundry resource needs a claude-fable-5 deployment. The first-party entry carries provider_specific_entry {us: 1.1} for the inference_geo premium and deliberately no fast multiplier since Fable 5 has no fast mode. https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * Drop removed sampling params for Claude 4.7+ when drop_params is set Fable 5, Opus 4.7, and Opus 4.8 removed sampling params: the API rejects top_p, top_k, and any temperature other than 1 with a 400. LiteLLM was forwarding them even with drop_params enabled because the Anthropic and Bedrock converse transformations passed temperature/top_p through unconditionally. Mirror the GPT-5/o-series handling: temperature=1 still passes through, other values and any top_p are dropped when drop_params is set, and without drop_params a clean client-side UnsupportedParamsError tells the caller how to opt in, instead of surfacing the raw provider error. https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * Drive sampling param gating from the cost map and cover top_k Greptile review follow-ups on the sampling param fix: the restriction for Fable 5 / Opus 4.7 / 4.8 is now declared as supports_sampling_params: false on every affected cost map entry (perplexity excluded; that route is OpenAI-compatible and maps sampling params upstream) and read back through a tri-state map lookup, keeping the name check only as a fallback for provider-routed ids whose hosted map entries predate the flag, the same layering supports_adaptive_thinking uses. top_k bypasses map_openai_params as a provider-specific kwarg, so it is gated at the shared AnthropicConfig.transform_request boundary (direct, Bedrock invoke, Vertex, Azure) and in the Bedrock converse _handle_top_k_value path, with drop_params threaded through the converse transform helpers. Also updates the reasoning effort grid cell count assertion for the four Fable 5 rows added on this branch (29 x 11 cells). https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * Declare supports_sampling_params in the cost map schema The model map validation schema uses additionalProperties: false, so the new flag must be declared for the 28 entries that carry it; this was the one failing job (misc / Run tests) on the previous commit. https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * fix(bedrock): gate top_k=0 on converse to match Anthropic boundary Truthiness check let top_k=0 silently disappear on models that removed sampling params, while AnthropicConfig.transform_request treats 0 as present and raises UnsupportedParamsError (or drops when drop_params is set). Switch to 'is not None' so converse, direct Anthropic, invoke, Vertex, and Azure all behave the same for top_k=0. --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> * fix(anthropic): avoid index -1 content_block_delta in messages stream When a /v1/messages request is routed through the Responses API adapter, AnthropicResponsesStreamWrapper only emits content_block_start on response.output_item.added. Some upstreams (LMStudio for example) never send that event, so the text delta handler fell back to _current_block_index, which starts at -1, and clients received content_block_delta events with index -1 and no preceding content_block_start. Anthropic SDKs then fail with "text part -1 not found" The text delta handler now synthesizes a content_block_start with a fresh block index whenever the delta references an unregistered item_id or no block is open yet, and registers the item_id so follow-up deltas reuse the same index Addresses the /v1/messages defect in #27442 * Make test sys.path shim resolve relative to the file, not the CWD os.path.abspath("../../../../../../..") depends on where pytest is invoked from; anchoring on os.path.dirname(__file__) makes the import work from any working directory. Also corrects the depth: the repo root is six levels above this file, not seven. --------- Co-authored-by: milan-berri <milan@berri.ai> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: ryan-crabbe-berri <ryan@berri.ai> Co-authored-by: michelligabriele <gabriele.michelli@icloud.com> Co-authored-by: tin-berri <tin@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com> Co-authored-by: Sameer Kankute <sameer@berri.ai> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com> * fix: enable compact-2026-01-12 beta header for vertex_ai provider (#30114) * fix(team): reserve team budget raises for proxy admins on /team/update (#30030) The caller's PERSONAL max_budget was the wrong yardstick for /team/update: a team's spend ceiling has nothing to do with the admin's own key budget. That comparison was an unintended side effect of reusing _check_user_team_limits() (which exists for the /team/new path) and broke the UI, which re-sends the unchanged budget on every save. New behavior on /team/update for standalone teams: - A team admin (already authorized via _verify_team_access) may freely KEEP or LOWER the team budget, and change models/tpm/rpm, without being gated by their personal limits. - GROWING a team's spend ceiling is a budget-authority action reserved for proxy admins -> 403 for team admins. "Growing" covers both raising max_budget above the team's current finite value and removing the cap entirely (max_budget=null, detected via model_fields_set so an explicit null is distinguished from an omitted field). For a team that currently has no cap, setting a finite value is a restriction and is allowed. - Org-scoped teams remain governed by _check_org_team_limits() (capped by the org budget). Also reverts the #29525 existing_team_max_budget workaround in _check_user_team_limits() back to the create-only form; /team/new still enforces the creator's personal caps. docs(access_control): resolve the contradiction in the team-admin section — team admins can keep/lower the budget and manage rate limits/models, but cannot raise the team budget (proxy-admin only). tests: unit + behavior coverage for raise-blocked, cap-removal-blocked (team admin), raise/removal allowed (proxy admin), uncapped-team restriction allowed, keep/lower/resend allowed, and unchanged create-path guards. Co-authored-by: Cursor <cursoragent@cursor.com> * test(ui): data-driven App Router migration E2E smoke (default + server-root-path) (#29974) * test(ui): add a data-driven App Router migration E2E smoke Add a growing Playwright smoke for migrated pages: for each segment it deep-links to the path route, asserts the URL and that the dashboard shell rendered, then clicks off to a legacy page and asserts navigation still works. Driven by e2e_tests/fixtures/migratedPages.ts, so adding a page is one line. Runs in two situations against the same proxy: the default mount (npm run e2e:migration) and a non-root SERVER_ROOT_PATH mount (npm run e2e:migration:root). globalSetup now logs in at `${SERVER_ROOT_PATH}/ui/login` so the admin storage state is valid under a prefix. Seeded with api-reference; append the rest as their migrations merge. * test(ui): support headed slow-motion + watch pauses in the migration smoke Honor SLOWMO in the server-root-path config (the default config already did), and add an env-gated E2E_WATCH_MS pause so a headed run lingers on each state. Both are no-ops by default, so CI behavior is unchanged. * test(ui): make the migration smoke a sidebar-click user journey Rework the smoke from deep-linking to a real navigation journey: start at the landing page, click the migrated page in the sidebar (expanding submenus for nested items), assert the path route rendered, reload it (the check a wrong server_root_path breaks), bounce to a legacy page and back, and — once two pages are migrated — navigate directly between two migrated pages. Verifies via URL + shell render, driven by the same fixture list. * test(ui): address review on the migration smoke Escape ROOT and segment before interpolating them into RegExp URL matchers so a future segment containing regex metacharacters can't silently widen the match. Make the server-root-path config fail fast when SERVER_ROOT_PATH is unset instead of silently re-running the default mount and passing without exercising the prefix. * test(ui): drop unused watch helper and fix stale smoke README * test(ui): run the migration smoke under a server root path in CI * test(ui): harden + instrument the server-root-path proxy reboot in CI * test(ui): run the server-root-path migration smoke as its own CI job Replace the in-place proxy reboot in e2e_ui_testing with a dedicated e2e_ui_testing_server_root_path job that boots the proxy once with SERVER_ROOT_PATH=/litellm, matching how every other proxy variant in the config gets its own job rather than killing and relaunching the live proxy. The reboot was failing deterministically: after pkill -9 and relaunch the prefixed proxy never came back up on :4000 (connection refused), so the smoke never ran. The readiness step that was supposed to surface the cause could never reach its boot-log tail because CircleCI runs steps under bash -eo pipefail and the preceding `curl -sv ... \| tail` aborted the step with curl's exit 7. Booting the proxy as the job's own background step lets any boot crash land in that step's log instead of being swallowed. The default e2e_ui_testing job is unchanged aside from dropping the reboot, prefixed-readiness, and prefixed-smoke steps; the migration smoke still runs at the root mount there via the default Playwright config. * fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through (#24232) * fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through * test: mock post_call_response_headers_hook in audio speech route tests * chore(ui): remove dead App Router route stubs under (dashboard) (#30045) models-and-endpoints, organizations, and virtual-keys each had a page.tsx route under (dashboard)/ that is not in MIGRATED_PAGES, so the sidebar and deep links never resolve to it and the route is unreachable. Each was a thin wrapper that handed the shared view empty or no-op props (empty modelData with a no-op setModelData, hardcoded empty organizations, no-op setUserRole/setUserEmail), so reaching one would render a degraded page in any case. The real wrapper belongs in the PR that flips each page into MIGRATED_PAGES, written with eyes on it and a test This continues the dead-scaffolding cleanup from #28891. The shared components these wrappers rendered (ModelsAndEndpointsView, OrganizationFilters) stay, since the legacy ?page= switch in app/page.tsx and src/components still import them * fix(ui/mcp): reset OAuth state on create-server modal close so a prior server's token no longer leaks into the next add-server session (#30000) * fix(ui/mcp): reset OAuth hook state on modal close so a prior server's token no longer leaks into the next add-server session * fix(ui/mcp): clear in-flight OAuth guard on reset and reset form/tools on modal close so nothing leaks on a parent-driven dismiss * fix(mcp): allow team access-group grants in OAuth authorize/token access check (#30041) * fix(mcp): honor team access-group grants in OAuth authorize/token access check * test(mcp): mock build_effective_auth_contexts in non-admin authorize tests for isolation * docs(security): require a reproduction video for vulnerability reports (#30048) (#30063) With AI models capable of automated vulnerability discovery now publicly available, we expect a large increase in report volume, much of it unverified. Requiring a video of the exploit running against a live instance raises the bar for submissions and keeps triage focused on reproducible issues. Reports without a video will be closed and reopened if one is added later. Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com> * feat(ui): add admin flag to disable in-product UI nudges for everyone (#29796) * feat(ui): add admin flag to disable in-product UI nudges for everyone Admins can now suppress the survey and Claude Code feedback popups for all users via a single disable_ui_nudges UI setting, instead of relying on each user dismissing them individually. * fix(ui): suppress nudges while ui settings are loading Gate nudgesDisabled on the ui-settings loading state so an admin with disable_ui_nudges on doesn't see the survey prompt flash, and the getInProductNudgesCall fetch doesn't fire, on a cold page load before the flag resolves. Falls back to showing nudges if the fetch errors. * test(ui): wrap CreateKeyPage test in QueryClientProvider page.tsx now calls useUISettings (react-query), which needs a QueryClient that layout.tsx supplies in production but the test did not. Add the provider and mock getUiSettings so the query resolves. * chore(ui): remove dead dashboard files and unused dependencies (#30047) * chore(ui): remove dead dashboard files and unused dependencies knip flagged seven orphaned source/config files with no importers and five declared dependencies that nothing in the tree uses. Removing them shrinks the dashboard bundle's source surface and keeps the manifest honest; vite stays installed transitively via vitest, so test tooling is unaffected. * fix(ci): restore serverRootPath.config.ts referenced by SERVER_ROOT_PATH workflow The dead-code sweep removed e2e_tests/serverRootPath.config.ts, but its spec (tests/login/serverRootPathRedirect.spec.ts) and the test_server_root_path.yml workflow step still depend on it, so the redirect e2e job failed to load a config that no longer existed. * fix(proxy): authorize batch files using upload target_model_names (LIT-3593) (#30009) * fix(proxy): authorize batch files using upload target_model_names (LIT-3593) After replace_model_in_jsonl, body.model is a stripped provider id. Reverse-mapping it via resolve_model_name_from_model_id is first-match on model_list and caused false 403s when multiple deployments share the same stripped name. Use target_model_names from the unified file id instead. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593) Restores the reverse-lookup for the JSONL body.model fallback path so that legacy/pre-target_model_names managed files still map stripped provider IDs back to proxy aliases before auth. Also cleans up redundant `or None`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Revert "fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593)" This reverts commit 30d2e96f77ef521ccaaf2193fe554980380eb669. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Add Claude Fable 5 across Anthropic, Bedrock, Vertex AI, and Azure AI (#30064) * Add Claude Fable 5 across Anthropic, Bedrock, Vertex AI, and Azure AI Adds cost map entries for claude-fable-5 ($10/$50 per MTok, 1M context, 128K output, adaptive thinking only) on the Anthropic API, Bedrock converse (base, global, and us/eu geo inference profiles at the 10% regional premium), Vertex AI, and Azure AI (Microsoft Foundry, which serves Fable 5 with the full 1M context window unlike Opus 4.8). Registers anthropic.claude-fable-5 in BEDROCK_CONVERSE_MODELS, lists the model in the setup wizard, and extends the reasoning effort e2e grid. The Bedrock, Vertex, and Azure grid cells carry fail_reason markers until the CI accounts are provisioned: Bedrock needs the provider data sharing opt-in Fable 5 requires, and the Foundry resource needs a claude-fable-5 deployment. The first-party entry carries provider_specific_entry {us: 1.1} for the inference_geo premium and deliberately no fast multiplier since Fable 5 has no fast mode. https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * Drop removed sampling params for Claude 4.7+ when drop_params is set Fable 5, Opus 4.7, and Opus 4.8 removed sampling params: the API rejects top_p, top_k, and any temperature other than 1 with a 400. LiteLLM was forwarding them even with drop_params enabled because the Anthropic and Bedrock converse transformations passed temperature/top_p through unconditionally. Mirror the GPT-5/o-series handling: temperature=1 still passes through, other values and any top_p are dropped when drop_params is set, and without drop_params a clean client-side UnsupportedParamsError tells the caller how to opt in, instead of surfacing the raw provider error. https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * Drive sampling param gating from the cost map and cover top_k Greptile review follow-ups on the sampling param fix: the restriction for Fable 5 / Opus 4.7 / 4.8 is now declared as supports_sampling_params: false on every affected cost map entry (perplexity excluded; that route is OpenAI-compatible and maps sampling params upstream) and read back through a tri-state map lookup, keeping the name check only as a fallback for provider-routed ids whose hosted map entries predate the flag, the same layering supports_adaptive_thinking uses. top_k bypasses map_openai_params as a provider-specific kwarg, so it is gated at the shared AnthropicConfig.transform_request boundary (direct, Bedrock invoke, Vertex, Azure) and in the Bedrock converse _handle_top_k_value path, with drop_params threaded through the converse transform helpers. Also updates the reasoning effort grid cell count assertion for the four Fable 5 rows added on this branch (29 x 11 cells). https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * Declare supports_sampling_params in the cost map schema The model map validation schema uses additionalProperties: false, so the new flag must be declared for the 28 entries that carry it; this was the one failing job (misc / Run tests) on the previous commit. https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * fix(bedrock): gate top_k=0 on converse to match Anthropic boundary Truthiness check let top_k=0 silently disappear on models that removed sampling params, while AnthropicConfig.transform_request treats 0 as present and raises UnsupportedParamsError (or drops when drop_params is set). Switch to 'is not None' so converse, direct Anthropic, invoke, Vertex, and Azure all behave the same for top_k=0. --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> * fix: enable compact-2026-01-12 beta header for vertex_ai provider The vertex_ai block in anthropic_beta_headers_config.json mapped compact-2026-01-12 to null, so update_headers_with_filtered_beta stripped the header before the request reached Vertex while the compact_20260112 context edit stayed in the body, and Vertex rejected the request with HTTP 400. Vertex rawPredict accepts the header, and the bedrock and databricks blocks already forward it. Mirrors #21867, which enabled context-1m-2025-08-07 for vertex_ai the same way. Fixes #27290. --------- Co-authored-by: milan-berri <milan@berri.ai> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: ryan-crabbe-berri <ryan@berri.ai> Co-authored-by: michelligabriele <gabriele.michelli@icloud.com> Co-authored-by: tin-berri <tin@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com> Co-authored-by: Sameer Kankute <sameer@berri.ai> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com> * fix(proxy): coerce litellm_settings.max_budget env var to float (#30113) * fix(team): reserve team budget raises for proxy admins on /team/update (#30030) The caller's PERSONAL max_budget was the wrong yardstick for /team/update: a team's spend ceiling has nothing to do with the admin's own key budget. That comparison was an unintended side effect of reusing _check_user_team_limits() (which exists for the /team/new path) and broke the UI, which re-sends the unchanged budget on every save. New behavior on /team/update for standalone teams: - A team admin (already authorized via _verify_team_access) may freely KEEP or LOWER the team budget, and change models/tpm/rpm, without being gated by their personal limits. - GROWING a team's spend ceiling is a budget-authority action reserved for proxy admins -> 403 for team admins. "Growing" covers both raising max_budget above the team's current finite value and removing the cap entirely (max_budget=null, detected via model_fields_set so an explicit null is distinguished from an omitted field). For a team that currently has no cap, setting a finite value is a restriction and is allowed. - Org-scoped teams remain governed by _check_org_team_limits() (capped by the org budget). Also reverts the #29525 existing_team_max_budget workaround in _check_user_team_limits() back to the create-only form; /team/new still enforces the creator's personal caps. docs(access_control): resolve the contradiction in the team-admin section — team admins can keep/lower the budget and manage rate limits/models, but cannot raise the team budget (proxy-admin only). tests: unit + behavior coverage for raise-blocked, cap-removal-blocked (team admin), raise/removal allowed (proxy admin), uncapped-team restriction allowed, keep/lower/resend allowed, and unchanged create-path guards. Co-authored-by: Cursor <cursoragent@cursor.com> * test(ui): data-driven App Router migration E2E smoke (default + server-root-path) (#29974) * test(ui): add a data-driven App Router migration E2E smoke Add a growing Playwright smoke for migrated pages: for each segment it deep-links to the path route, asserts the URL and that the dashboard shell rendered, then clicks off to a legacy page and asserts navigation still works. Driven by e2e_tests/fixtures/migratedPages.ts, so adding a page is one line. Runs in two situations against the same proxy: the default mount (npm run e2e:migration) and a non-root SERVER_ROOT_PATH mount (npm run e2e:migration:root). globalSetup now logs in at `${SERVER_ROOT_PATH}/ui/login` so the admin storage state is valid under a prefix. Seeded with api-reference; append the rest as their migrations merge. * test(ui): support headed slow-motion + watch pauses in the migration smoke Honor SLOWMO in the server-root-path config (the default config already did), and add an env-gated E2E_WATCH_MS pause so a headed run lingers on each state. Both are no-ops by default, so CI behavior is unchanged. * test(ui): make the migration smoke a sidebar-click user journey Rework the smoke from deep-linking to a real navigation journey: start at the landing page, click the migrated page in the sidebar (expanding submenus for nested items), assert the path route rendered, reload it (the check a wrong server_root_path breaks), bounce to a legacy page and back, and — once two pages are migrated — navigate directly between two migrated pages. Verifies via URL + shell render, driven by the same fixture list. * test(ui): address review on the migration smoke Escape ROOT and segment before interpolating them into RegExp URL matchers so a future segment containing regex metacharacters can't silently widen the match. Make the server-root-path config fail fast when SERVER_ROOT_PATH is unset instead of silently re-running the default mount and passing without exercising the prefix. * test(ui): drop unused watch helper and fix stale smoke README * test(ui): run the migration smoke under a server root path in CI * test(ui): harden + instrument the server-root-path proxy reboot in CI * test(ui): run the server-root-path migration smoke as its own CI job Replace the in-place proxy reboot in e2e_ui_testing with a dedicated e2e_ui_testing_server_root_path job that boots the proxy once with SERVER_ROOT_PATH=/litellm, matching how every other proxy variant in the config gets its own job rather than killing and relaunching the live proxy. The reboot was failing deterministically: after pkill -9 and relaunch the prefixed proxy never came back up on :4000 (connection refused), so the smoke never ran. The readiness step that was supposed to surface the cause could never reach its boot-log tail because CircleCI runs steps under bash -eo pipefail and the preceding `curl -sv ... \| tail` aborted the step with curl's exit 7. Booting the proxy as the job's own background step lets any boot crash land in that step's log instead of being swallowed. The default e2e_ui_testing job is unchanged aside from dropping the reboot, prefixed-readiness, and prefixed-smoke steps; the migration smoke still runs at the root mount there via the default Playwright config. * fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through (#24232) * fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through * test: mock post_call_response_headers_hook in audio speech route tests * chore(ui): remove dead App Router route stubs under (dashboard) (#30045) models-and-endpoints, organizations, and virtual-keys each had a page.tsx route under (dashboard)/ that is not in MIGRATED_PAGES, so the sidebar and deep links never resolve to it and the route is unreachable. Each was a thin wrapper that handed the shared view empty or no-op props (empty modelData with a no-op setModelData, hardcoded empty organizations, no-op setUserRole/setUserEmail), so reaching one would render a degraded page in any case. The real wrapper belongs in the PR that flips each page into MIGRATED_PAGES, written with eyes on it and a test This continues the dead-scaffolding cleanup from #28891. The shared components these wrappers rendered (ModelsAndEndpointsView, OrganizationFilters) stay, since the legacy ?page= switch in app/page.tsx and src/components still import them * fix(ui/mcp): reset OAuth state on create-server modal close so a prior server's token no longer leaks into the next add-server session (#30000) * fix(ui/mcp): reset OAuth hook state on modal close so a prior server's token no longer leaks into the next add-server session * fix(ui/mcp): clear in-flight OAuth guard on reset and reset form/tools on modal close so nothing leaks on a parent-driven dismiss * fix(mcp): allow team access-group grants in OAuth authorize/token access check (#30041) * fix(mcp): honor team access-group grants in OAuth authorize/token access check * test(mcp): mock build_effective_auth_contexts in non-admin authorize tests for isolation * docs(security): require a reproduction video for vulnerability reports (#30048) (#30063) With AI models capable of automated vulnerability discovery now publicly available, we expect a large increase in report volume, much of it unverified. Requiring a video of the exploit running against a live instance raises the bar for submissions and keeps triage focused on reproducible issues. Reports without a video will be closed and reopened if one is added later. Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com> * feat(ui): add admin flag to disable in-product UI nudges for everyone (#29796) * feat(ui): add admin flag to disable in-product UI nudges for everyone Admins can now suppress the survey and Claude Code feedback popups for all users via a single disable_ui_nudges UI setting, instead of relying on each user dismissing them individually. * fix(ui): suppress nudges while ui settings are loading Gate nudgesDisabled on the ui-settings loading state so an admin with disable_ui_nudges on doesn't see the survey prompt flash, and the getInProductNudgesCall fetch doesn't fire, on a cold page load before the flag resolves. Falls back to showing nudges if the fetch errors. * test(ui): wrap CreateKeyPage test in QueryClientProvider page.tsx now calls useUISettings (react-query), which needs a QueryClient that layout.tsx supplies in production but the test did not. Add the provider and mock getUiSettings so the query resolves. * chore(ui): remove dead dashboard files and unused dependencies (#30047) * chore(ui): remove dead dashboard files and unused dependencies knip flagged seven orphaned source/config files with no importers and five declared dependencies that nothing in the tree uses. Removing them shrinks the dashboard bundle's source surface and keeps the manifest honest; vite stays installed transitively via vitest, so test tooling is unaffected. * fix(ci): restore serverRootPath.config.ts referenced by SERVER_ROOT_PATH workflow The dead-code sweep removed e2e_tests/serverRootPath.config.ts, but its spec (tests/login/serverRootPathRedirect.spec.ts) and the test_server_root_path.yml workflow step still depend on it, so the redirect e2e job failed to load a config that no longer existed. * fix(proxy): authorize batch files using upload target_model_names (LIT-3593) (#30009) * fix(proxy): authorize batch files using upload target_model_names (LIT-3593) After replace_model_in_jsonl, body.model is a stripped provider id. Reverse-mapping it via resolve_model_name_from_model_id is first-match on model_list and caused false 403s when multiple deployments share the same stripped name. Use target_model_names from the unified file id instead. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593) Restores the reverse-lookup for the JSONL body.model fallback path so that legacy/pre-target_model_names managed files still map stripped provider IDs back to proxy aliases before auth. Also cleans up redundant `or None`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Revert "fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593)" This reverts commit 30d2e96f77ef521ccaaf2193fe554980380eb669. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Add Claude Fable 5 across Anthropic, Bedrock, Vertex AI, and Azure AI (#30064) * Add Claude Fable 5 across Anthropic, Bedrock, Vertex AI, and Azure AI Adds cost map entries for claude-fable-5 ($10/$50 per MTok, 1M context, 128K output, adaptive thinking only) on the Anthropic API, Bedrock converse (base, global, and us/eu geo inference profiles at the 10% regional premium), Vertex AI, and Azure AI (Microsoft Foundry, which serves Fable 5 with the full 1M context window unlike Opus 4.8). Registers anthropic.claude-fable-5 in BEDROCK_CONVERSE_MODELS, lists the model in the setup wizard, and extends the reasoning effort e2e grid. The Bedrock, Vertex, and Azure grid cells carry fail_reason markers until the CI accounts are provisioned: Bedrock needs the provider data sharing opt-in Fable 5 requires, and the Foundry resource needs a claude-fable-5 deployment. The first-party entry carries provider_specific_entry {us: 1.1} for the inference_geo premium and deliberately no fast multiplier since Fable 5 has no fast mode. https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * Drop removed sampling params for Claude 4.7+ when drop_params is set Fable 5, Opus 4.7, and Opus 4.8 removed sampling params: the API rejects top_p, top_k, and any temperature other than 1 with a 400. LiteLLM was forwarding them even with drop_params enabled because the Anthropic and Bedrock converse transformations passed temperature/top_p through unconditionally. Mirror the GPT-5/o-series handling: temperature=1 still passes through, other values and any top_p are dropped when drop_params is set, and without drop_params a clean client-side UnsupportedParamsError tells the caller how to opt in, instead of surfacing the raw provider error. https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * Drive sampling param gating from the cost map and cover top_k Greptile review follow-ups on the sampling param fix: the restriction for Fable 5 / Opus 4.7 / 4.8 is now declared as supports_sampling_params: false on every affected cost map entry (perplexity excluded; that route is OpenAI-compatible and maps sampling params upstream) and read back through a tri-state map lookup, keeping the name check only as a fallback for provider-routed ids whose hosted map entries predate the flag, the same layering supports_adaptive_thinking uses. top_k bypasses map_openai_params as a provider-specific kwarg, so it is gated at the shared AnthropicConfig.transform_request boundary (direct, Bedrock invoke, Vertex, Azure) and in the Bedrock converse _handle_top_k_value path, with drop_params threaded through the converse transform helpers. Also updates the reasoning effort grid cell count assertion for the four Fable 5 rows added on this branch (29 x 11 cells). https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * Declare supports_sampling_params in the cost map schema The model map validation schema uses additionalProperties: false, so the new flag must be declared for the 28 entries that carry it; this was the one failing job (misc / Run tests) on the previous commit. https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * fix(bedrock): gate top_k=0 on converse to match Anthropic boundary Truthiness check let top_k=0 silently disappear on models that removed sampling params, while AnthropicConfig.transform_request treats 0 as present and raises UnsupportedParamsError (or drops when drop_params is set). Switch to 'is not None' so converse, direct Anthropic, invoke, Vertex, and Azure all behave the same for top_k=0. --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> * fix(proxy): coerce litellm_settings.max_budget env var to float When max_budget is set in litellm_settings via os.environ/MAX_BUDGET, the env var resolves to a string and the generic setattr branch in ProxyConfig.load_config stored it as-is, so the startup check litellm.max_budget > 0 raised TypeError. The earlier fix (#23855) only covered the CLI initialize() path. Coerce the value to float in the settings loop, matching the existing max_internal_user_budget handling. Fixes #26696. --------- Co-authored-by: milan-berri <milan@berri.ai> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: ryan-crabbe-berri <ryan@berri.ai> Co-authored-by: michelligabriele <gabriele.michelli@icloud.com> Co-authored-by: tin-berri <tin@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com> Co-authored-by: Sameer Kankute <sameer@berri.ai> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com> * fix(router): don't drop bedrock pass-through deployments using IAM credentials (#30111) * Fix Bedrock passthrough deployment dropped when using IAM credentials Bedrock deployments with use_in_pass_through enabled and IAM/OIDC auth (aws_role_name, no api_key) hit the generic pass-through branch in Router._initialize_deployment_for_pass_through, which calls set_pass_through_credentials and raises "api_key is required". The exception drops the deployment from the router entirely, breaking both passthrough and normal routing for that model. Skip the credential store write when no api_key is set; the bedrock passthrough route resolves AWS credentials at request time via BedrockConverseLLM.get_credentials(), not the passthrough credential store, so there is nothing to register here. Fixes #27728. * Reset passthrough credentials singleton before api_key credential test The test reads the module-level passthrough_endpoint_router singleton, so a stale "openai" entry written by an earlier test in the same process could make the assertion pass without exercising the code path. Clearing the credentials dict up front makes the test order-independent. * fix(sdk): stop mirroring reasoning_content in provider_specific_fields (#30110) The dict-to-response conversion path mirrored reasoning_content into provider_specific_fields, while live provider transforms (Anthropic's _build_provider_specific_fields) only set it top-level on the Message. Cache-replayed messages therefore serialized differently from live ones, breaking disk cache key stability for multi-turn conversations with extended thinking. The mirror was added for DeepSeek before Message.reasoning_content existed as a top-level attribute. The top-level field is still set by the converter, so DeepSeek's request-side promotion is unaffected. Fixes #27337. * fix(mcp): coerce mcp_server_cost_info values to float at ingest (#30109) * fix(mcp): coerce mcp_server_cost_info values to float at ingest YAML 1.1 parses scientific notation without a decimal point (e.g. 7e-05) as a string, and MCPServerCostInfo is a TypedDict with no runtime validation, so a string-typed default_cost_per_query from config.yaml flowed through the proxy untouched and crashed the MCP server settings page with '.toFixed is not a function'. Normalize mcp_server_cost_info on both the config and DB load paths, dropping non-numeric values with a warning instead of failing the server load. Fixes #27097. * fix(mcp): drop non-numeric default_cost_per_query instead of nulling it Keeping the key with a None value still exposes a null to the UI, which can crash .toFixed formatting when the consumer checks key existence rather than truthiness. Delete the key on coercion failure, matching how non-numeric per-tool cost entries are already omitted. * fix(proxy): count embedding and text completion tokens toward TPM limits (#30105) * fix(proxy): count embedding and text completion tokens toward TPM limits The parallel request limiters only read token usage off ModelResponse, so EmbeddingResponse and TextCompletionResponse objects left total_tokens at 0 and the per key, user, team, and end user TPM counters never incremented. Requests to /v1/embeddings and /v1/completions were effectively free against any tpm_limit. In the v3 limiter this was worse: the post-call reconciliation computed actual usage as 0 and refunded the pre-call reservation made at request time. Broaden the isinstance checks to accept EmbeddingResponse and TextCompletionResponse, which both expose a Usage object, at the four per-scope sites in parallel_request_limiter.py and at the usage extraction in parallel_request_limiter_v3.py. ResponsesAPIResponse was already covered in v3 via BaseLiteLLMOpenAIResponseObject. Fixes #27738. * test(proxy): cover v1 limiter TPM counting for embedding and text completion responses Exercise the broadened isinstance sites in parallel_request_limiter.py by asserting that async_log_success_event adds total_tokens to the per key, user, team, and end user TPM counters for EmbeddingResponse and TextCompletionResponse objects. The counters are pre-seeded at zero so the assertion is exactly the increment; on the pre-fix code these responses left total_tokens at 0 and the test fails. * fix(openai): forward client headers on the text completion path (#30103) * fix(openai): forward client headers on the text completion path litellm.completion() merges caller headers with extra_headers, but the text-completion-openai branch never passed the merged dict to openai_text_completions.completion(), and the handler only used its headers argument for logging. Pass the merged headers through the call site and set them as extra_headers on the outgoing request, mirroring the chat completion handler, so x-* client headers forwarded by the proxy reach the provider on /v1/completions. Fixes #27410. * Drop redundant extra_headers assignment and fix test module collision completion() merges extra_headers into headers before the text-completion-openai branch, and the handler now sets the merged headers as extra_headers on the request, so the branch-local optional_params["extra_headers"] assignment was a dead duplicate. Removing it keeps the assignment in one place while both entry paths (litellm.text_completion and direct handler callers) still forward headers; a new regression test pins the extra_headers kwarg path. Also rename the test module to test_completion_handler.py since its basename collided with tests/test_litellm/llms/bedrock/batches/ test_handler.py and broke pytest collection. * fix(bedrock): route Anthropic-shape count_tokens to InvokeModel and base64-encode the body (#30102) * fix(bedrock): route Anthropic-shape count_tokens to InvokeModel POST /v1/messages/count_tokens with Anthropic content blocks ({"type": "text"\|"tool_use"\|...}) was routed to the Converse input of the Bedrock CountTokens API. The Converse transform copies list content through verbatim, so Bedrock rejected the request with a 400 and the caller silently fell back to the local tokenizer, returning counts that can be off by ~50% on tool-heavy payloads. _detect_input_type now routes messages whose content blocks carry a "type" key (Anthropic shape) to the invokeModel input, which forwards the body verbatim. The invokeModel body is now base64-encoded as the CountTokens API requires (InvokeModelTokensRequest.body is a base64-encoded blob), and Anthropic Messages bodies get the anthropic_version and max_tokens fields Bedrock validates against. Fixes #27632. * refactor(bedrock): name the CountTokens max_tokens placeholder Replace the magic 1024 with a module-level DEFAULT_ANTHROPIC_INVOKE_MODEL_MAX_TOKENS constant so the intent is explicit and there is a single place to update if Bedrock's InvokeModel schema ever changes. Module-local rather than litellm/constants.py because the value is only a schema-validation placeholder for token counting, not a user-tunable generation default. * Add above-512k pricing tier for MiniMax-M3 and correct its base rates (#30095) * Add above-512k pricing tier support for MiniMax-M3 MiniMax-M3 doubles its per-token rates once a prompt exceeds 512k input tokens. The tiered cost parser already handles arbitrary thresholds, but get_model_info only copies whitelisted keys from ModelInfoBase, which had no 512k variants, so above_512k keys were silently dropped and long-context requests were priced at the flat rate. Add the input, output, and cache-read above_512k_tokens fields to ModelInfoBase and pass them through in get_model_info. Update the minimax/MiniMax-M3 entry with the tiered rates and correct the base rates, which matched the above-512k tier instead of the published base tier (https://platform.minimax.io/docs/guides/pricing-paygo). Fixes #29663. * Add above-512k keys to pricing schema, set MiniMax-M3 context to 1M Register the three new above_512k_tokens cost keys in the INTENDED_SCHEMA of test_aaamodel_prices_and_context_window_json_is_valid, declared the same way as the existing above_200k/above_272k tier keys, so the schema check accepts the MiniMax-M3 tiered pricing entry. Also raise MiniMax-M3 max_input_tokens from 512000 to 1000000 in both pricing JSONs. The MiniMax API docs (https://platform.minimax.io/docs/guides/text-generation) state the model supports a 1,000,000-token context window, and the pay-as-you-go pricing page (https://platform.minimax.io/docs/guides/pricing-paygo) prices input above 512k tokens, which only makes sense if inputs beyond 512k are accepted. This makes the above-512k pricing tier reachable. * fix(bedrock): make document names unique across conversation turns (#30093) * fix(bedrock): make document names unique across conversation turns PR #16275 derived Bedrock document names purely from a content hash so that names stay deterministic for prompt caching. When the same PDF or document appears in more than one conversation turn, every occurrence gets the identical name and Bedrock rejects the request with "Messages can not contain duplicate document names". Add _rename_duplicate_bedrock_document_names, a post-pass over the assembled message blocks that keeps the first occurrence's hash-based name and appends a positional suffix (_2, _3, ...) to later occurrences. Apply it in both _bedrock_converse_messages_pt and _bedrock_converse_messages_pt_async. Names remain deterministic across requests and the first occurrence is unchanged, so prompt cache prefixes stay stable. Fixes #29418. * fix(bedrock): avoid suffix collisions with organic document names A renamed duplicate could collide with a document whose hash-derived name already ends in the same positional suffix (e.g. an organic report_2 next to two documents named report). Collect every document name up front and bump the suffix until the candidate is unused, so renames can collide neither with organic names nor with each other. * fix(_types): remove ResponsesAPIResponse from PassThroughEndpointLoggingResultValues The import of ResponsesAPIResponse was removed from the file but a usage was left in the Union type, causing a NameError on import and breaking all CI tests. Remove the stale reference to match the cleanup intent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(_types): restore ResponsesAPIResponse import and add use_xai_oauth to filter list Two related fixes: 1. Re-add ResponsesAPIResponse import in _types.py — it was removed but still needed in PassThroughEndpointLoggingResultValues (used in openai_passthrough_logging_handler.py). 2. Add use_xai_oauth to all_litellm_params so it is filtered before forwarding kwargs to providers like OpenAI that do not recognize it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Hari <kancharla.ha@northeastern.edu> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Ceder Dens <ceder.dens@uantwerpen.be> Co-authored-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Co-authored-by: 冯基魁 <56265583+fengjikui@users.noreply.github.com> Co-authored-by: victoruce <161634297+victoruce@users.noreply.github.com> Co-authored-by: kejunleng <33445544+silencedoctor@users.noreply.github.com> Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: Tyson Cung <45380903+tysoncung@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Jeremy Chapeau <113923302+jychp@users.noreply.github.com> Co-authored-by: Daan <255322319+daanhendrio@users.noreply.github.com> Co-authored-by: Avani Prajapati <143805019+Avani-prajapati@users.noreply.github.com> Co-authored-by: Kent <72616338+kingdoooo@users.noreply.github.com> Co-authored-by: daitran-tensormesh <dai@tensormesh.ai> Co-authored-by: Dimitris Spachos <dspachos@gmail.com> Co-authored-by: Liam Scott <liam@uilliam.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com> Co-authored-by: milan-berri <milan@berri.ai> Co-authored-by: ryan-crabbe-berri <ryan@berri.ai> Co-authored-by: michelligabriele <gabriele.michelli@icloud.com> Co-authored-by: tin-berri <tin@berri.ai> Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com> Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>	2026-06-10 10:34:07 -07:00
michelligabriele	2fe9feda71	fix(caching): restore stored prompt_tokens on embedding cache hits instead of recomputing (#30046 )	2026-06-10 15:49:20 +05:30
Mateo Wang	e15b37a18e	Add Claude Fable 5 across Anthropic, Bedrock, Vertex AI, and Azure AI (#30064 ) * Add Claude Fable 5 across Anthropic, Bedrock, Vertex AI, and Azure AI Adds cost map entries for claude-fable-5 ($10/$50 per MTok, 1M context, 128K output, adaptive thinking only) on the Anthropic API, Bedrock converse (base, global, and us/eu geo inference profiles at the 10% regional premium), Vertex AI, and Azure AI (Microsoft Foundry, which serves Fable 5 with the full 1M context window unlike Opus 4.8). Registers anthropic.claude-fable-5 in BEDROCK_CONVERSE_MODELS, lists the model in the setup wizard, and extends the reasoning effort e2e grid. The Bedrock, Vertex, and Azure grid cells carry fail_reason markers until the CI accounts are provisioned: Bedrock needs the provider data sharing opt-in Fable 5 requires, and the Foundry resource needs a claude-fable-5 deployment. The first-party entry carries provider_specific_entry {us: 1.1} for the inference_geo premium and deliberately no fast multiplier since Fable 5 has no fast mode. https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * Drop removed sampling params for Claude 4.7+ when drop_params is set Fable 5, Opus 4.7, and Opus 4.8 removed sampling params: the API rejects top_p, top_k, and any temperature other than 1 with a 400. LiteLLM was forwarding them even with drop_params enabled because the Anthropic and Bedrock converse transformations passed temperature/top_p through unconditionally. Mirror the GPT-5/o-series handling: temperature=1 still passes through, other values and any top_p are dropped when drop_params is set, and without drop_params a clean client-side UnsupportedParamsError tells the caller how to opt in, instead of surfacing the raw provider error. https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * Drive sampling param gating from the cost map and cover top_k Greptile review follow-ups on the sampling param fix: the restriction for Fable 5 / Opus 4.7 / 4.8 is now declared as supports_sampling_params: false on every affected cost map entry (perplexity excluded; that route is OpenAI-compatible and maps sampling params upstream) and read back through a tri-state map lookup, keeping the name check only as a fallback for provider-routed ids whose hosted map entries predate the flag, the same layering supports_adaptive_thinking uses. top_k bypasses map_openai_params as a provider-specific kwarg, so it is gated at the shared AnthropicConfig.transform_request boundary (direct, Bedrock invoke, Vertex, Azure) and in the Bedrock converse _handle_top_k_value path, with drop_params threaded through the converse transform helpers. Also updates the reasoning effort grid cell count assertion for the four Fable 5 rows added on this branch (29 x 11 cells). https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * Declare supports_sampling_params in the cost map schema The model map validation schema uses additionalProperties: false, so the new flag must be declared for the 28 entries that carry it; this was the one failing job (misc / Run tests) on the previous commit. https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm * fix(bedrock): gate top_k=0 on converse to match Anthropic boundary Truthiness check let top_k=0 silently disappear on models that removed sampling params, while AnthropicConfig.transform_request treats 0 as present and raises UnsupportedParamsError (or drops when drop_params is set). Switch to 'is not None' so converse, direct Anthropic, invoke, Vertex, and Azure all behave the same for top_k=0. --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com>	2026-06-10 08:50:15 +05:30
Sameer Kankute	2cd7e87485	fix(proxy): authorize batch files using upload target_model_names (LIT-3593) (#30009 ) * fix(proxy): authorize batch files using upload target_model_names (LIT-3593) After replace_model_in_jsonl, body.model is a stripped provider id. Reverse-mapping it via resolve_model_name_from_model_id is first-match on model_list and caused false 403s when multiple deployments share the same stripped name. Use target_model_names from the unified file id instead. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593) Restores the reverse-lookup for the JSONL body.model fallback path so that legacy/pre-target_model_names managed files still map stripped provider IDs back to proxy aliases before auth. Also cleans up redundant `or None`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Revert "fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593)" This reverts commit 30d2e96f77ef521ccaaf2193fe554980380eb669. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-10 08:22:15 +05:30
ryan-crabbe-berri	248176112e	feat(ui): add admin flag to disable in-product UI nudges for everyone (#29796 ) * feat(ui): add admin flag to disable in-product UI nudges for everyone Admins can now suppress the survey and Claude Code feedback popups for all users via a single disable_ui_nudges UI setting, instead of relying on each user dismissing them individually. * fix(ui): suppress nudges while ui settings are loading Gate nudgesDisabled on the ui-settings loading state so an admin with disable_ui_nudges on doesn't see the survey prompt flash, and the getInProductNudgesCall fetch doesn't fire, on a cold page load before the flag resolves. Falls back to showing nudges if the fetch errors. * test(ui): wrap CreateKeyPage test in QueryClientProvider page.tsx now calls useUISettings (react-query), which needs a QueryClient that layout.tsx supplies in production but the test did not. Add the provider and mock getUiSettings so the query resolves.	2026-06-09 17:45:42 -07:00
tin-berri	5b7063d194	fix(mcp): allow team access-group grants in OAuth authorize/token access check (#30041 ) * fix(mcp): honor team access-group grants in OAuth authorize/token access check * test(mcp): mock build_effective_auth_contexts in non-admin authorize tests for isolation	2026-06-09 14:19:11 -07:00
michelligabriele	fe60f9d0f1	fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through (#24232 ) * fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through * test: mock post_call_response_headers_hook in audio speech route tests	2026-06-09 22:10:23 +02:00
milan-berri	d84499e0f2	fix(team): reserve team budget raises for proxy admins on /team/update (#30030 ) The caller's PERSONAL max_budget was the wrong yardstick for /team/update: a team's spend ceiling has nothing to do with the admin's own key budget. That comparison was an unintended side effect of reusing _check_user_team_limits() (which exists for the /team/new path) and broke the UI, which re-sends the unchanged budget on every save. New behavior on /team/update for standalone teams: - A team admin (already authorized via _verify_team_access) may freely KEEP or LOWER the team budget, and change models/tpm/rpm, without being gated by their personal limits. - GROWING a team's spend ceiling is a budget-authority action reserved for proxy admins -> 403 for team admins. "Growing" covers both raising max_budget above the team's current finite value and removing the cap entirely (max_budget=null, detected via model_fields_set so an explicit null is distinguished from an omitted field). For a team that currently has no cap, setting a finite value is a restriction and is allowed. - Org-scoped teams remain governed by _check_org_team_limits() (capped by the org budget). Also reverts the #29525 existing_team_max_budget workaround in _check_user_team_limits() back to the create-only form; /team/new still enforces the creator's personal caps. docs(access_control): resolve the contradiction in the team-admin section — team admins can keep/lower the budget and manage rate limits/models, but cannot raise the team budget (proxy-admin only). tests: unit + behavior coverage for raise-blocked, cap-removal-blocked (team admin), raise/removal allowed (proxy admin), uncapped-team restriction allowed, keep/lower/resend allowed, and unchanged create-path guards. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-09 09:19:15 -07:00
tin-berri	51ba6e39cd	fix(mcp): load MCP tool configuration tools via the OBO/passthrough-aware GET path (#29960 ) * fix(ui): load MCP tool configuration tools via the OBO/passthrough-aware GET path * fix(mcp): admin-only include_disabled_tools so the settings UI shows toggled-off tools * fix(ui): repopulate MCP server edit form when server data loads after mount (OAuth return) * fix(ui): persist MCP OAuth token on save and return to the Settings tab after authorize * fix(ui): scope MCP OAuth callback to the initiating form so create and edit flows don't cross-talk * fix(ui): derive OAuth-return Settings tab via lazy state init instead of setState-in-effect * Fix MCP OAuth edit token handling --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com>	2026-06-08 19:58:51 -07:00
Sameer Kankute	424db6a980	feat(azure_ai): add MAI-Image-2.5 image generation support (#29688 ) * feat(azure_ai): add MAI-Image-2.5 image generation support Route azure_ai MAI models to /mai/v1/images/generations and map OpenAI size to width/height for the serverless API. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(azure_ai): address MAI image generation review feedback Validate unsupported size values, default width/height independently, add MAI-Image-2.5 pricing, and expand test coverage. @greptileai Co-authored-by: Cursor <cursoragent@cursor.com> * feat(azure_ai): add MAI image edit and expand model cost map Add MAI image edit support with usage normalization for Azure response format, and register MAI-Image-2.5-Flash and MAI-Image-2e pricing in the model map. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(azure_ai): validate MAI edit size by consuming map iterator Greptile: lazy map() never evaluated int() so values like 1024xabc passed through. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(azure_ai): normalize MAI usage in generation response handler Apply normalize_mai_image_usage before building ImageResponse so token-based cost calculation works when Azure returns num_output_tokens fields. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(azure_ai): narrow MAI edit size param type for mypy Co-authored-by: Cursor <cursoragent@cursor.com> * Fix Azure MAI image response handling * Fix MAI image generation base model routing * fix(azure_ai): preserve zero num_output_tokens in MAI usage normalization * fix(azure_ai): wrap MAI generation response JSON parsing in error handling * fix(azure_ai): build MAI image edit URL correctly for /mai/ root bases * fix(azure_ai): build MAI image generation URL correctly for /mai/ root bases --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>	2026-06-08 18:27:04 -07:00
tin-berri	92817cb65b	changing expires_in default to use actual slack return details (#29951 )	2026-06-08 18:13:06 -07:00
yuneng-jiang	1bbaf1c39d	fix(guardrails): read CrowdStrike AIDR identity from both metadata bags (#29991 ) Capture user_id and extra_info from metadata or litellm_metadata. The single-bag read dropped identity whenever a request carried a present litellm_metadata field (null or a user-supplied dict), since /chat/completions routes the authenticated identity into metadata while the guardrail read litellm_metadata first	2026-06-08 17:46:28 -07:00
milan-berri	411bd3da5b	feat(vantage): include organization metadata in FOCUS Tags export (#28184 ) * feat(vantage): include organization metadata in FOCUS Tags export Join LiteLLM_OrganizationTable when building Vantage/FOCUS export rows so organization_id and organization_alias appear in Tags for org-level filtering. Co-authored-by: Cursor <cursoragent@cursor.com> * test(focus): include api_requests in organization Tags tests FocusTransformer now requires api_requests after staging merge; add the column to test fixtures so integrations CI can run the Tags assertions. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-09 02:59:21 +03:00
yuneng-jiang	c24a3603d9	fix(team-management): delete a team's BYOK models when the team is deleted (#29977 ) A team's BYOK models (rows in LiteLLM_ProxyModelTable with model_info.team_id set) were left orphaned when the team was deleted; they lingered in the database and kept showing on the Models + Endpoints page. delete_team now removes them via a new delete_team_models helper that deletes the rows in one transaction and syncs the in-memory router only after that transaction commits, run before the team rows are deleted so a mid-flight failure never leaves the team gone with its models orphaned	2026-06-08 16:55:35 -07:00
Sameer Kankute	dfd6cbc514	fix(vertex): propagate Vertex AI metadata in streaming success callbacks (#29899 ) * fix(vertex): propagate Vertex AI metadata in streaming success callbacks Streaming calls assembled via stream_chunk_builder were missing vertex_ai_grounding_metadata and vertex_ai_url_context_metadata in standard_logging_object.response. Merge metadata from chunks into the assembled response and mirror non-streaming hidden_params on Gemini chunks. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(vertex): move streaming metadata merge into provider config hook Address review feedback by delegating assembled-stream metadata propagation to VertexGeminiConfig via BaseConfig.apply_assembled_streaming_response_metadata, and only write chunk hidden_params when metadata is non-empty. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(redaction): scrub Vertex provider metadata when message logging is off Clear vertex_ai_grounding_metadata and related fields from standard logging responses and assembled streaming ModelResponse objects so turn_off_message_logging cannot leak prompt-derived web search queries. Co-authored-by: Cursor <cursoragent@cursor.com> * Use assembled model for streaming metadata hook * Fix Vertex metadata redaction bypass in logging callbacks. Scrub Vertex provider fields from litellm_params.metadata.hidden_params during perform_redaction so streaming success_handler merges do not leak prompt-derived metadata when message logging is disabled. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix Vertex streaming metadata from hidden params * fix(vertex): mirror vertex_ai_safety_results on assembled streaming responses The non-streaming transform_response stores safety data under vertex_ai_safety_results, but the streaming path only wrote vertex_ai_safety_ratings. Assembled streaming responses therefore never carried vertex_ai_safety_results, so any consumer reading that field saw a silent difference between streaming and non-streaming calls. Set vertex_ai_safety_results alongside vertex_ai_safety_ratings in the shared stream metadata setter and add it to the assembled metadata field list so it propagates through stream_chunk_builder. * fix(streaming): log provider streaming metadata hook failures instead of swallowing them * refactor(vertex): share single Vertex metadata field tuple across redaction and streaming * refactor(vertex): move Vertex metadata redaction helpers into llms/vertex_ai --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>	2026-06-08 16:14:30 -07:00
milan-berri	1c881eee5d	fix(fireworks): enable tool calling for glm-5p1 in model cost map (#29697 ) glm-5p1 supports native tools on Fireworks; explicit false flags caused drop_params to strip tools and tool_choice before the provider request. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-08 15:54:19 -07:00
milan-berri	9ccda11919	fix(team_endpoints): don't block /team/update on unchanged team budget (#29525 ) On /team/update for a standalone (no-org) team, _check_user_team_limits() compared the request max_budget against the caller's personal max_budget whenever max_budget was present in the payload. A team admin whose personal budget is lower than the team's budget could not edit any field (tpm_limit, team name, etc.) because the UI re-sends the unchanged max_budget on every update, tripping the personal-budget check. Pass the team's current max_budget into _check_user_team_limits() and skip the personal-budget comparison when the incoming value is unchanged or lower than the team's current budget. Only genuine increases above the team's current budget are still validated against the caller's personal limit, so no over-relaxation. Proxy admins and the org-scoped path are unaffected. Adds two regression tests for the standalone update path (unchanged budget + tpm_limit change, and lowering the budget), both for a caller whose personal budget is below the team budget. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-09 01:14:24 +03:00
milan-berri	a7ecf6b5b1	feat(jwt-auth): opt-in fallback to DB team on unresolved JWT claim (#28913 ) * fix(jwt-auth): defer to single-team DB fallback on claim mismatch Extends the single-team DB fallback introduced in #26418 to two more cases where it previously could not run: * `find_and_validate_specific_team_id`: when `team_id_jwt_field` is configured and a claim value is present in the token but the team does not exist in the LiteLLM DB (HTTPException 404 from `get_team_object`), return `(None, None)` instead of raising — the auth_builder fallback then attributes the request to the user's single DB team. Only HTTPException is caught; other errors (e.g. "No DB Connected") still propagate. * `find_team_with_model_access`: when none of the `team_ids_jwt_field` groups resolve to a real LiteLLM team, return `(None, None)` instead of raising 403 so the same fallback path runs. If at least one group DID resolve to a team but none granted the requested model, the original 403 is preserved (legitimate access denial — not a claim mismatch). Tracked via the new `any_claim_team_resolved` flag. The strict `is_required_team_id` raise and `enforce_team_based_model_access` raise remain unchanged. Unit tests cover both new soft-fail paths and guard each preserved path (strict required, enforce_team_based, the preserved 403, and the non-HTTPException propagation). Co-authored-by: Cursor <cursoragent@cursor.com> * fix(jwt-auth): narrow HTTPException catch to 404 (greptile review) Address Greptile review comments on #28913: * `find_and_validate_specific_team_id`: re-raise HTTPException when `status_code != 404`, pinning the catch to the "team doesn't exist in db" path documented for `get_team_object`. A future change that introduces a different status code (e.g. 403 for a blocked team) will now propagate instead of silently falling through to the single-team DB fallback. * Add `test_find_and_validate_specific_team_id_non_404_http_exception_propagates` parametrised over 400 / 403 / 500 to lock in the contract. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(jwt-auth): gate claim-mismatch fallback behind opt-in flag The unresolved-team-claim fallback added in the previous commit weakened the strict claim-based authorization contract by default — an authenticated user whose JWT carries a stale or invalid team claim could still consume their single DB team's models/quota via the fallback. Gate both soft-fail paths in `find_and_validate_specific_team_id` and `find_team_with_model_access` behind a new opt-in flag `team_claim_fallback` on `LiteLLM_JWTAuth` (default False). Default-off preserves the pre-existing strict behavior. Operators who intentionally treat IdP team claims as advisory (e.g. machine tokens whose group claims live in a separate namespace from LiteLLM team_ids) opt in via config. Adds two regression tests guarding the default-off behavior. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-09 01:09:03 +03:00
yuneng-jiang	69a7bdb247	fix(model-management): allow deleting a BYOK model after its team is deleted (#29875 ) * fix(model-management): allow deleting a BYOK model after its team is deleted A team BYOK model (model_info.team_id set) became undeletable once its team was deleted: POST /model/delete ran can_user_make_model_call, which looked the team up and raised 400 "Team id=... does not exist in db" before the delete could run, so the model lingered on the Models + Endpoints page with no way to remove it. Drop the team-existence prerequisite from the delete path. When the model's team still exists the normal auth check runs unchanged; when it is gone a proxy admin may delete the orphan and any other caller gets a 403. The check is fail-closed, so a missing or errored team lookup can only block the delete or require an admin, never grant a non-admin access. Add/update/health keep their team-existence validation. * refactor(model-management): drop redundant team lookup on model delete Move the orphaned-team handling into can_user_make_model_call behind an allow_missing_team flag instead of pre-checking team existence in delete_model. The endpoint no longer issues its own litellm_teamtable lookup, so deleting a model whose team still exists hits the team table once instead of twice. The auth behavior is unchanged: a proxy admin can delete a model whose team was deleted, any other caller gets a 403, and add/update/health keep the strict "team must exist" validation.	2026-06-08 14:28:39 -07:00
Sameer Kankute	dfb68a23de	feat(galileo): add health check support for UI callback test (#29908 ) * feat(galileo): add health check support for UI callback test Register galileo in /health/services so the proxy UI callback connection test works. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(galileo): verify API key via /current_user health check Call Galileo's current_user endpoint so the UI callback test validates credentials against the provider. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(ui): regenerate schema.d.ts for galileo health service Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): return IntegrationHealthCheckStatus from async_health_check Fixes mypy assignment error in health_services_endpoint where response was narrowed to IntegrationHealthCheckStatus from earlier branches. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix Galileo logging to match Langfuse across all endpoint types. Stop skipping ingest when output is empty and log embeddings with a placeholder so embedding, speech, and other non-text responses are recorded like Langfuse. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): remove unreachable health-check guard and None output sentinel The use_v2_api flag is derived from bool(api_key), so the inner GALILEO_API_KEY check inside the v2 branch could never run; collapse the credential validation into the username/password path with a combined message. _serialize_galileo_output now returns an empty string for None, so _get_galileo_input_output_content always yields a str and the post-call None coalescing guard is no longer needed. * test(galileo): cover async_health_check failure paths and empty model response Add regression tests for the Galileo health check unhealthy branches (missing project id, missing base url, missing credentials, auth failure, and request exception) and for logging a model response with no choices, which now queues an empty output instead of being skipped. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>	2026-06-08 13:57:03 -07:00
Sameer Kankute	32c88ca74f	Litellm oss staging 080626 (#29932 ) * feat(bedrock_mantle): add SigV4/IAM auth to Responses API route (fixes #29665) (#29788) * feat(responses): add default no-op sign_request to BaseResponsesAPIConfig * feat(responses): call sign_request after body is final, send signed bytes when signed * feat(bedrock_mantle): add SigV4 sign_request via composed BaseAWSLLM (bearer path) * test(bedrock_mantle): cover SigV4 access-key, AssumeRole, body bytes, region/auth consistency * feat(bedrock_mantle): defer auth to sign_request; validate_environment no longer requires bearer * docs(bedrock_mantle): document SigV4 + Bearer auth on Responses route * test(responses): cover fake-stream signing order and mantle bearer arg/env precedence * fix(bedrock_mantle): wrap all botocore credential errors with both-paths guidance * fix(bedrock_mantle): catch specific credential errors, not all BotoCoreError, so STS transport failures are not masked * fix(bedrock_mantle): sign the compact Responses route too, not just create * fix(github-copilot): route per-model on /v1/responses based on model info (#29747) * feat(focus): add GCS destination for FOCUS export (#29751) * test: add failing tests for FocusGCSDestination * feat: add FocusGCSDestination reusing GCSBucketBase auth * feat: register FocusGCSDestination in factory; export from __init__ * fix(focus): preserve GCS_PATH_SERVICE_ACCOUNT when service_account_json not in config * style: apply Black formatting to gcs_destination and tests * style: apply Black formatting to factory.py * fix(bedrock): omit empty additionalModelRequestFields and system from Converse API payload (#29565) Amazon Nova Pro (and other strict Bedrock models) return 400 Malformed input request when additionalModelRequestFields: {} or system: [] are present in the payload. Both fields are optional in CommonRequestObject (total=False) and must be omitted rather than sent as empty structures. Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(proxy): recognize .cognitiveservices.azure.com as OpenAI-compatible in pass-through cost tracking (#29730) fix(proxy): recognize .cognitiveservices.azure.com as OpenAI-compatible Azure OpenAI resources created via the newer "Azure AI Foundry" / Cognitive Services pathway live on `.cognitiveservices.azure.com` subdomains, not the older `openai.azure.com`. Both are valid Azure OpenAI surfaces in production today. The OpenAI pass-through cost-tracking handler hard-codes only the older hostname in five places (four `is_openai__route` methods on OpenAIPassthroughLoggingHandler, plus is_openai_route on PassThroughEndpointLogging). As a result, calls from newer Azure deployments are silently classified as "not an OpenAI route", the dispatch into the cost-tracking handler is skipped, and tokens/cost never get extracted into LiteLLM_SpendLogs — the row gets written with prompt_tokens=0, completion_tokens=0, spend=0, model='unknown'. Reproduced 2026-06-04 against a real Azure OpenAI deployment on `.cognitiveservices.azure.com` proxied through LiteLLM v1.88.0. Fix: factor the hostname check into a single helper `_is_openai_compatible_host` listing all three recognized surfaces (api.openai.com, openai.azure.com, cognitiveservices.azure.com), and have all five call sites delegate to it. Purely additive — never weakens recognition for the originally-supported hostnames. Adds a test `test_is_openai_route_recognizes_cognitiveservices_azure_com` that exercises all four `is_openai__route` static methods against `.cognitiveservices.azure.com` URLs (positive cases per route + a small cross-route negative to confirm route-specific path matching still works on the new hostname). Out of scope for this PR (separate followup): - `openai_passthrough_handler` calls chat/completions `transform_response` on Responses API payloads (`output:` not `choices:`), which throws inside the dispatch and drops the SpendLogs row entirely. Recognized + tracked separately. * ci: trigger fresh run Empty commit to re-run checks. The previous auth-and-jwt failure was a transient HuggingFace Hub 429 rate-limit hitting tokenizer downloads in tests/proxy_unit_tests/test_custom_tokenizer_bug.py — unrelated to this PR's scope (hostname recognition in pass-through cost tracking). No code change. --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * fix(responses): preserve forced-function tool_choice name in Responses to Chat transform (#29812) The Responses API forces a specific function with a top-level name ({"type": "function", "name": "X"}), but _transform_tool_choice only handled the nested Chat Completions shape and fell through to returning "required" for the flat form, silently dropping the function name and degrading a forced function call to force-any-tool. Map the flat Responses shape to the nested Chat shape, keeping the "required" fallback when no name is present. * Preserve x-anthropic-billing-header system blocks for first-party Anthropic (#29584) * Preserve x-anthropic-billing-header system blocks for first-party Anthropic PR #20951 strips system blocks beginning with "x-anthropic-billing-header:" for every Anthropic target. That block is how the first-party Anthropic API recognizes Claude Code subscription (OAuth) traffic, so dropping it makes requests that carry only that block, such as the auto-mode tool-safety classifier, fail with a misleading 429 rate_limit_error; normal turns still work because they also carry the "You are Claude Code" identity block. Gate the strip behind should_strip_billing_metadata(), defaulting to False on the first-party AnthropicConfig and AnthropicMessagesConfig so the block is kept, and overridden to True on the providers that reach these transforms and reject the block (Bedrock platform, Vertex, Azure for the chat path; Minimax, Azure, DeepSeek for the messages path). Behavior for those providers is unchanged. * Strip billing header on Bedrock invoke and Vertex messages pass-through Two more subclasses reach the gated strip but inherited keep-by-default. AmazonAnthropicClaudeConfig (Bedrock invoke) calls AnthropicConfig.transform_request, which calls translate_system_message, and VertexAIPartnerModelsAnthropicMessagesConfig (Vertex messages pass-through) calls super().transform_anthropic_messages_request. Override should_strip_billing_metadata() to True on both. Add a parametrized test asserting the flag for every first-party base (False) and provider subclass (True), covering all overrides, plus a translate_system_message regression test for the Bedrock invoke path. * fix(cache): log hashed cache keys (#29890) * fix(ui): save routing groups as list (#29889) * Revert "fix(ui): save routing groups as list (#29889)" (#29928) This reverts commit 9b1f78ffa7a309cabe5e9a7ab5f94d1224d192c9. * feat(parasail): add Parasail as a JSON-configured OpenAI-compatible provider (#29842) * feat(parasail): add Parasail as a JSON-configured OpenAI-compatible provider Registers parasail in the openai_like JSON provider loader with both /v1/chat/completions and /v1/responses support. Parasail's Responses API rejects store:true and any request that omits store, so the loader gains a force_store_false special_handling flag; the parasail entry sets it and the generated Responses config overrides store=false on every call. This keeps callers from hitting "State storage not supported" and matches what Parasail's docs require. Adds the PARASAIL enum value, listing under openai_compatible_providers, provider documentation at docs/my-website/docs/providers/parasail.md, and a focused unit test file under tests/test_litellm/llms/parasail/ that covers JSON registration, chat URL construction, Responses URL construction with PARASAIL_API_BASE override, and the force_store_false regression in both the caller-sent-store=true and caller-omitted cases. * fix(parasail): register in provider_endpoints_support, drop in-repo docs Greptile review feedback. The provider doc belongs in the litellm-docs repo, not this one's docs/my-website tree; removing it here. Adds the parasail entry to provider_endpoints_support.json so the check_provider_folders_documented.py CI check passes (chat_completions and responses true; others false). * fix: normalize Anthropic passthrough server tool usage (#29827) * test(anthropic): cover server_tool_use dict cost tracking * fix: normalize Anthropic server tool usage (cherry picked from commit 982f726bed7d3ec05e463c5dd3d090bebae91d19) * fix: keep server tool usage subscriptable (cherry picked from commit 70280b9b272455b2f974d08bc697f67f929755bf) --------- Co-authored-by: Genmin <joey@joeyroth.com> * fix(proxy): fix typo generic_role_mappoings -> generic_role_mappings in ui_sso.py (#29753) Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * feat(proxy): add disable_budget_reservation general setting (#27639) (#29493) * feat(proxy): add disable_budget_reservation general setting (#27639) * feat(proxy): register disable_budget_reservation in ConfigGeneralSettings (#27639) * docs(proxy): document disable_budget_reservation concurrency tradeoff (#27639) * ci: re-trigger flaky docker build (prisma generate ECONNRESET) * fix(proxy): warn and document budget enforcement tradeoff when disable_budget_reservation is set (#27639) * feat(gemini_tts): adding support to Gemini TTS languageCode parameters (#29623) * Adding support to Gemini TTS Language Code parameters * Mapping Gemini TTS languageCode param in Docstring * Use snake_case for language_code input keyMapping Gemini TTS languageCode param in Docstring * Restoring files modified under enterprise/litellm_enterprise due to lint/formatting checks --------- Co-authored-by: João Garrido <joaogarrido@google.com> * feat(guardrails): capture user and model metadata in CrowdStrike AIDR (#29517) * fix(proxy): require OpenAI path segment for shared Azure Cognitive Services domains Address Greptile review: the `.cognitiveservices.azure.com` / `.openai.azure.com` domains are shared by every Azure Cognitive Service (Speech, Vision, Language, ...), so a hostname-only substring match misclassified non-OpenAI Azure traffic as OpenAI routes. - Replace the substring host test with suffix matching (rejects look-alike domains like cognitiveservices.azure.com.attacker.example). - Add `_is_openai_compatible_url` that requires an OpenAI-style path marker (`/openai/` or `/v1/`) on the shared Azure domains, and use it in PassThroughEndpointLogging.is_openai_route (previously hostname-only). - Add negative tests for Azure Speech/Vision paths and look-alike domains. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix: support Responses input in Redis semantic cache (#29581) * fix: support responses input in redis semantic cache * test: cover redis semantic prompt extraction * test: handle blank redis semantic text fallbacks * chore: remove async cache dead statement * test: cover redis semantic cache miss paths * fix: filter sensitive cache lookup kwargs * chore: rerun ci after huggingface rate limit * chore(ui): regenerate dashboard API types (npm run gen:api) Sync src/lib/http/schema.d.ts with the proxy OpenAPI spec: adds the disable_budget_reservation general-settings field and picks up the RateLimitError docstring reindent. Fixes the gen:api CI drift check. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test(bedrock): assert empty additionalModelRequestFields is omitted The Converse transformer now drops an empty additionalModelRequestFields block instead of sending it as `{}`. Update test_bedrock_top_k_param so models without top_k support (llama3) assert the key is absent rather than equal to an empty dict. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Kent <72616338+kingdoooo@users.noreply.github.com> Co-authored-by: codgician <15964984+codgician@users.noreply.github.com> Co-authored-by: Praveen Ghuge <95286176+pghuge-cloudwiz@users.noreply.github.com> Co-authored-by: Roi <roytev@gmail.com> Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Liam Scott <liam@uilliam.com> Co-authored-by: abhay23-AI <abhaytrivedi22@gmail.com> Co-authored-by: Ceder Dens <cederdens@gmail.com> Co-authored-by: 冯基魁 <56265583+fengjikui@users.noreply.github.com> Co-authored-by: Kai Huang <kaihuang724@gmail.com> Co-authored-by: rinto <54238243+ririnto@users.noreply.github.com> Co-authored-by: Genmin <joey@joeyroth.com> Co-authored-by: Arnav Bhilwariya <arnavbhilwariya0408@gmail.com> Co-authored-by: Armaan Sandhu <74664101+Ar-maan05@users.noreply.github.com> Co-authored-by: João Garrido <48538534+johngarrido@users.noreply.github.com> Co-authored-by: João Garrido <joaogarrido@google.com> Co-authored-by: Kenan Yildirim <kenan@kenany.me> Co-authored-by: Dávid Balatoni <balcsida@gmail.com>	2026-06-08 13:49:52 -07:00
Sameer Kankute	f5b11b72a6	feat(proxy): publish /v2/model/info in Swagger OpenAPI spec (#29900 ) * feat(proxy): publish /v2/model/info in Swagger OpenAPI spec Expose the v2 model info endpoint in /docs by removing include_in_schema=False and documenting query parameters used by the admin UI and proxy CLI consumers. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(ui): regenerate schema.d.ts for /v2/model/info OpenAPI docs Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-08 09:33:35 -07:00
Yassin Kortam	5e2db7eee4	feat(litellm): add models and repository layers (#29686 )	2026-06-06 20:59:33 -07:00
Mateo Wang	13924fa1d6	feat: standardize rate limit errors with category, rate_limit_type, model, and llm_provider fields (#27687 ) * feat(exceptions): add RateLimitErrorCategory + headers/detail fields on RateLimitError LiteLLM previously surfaced rate-limit conditions through several unrelated error classes (RateLimitError, FastAPI HTTPException(429), BaseLLMException). This commit adds the data model needed to consolidate them under a single class: * RateLimitErrorCategory enum exposing four categorical values (vendor_rate_limit, vendor_batch_rate_limit, litellm_rate_limit, litellm_batch_rate_limit) so callers can switch on the rate-limit source. * New optional fields on RateLimitError: - category (defaults to vendor_rate_limit, preserving today's behavior for every existing call site in exception_mapping_utils); - headers (preserves retry-after / rate_limit_type / reset_at across the proxy boundary instead of dropping them on the floor); - detail (mirrors FastAPI HTTPException.detail so the same instance can be serialized through both paths). litellm.RateLimitErrorCategory is re-exported at the package root to match the existing exception-export pattern. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * feat(proxy): add ProxyRateLimitError unifying RateLimitError + HTTPException Adds a single proxy-side error class that subclasses BOTH litellm.exceptions.RateLimitError AND fastapi.HTTPException via cooperative multiple inheritance. Why both bases: * Subclassing RateLimitError lets user code catch every rate-limit source with one 'except RateLimitError' and switch on the new .category field. * Subclassing HTTPException keeps every existing FastAPI plumbing path (the isinstance(e, HTTPException) branches in proxy_server.py route handlers, FastAPI's own dispatcher, and tests asserting pytest.raises(HTTPException)) working without modification, and preserves retry-after / rate_limit_type / reset_at headers on the wire. The class declaration order is (HTTPException, RateLimitError) so the MRO puts HTTPException's no-super-call __init__ ahead of openai's cooperative __init__ chain — preventing openai.APIError.super().__init__(message) from landing in HTTPException.__init__(status_code=message). LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * refactor(proxy/hooks): raise ProxyRateLimitError from budget + iteration limiters Replaces three bare HTTPException(status_code=429, ...) call sites with ProxyRateLimitError, which is both a RateLimitError (catchable by category) and an HTTPException (preserves existing FastAPI serialization). Drops the now-unused HTTPException import in the iteration / per-session limiters. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * refactor(proxy/hooks): raise ProxyRateLimitError from parallel-request limiters Replaces HTTPException(status_code=429, ...) call sites in the v1 and v3 parallel-request limiters (key/team/user/model/customer rate limits) with ProxyRateLimitError. Updates the raise_rate_limit_error helper's return type annotation accordingly. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * refactor(proxy/hooks): raise ProxyRateLimitError from dynamic rate limiters Replaces HTTPException(status_code=429, ...) call sites in the v1 and v3 dynamic rate limiters (project-level TPM/RPM allocation, model-saturation checks, priority-based limits, fail-closed guards) with ProxyRateLimitError. The v3 limiter still imports HTTPException for an unrelated bare 'except HTTPException:' branch. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * refactor(proxy/hooks): raise ProxyRateLimitError from batch rate limiter Replaces HTTPException(status_code=429, ...) in batch_rate_limiter._raise_rate_limit_error with ProxyRateLimitError tagged as RateLimitErrorCategory.LITELLM_BATCH_RATE_LIMIT so users can distinguish batch-level throttling (which counts requests/tokens across an uploaded batch input file before submission) from the generic key/team/user RPM/TPM limiter. The HTTPException import is retained because the same module raises HTTPException for unrelated 403/IO error paths. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(rate-limit): pin down unified rate-limit error contract Adds a dedicated test module covering the new RateLimitErrorCategory enum, RateLimitError.category default + override behavior, ProxyRateLimitError's dual nature (RateLimitError + HTTPException), and a parametrized regression guard that asserts every proxy hook module imports the unified class. The regression guard catches the failure mode the refactor is designed to prevent: someone re-introducing a bare HTTPException(status_code=429, ...) in one of the hook modules instead of going through ProxyRateLimitError. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * feat(logging): expose rate-limit category via StandardLoggingPayload Adds an optional 'error_rate_limit_category' field to StandardLoggingPayloadErrorInformation, populated from the unified RateLimitError.category attribute (introduced in the previous commits on this branch). Why: the .category attribute is reachable off the raw exception today via getattr(e, 'category', None), but the structured contract that downstream custom callbacks / loggers / spend log writers consume is the StandardLoggingPayload. Without this field, a user building custom rate-limit metrics on top of callback data has to special-case the raw exception object — which defeats the purpose of the StandardLoggingPayload abstraction. The field is None for non-rate-limit exceptions (so consumers can read it unconditionally without isinstance checks) and is one of the RateLimitErrorCategory string values otherwise. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(rate-limit): assert StandardLoggingPayload carries the category Five tests covering: vendor default, explicit litellm_rate_limit and litellm_batch_rate_limit values, None for non-rate-limit exceptions, and None when no exception is provided. Pins down the contract that custom callbacks can read 'error_information.error_rate_limit_category' off the StandardLoggingPayload to drive custom rate-limit metrics without ever reaching for the raw exception. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(types): silence mypy [misc] on intentional dual-base attr overlap mypy emits two [misc] errors on the ProxyRateLimitError class line because its two bases declare overlapping attributes with related-but-not-identical annotations: * status_code: int on starlette HTTPException vs. Literal[429] on openai's RateLimitError (every openai status-error subclass narrows it the same way and silences pyright with the same convention). * headers: Mapping[str, str] \| None on HTTPException vs. our Optional[ Dict[str, str]] (the proxy hooks always carry a stringified dict). Both narrowings are intentional and enforced at construction time. Add a type: ignore[misc] with an inline explanation rather than relax the annotations on the parent or change the wire-format guarantees. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(rate-limit): add direct hook-invocation tests to lift patch coverage Adds six end-to-end tests that drive each refactored hook past its limit and assert the unified ProxyRateLimitError is raised with the correct category and dual-base shape. Complements the import-shape-only parametrized guard above by actually executing the new 'raise ProxyRateLimitError(...)' lines so codecov's patch coverage sees them as hit. Hooks covered (one test each): * parallel_request_limiter v1 — direct call to raise_rate_limit_error() * parallel_request_limiter v3 — direct call to _handle_rate_limit_error with a fabricated OVER_LIMIT response * max_iterations_limiter — full async_pre_call_hook with mocked agent registry, second call exceeds budget=1 * max_budget_limiter — async_pre_call_hook with mocked get_current_spend * dynamic_rate_limiter v1 — async_pre_call_hook with mocked check_available_usage forcing available_tpm == 0 * batch_rate_limiter — direct _raise_rate_limit_error call, asserts category is the batch-specific LITELLM_BATCH_RATE_LIMIT (not the generic LITELLM_RATE_LIMIT) LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix: guard rate_limit_category extraction with isinstance check * test(rate-limit): cover remaining hook raise sites for codecov Adds five more direct hook-invocation tests so every PR-touched line in the proxy hooks is exercised by tests in tests/test_litellm/, which codecov measures: * parallel_request_limiter v1 — check_key_in_limits inline raise (the second raise site, separate from the raise_rate_limit_error helper covered earlier) * dynamic_rate_limiter v1 — RPM raise branch (TPM branch was already covered) * dynamic_rate_limiter v3 — parametrized over all three raise sites: model_saturation_check, priority_model, and the fail-closed fallback for an unrecognized descriptor_key * max_budget_per_session_limiter — full async_pre_call_hook with a mocked agent registry and over-budget cached spend All 42 tests in test_rate_limit_error_unification.py now pass and together exercise every changed import + raise line across the eight refactored proxy hooks. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix: use computed error_message in ProxyRateLimitError detail * fix(parallel-request-limiter): drop None from detail; annotate raise_rate_limit_error as NoReturn The v1 ' raise_rate_limit_error' helper built an unused 'error_message' variable and then assembled the actual ' detail' via an f-string that interpolated 'additional_details' verbatim — producing 'Max parallel request limit reached None' when invoked without arguments (flagged by code review). Fix the helper to: - use the constructed 'error_message' as the detail - annotate the helper as NoReturn since it always raises - drop the redundant 'raise'/'return' at the two call sites Add two regression tests covering both the with- and without- additional_details paths. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(proxy/hooks): drop literal 'None' from raise_rate_limit_error detail The v1 parallel_request_limiter's raise_rate_limit_error helper has a long-standing bug: it computes a None-guarded 'error_message' string but then ignores it and emits an f-string that interpolates the raw 'additional_details' arg. Callers that pass no argument get 'Max parallel request limit reached None' as the user-facing detail. This commit: * wires error_message into the detail kwarg so the None-guard actually applies and operators see a clean message; * changes the return-type annotation from ProxyRateLimitError to NoReturn (the function always raises) so type-checkers know callers after this invocation are unreachable. Greptile P1 + P2 review feedback on PR #27687. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(types): demote TypedDict floating string to a # comment A string literal placed after a field declaration in a TypedDict body is not a per-field docstring — it's an orphaned string expression Python discards. Tools like mypy / pyright that inspect TypedDict fields won't surface that text either. Move the documentation for error_rate_limit_category to a real comment so the intent is visible to readers and type-checker tooling without the misleading docstring framing. Greptile P2 review feedback on PR #27687. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * security(exceptions): do not auto-copy vendor response headers to e.headers A vendor 429 response can set arbitrary headers (Set-Cookie, CORS overrides, …). Previously, when RateLimitError was constructed with only a 'response=' (no explicit 'headers=' kwarg), self.headers fell back to a copy of response.headers. If a downstream proxy serializer ever forwarded e.headers to the client, a malicious upstream could inject browser-interpreted headers for the proxy origin. Drop the fallback. Only headers passed explicitly via the headers= kwarg make it onto self.headers (proxy hooks pass retry-after etc. — they control what's surfaced). Vendor response headers stay reachable on e.response.headers for callers that explicitly want them. Today's proxy_server.py route handlers don't actually forward e.headers on the wire (they construct ProxyException without passing headers), so no current behavior changes — this is a defensive narrowing so the fallback can never be turned into a vector when someone wires e.headers through later. Veria-AI security review feedback on PR #27687. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(rate-limit): regression guards for review-pass fixes Pins down the three review-pass fixes: * test_parallel_request_limiter_v1_helper_no_additional_details — calls raise_rate_limit_error() with no args and asserts the detail does NOT contain the literal string 'None'. Pre-fix, callers got 'Max parallel request limit reached None'. * test_rate_limit_error_does_not_auto_copy_response_headers — passes a vendor httpx.Response with a Set-Cookie header to RateLimitError WITHOUT an explicit headers= kwarg, asserts self.headers stays None (no leak), then re-checks that an explicit headers= kwarg DOES populate self.headers. Vendor headers remain reachable on e.response.headers for callers that explicitly want them. * The existing v1-helper test now also asserts the additional_details string makes it through to the detail. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * feat(rate-limit): add orthogonal RateLimitType (requests/tokens/concurrent_requests/budget/max_iterations) trho's last ask in the LIT-2968 thread: distinguish rate-limit failures by the dimension that was exceeded, not just by who rate-limited (vendor vs. litellm). Adds: - RateLimitType str-enum exposed at `litellm.RateLimitType` with values requests / tokens / concurrent_requests / budget / max_iterations. - `rate_limit_type` kwarg on litellm.RateLimitError + ProxyRateLimitError; None default so existing callers (vendor-429 path in exception_mapping_utils) remain a no-op. - StandardLoggingPayloadErrorInformation.error_rate_limit_type so custom callbacks can split rate-limit failures by cause without parsing free-text error messages. Mirror to error_rate_limit_category extraction in get_error_information(); single isinstance(RateLimitError) check covers both. - map_v3_rate_limit_type() helper to collapse the v3 limiter's internal labels ("requests", "tokens", "max_parallel_requests") onto the public enum so the v3 limiter and dynamic_rate_limiter_v3 share one mapping. Defensive None on unknown values rather than silently picking a wrong dimension. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * feat(proxy/hooks): wire rate_limit_type onto every limiter raise site Each refactored proxy hook now populates rate_limit_type with the dimension that actually tripped the limit, so downstream consumers (custom callbacks, prometheus exporters via the StandardLoggingPayload) can split key/team/user rate-limit failures by cause: - parallel_request_limiter (v1): detect dimension from current vs. limit in the post-cache branch (concurrent_requests > tokens > requests, matches the boolean condition order). Base case (current is None, one limit set to 0) picks the most-specific zero. raise_rate_limit_error() helper accepts an explicit rate_limit_type kwarg with CONCURRENT_REQUESTS default (matches every existing internal call site, including the global-limit branch). - parallel_request_limiter (v3): forward status["rate_limit_type"] through map_v3_rate_limit_type() so "max_parallel_requests" → CONCURRENT_REQUESTS for the public field while the raw v3 jargon stays on the HTTP header for wire-format backward compat. - dynamic_rate_limiter (v1): TPM-zero → TOKENS, RPM-zero → REQUESTS. Pass data["model"] through so callbacks see the model that hit the limit (addresses the secondary "provider missing" complaint in the original Slack thread, partially — the model is what dashboards typically split on). - dynamic_rate_limiter (v3): forward status["rate_limit_type"] via map_v3_rate_limit_type() at every raise site (model_saturation_check, priority_model, fail-closed unknown-descriptor guard). Also pass model. - batch_rate_limiter: limit_type is hard-typed "requests"\|"tokens" — map directly without going through the helper's None branch. - max_budget_limiter, max_budget_per_session_limiter: BUDGET. - max_iterations_limiter: MAX_ITERATIONS. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(rate-limit): cover RateLimitType enum, hook wiring, and StandardLoggingPayload propagation 27 new tests across five new test classes: - TestRateLimitType: enum exposed at litellm.RateLimitType, all five values defined, RateLimitError default is None (vendor 429 path makes no claim about which dimension), accepts both string and enum forms with str-coercion guarantee for downstream JSON serializers. - TestProxyRateLimitErrorType: ProxyRateLimitError default is None, accepts string or enum, doesn't break existing callers that pass nothing. - TestMapV3RateLimitType: pins each v3-internal → public-enum mapping (tokens, requests, max_parallel_requests → concurrent_requests, unknown → None) so a future v3 refactor can't silently swap dimensions. - TestStandardLoggingPayloadCarriesType: the new error_rate_limit_type field reaches the structured payload for both ProxyRateLimitError and plain RateLimitError, is None when unspecified, and is None for non-rate-limit exceptions (symmetric with error_rate_limit_category). - TestProxyHooksWireTypeCorrectly: drives the actual raise sites in the v1 parallel_request_limiter helper, the v3 _handle_rate_limit_error (both "tokens" and "max_parallel_requests" paths), and the batch limiter (both tokens and requests paths) — coverage tools see the new rate_limit_type= kwargs as exercised, not just the import shape. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(rate-limit): cover _coerce_message branches and v1 dimension detection Drives the patch coverage on the new orthogonal RateLimitType wiring up to (or close to) 100% on the touched files. ProxyRateLimitError._coerce_message — was 22% covered, now 100%: * nested {error: {message}} dict * nested {message: {message}} dict (alt key) * dict without 'error'/'message' keys → JSON dump fallback * non-JSON-serializable dict value → str() fallback * non-string non-mapping detail (int) → str() coercion v1 parallel_request_limiter dimension detection — was 0% covered, now exercised across 6 parametrized cases: * check_key_in_limits else-branch: current at concurrent / TPM / RPM cap → asserts rate_limit_type is concurrent_requests / tokens / requests. * check_key_in_limits base case (current is None): max_parallel_requests / tpm_limit / rpm_limit set to 0 → asserts the most-specific zero attribution wins per the helper's order. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * feat(proxy/hooks): add ProxyHTTPRateLimitError + provider resolver Introduces a small helper layer used by every proxy-side rate-limit hook so that the 429 they raise carries a populated llm_provider / model — instead of an empty exception.llm_provider that downstream loggers (Prometheus failure metric, observability callbacks) read as 'no provider attribution'. ProxyHTTPRateLimitError inherits from both fastapi.HTTPException (so the proxy server still renders it as a 429) and litellm.exceptions.RateLimitError (so isinstance checks and PrometheusLogger._get_exception_class_name pick up llm_provider). We deliberately don't call RateLimitError.__init__ — it constructs an httpx.Response we don't need and would just add failure surface; attribute parity is what downstream consumers care about. resolve_llm_provider_for_rate_limit() wraps litellm.get_llm_provider defensively. Internal limiter hooks fire from async_pre_call_hook — well before get_llm_provider runs anywhere else in the request lifecycle — so we have to call it ourselves at raise time. If the model is missing or unparseable (alias, router-only model) we fall back to llm_provider='litellm_proxy' rather than letting a second exception leak out and break the request path. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(proxy/hooks): populate llm_provider on parallel-request 429s Both v1 and v3 parallel-request limiters fired bare HTTPException(429) from inside async_pre_call_hook. The downstream Prometheus failure metric reads exception.llm_provider via _get_exception_class_name — the empty value showed up as exception_class='HTTPException' and left model_id='None' on the time series. Threads requested_model through every raise site in: * parallel_request_limiter.py: - check_key_in_limits (the per-key/per-model/per-user/per-team/ per-customer over-limit path) - raise_rate_limit_error (zero-limit + global_max_parallel_requests paths) — now takes an optional requested_model kwarg * parallel_request_limiter_v3.py: - _handle_rate_limit_error (the OVER_LIMIT translator), called from both the should_rate_limit pre-check and the TPM reservation path Resolved via resolve_llm_provider_for_rate_limit so unknown / missing models silently fall back to llm_provider='litellm_proxy' instead of breaking the request path with a second exception. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(proxy/hooks): populate llm_provider on dynamic-rate-limit 429s Same plumbing change as the parallel limiters, applied to both dynamic_rate_limiter (v1) and dynamic_rate_limiter_v3: * v1: TPM-zero and RPM-zero paths in async_pre_call_hook now resolve data['model'] -> (model, llm_provider) once and pass it into both raises. * v3: All three raise sites in _check_rate_limits — the model_saturation_check enforced raise, the priority_model enforced raise, and the fail-closed unknown-descriptor branch — now attribute the 429 to the actual provider. Falls back to llm_provider='litellm_proxy' when the model can't be resolved. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(proxy/hooks): populate llm_provider on batch-rate-limit 429s batch_rate_limiter._raise_rate_limit_error now takes a requested_model kwarg threaded from data['model'] in _check_and_increment_batch_counters. The batch-creation 429 is what gets raised when the input file's tokens/requests count would push the per-key TPM/RPM window over its limit. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(proxy/hooks): populate llm_provider on budget/iterations 429s Final batch of internal raise sites — the user/session-budget and max-iterations hooks. Same pattern: resolve data['model'] once at raise time, attach to ProxyHTTPRateLimitError so Prometheus and observability callbacks can attribute the 429. Hooks updated: * max_budget_limiter (per-user max_budget exceeded) * max_iterations_limiter (per-session agent iteration cap) * max_budget_per_session_limiter (per-session dollar cap) All three fall back to llm_provider='litellm_proxy' when data['model'] is missing or unparseable. Drops the now-unused HTTPException import from each module. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(proxy/hooks): pin provider field on internal rate-limit 429s Regression coverage for the 'provider field missing' bug across every proxy-side rate-limit hook + the helper layer: * ProxyHTTPRateLimitError class shape (HTTPException + RateLimitError, dict-detail stringification, None-provider normalization). * resolve_llm_provider_for_rate_limit happy paths (gpt-4o-mini, anthropic/..., bedrock/...) plus all three fallback branches (None, '', unknown name) plus a 'get_llm_provider raises' case that asserts we swallow the secondary exception. * For each limiter (parallel v1/v3, dynamic v1/v3, batch, max_budget, max_iterations, max_budget_per_session): assert the raised exception is a RateLimitError carrying the resolved model + llm_provider, and a sibling test that asserts the fallback path returns 'litellm_proxy' without leaking a second exception. * Two PrometheusLogger._get_exception_class_name pins so the Prometheus failure metric label flips from 'HTTPException' to 'Openai.ProxyHTTPRateLimitError' (or 'Litellm_proxy.' on fallback) — that's what dashboards consume. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> perf(proxy/hooks): defer provider resolution to over-limit branches * fix: use error_message in raise_rate_limit_error to avoid literal 'None' in detail * Consolidate rate_limiter_utils imports in dynamic_rate_limiter * fix(proxy): set num_retries/max_retries on ProxyHTTPRateLimitError ProxyHTTPRateLimitError inherits from RateLimitError but did not call RateLimitError.__init__, so num_retries/max_retries were never set. When Starlette's HTTPException lacks __str__, MRO falls through to RateLimitError.__str__, which unconditionally reads these attributes and raises AttributeError during logging/traceback formatting. Initialize them to None defensively. * fix(mypy): silence base-class status_code conflict on ProxyHTTPRateLimitError HTTPException declares 'status_code: int' while openai.RateLimitError (via APIStatusError) declares 'status_code: Literal[429] = 429'. Mypy flags the multi-base override as [misc] in CI lint. The runtime semantics are fine (we set self.status_code in __init__), so silence the class-level annotation conflict with a targeted ignore. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix: annotate batch limiter _raise_rate_limit_error as NoReturn * feat(prometheus): rate-limit category/type labels + exception_class back-compat (follow-up to #27687) (#27706) * feat(prometheus): add rate_limit_category and rate_limit_type labels Adds two new labels to litellm_proxy_failed_requests_metric so dashboards can split 429s by rate-limit source (vendor vs. litellm-internal) and by the dimension that was exceeded (requests/tokens/concurrent_requests/ budget/max_iterations) without parsing free-text error messages. Closes the Prometheus side of LIT-2718. The unified RateLimitError.category and .rate_limit_type fields landed in PR #27687 but were only surfaced on StandardLoggingPayload (custom-callback channel); this exposes them on the metric label set as well. Both labels are populated only when the underlying exception is a litellm.RateLimitError; non-rate-limit failures keep them empty. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * feat(prometheus): populate rate-limit labels + preserve exception_class back-compat Two coupled changes in the Prometheus integration: 1. async_post_call_failure_hook now extracts the new RateLimitError .category / .rate_limit_type fields (added in PR #27687) via a _extract_rate_limit_labels helper and forwards them through UserAPIKeyLabelValues onto litellm_proxy_failed_requests_metric. Empty for non-rate-limit failures. 2. _get_exception_class_name special-cases ProxyRateLimitError and keeps emitting 'HTTPException' for the exception_class label. Without this shim, ProxyRateLimitError (which multi-inherits from HTTPException + RateLimitError) would silently flip the label from 'HTTPException' (the historical value for proxy-side 429s) to 'ProxyRateLimitError', breaking existing dashboards / alerts that key off exception_class='HTTPException'. Distinguishing vendor vs. litellm 429s is now the job of the new rate_limit_category label. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(prometheus): cover rate-limit labels and exception_class back-compat Adds 19 tests across: - enum / label-list registration - _extract_rate_limit_labels for vendor RateLimitError, ProxyRateLimitError, non-rate-limit and None inputs (incl. parametrized over every RateLimitErrorCategory x RateLimitType combo) - _get_exception_class_name back-compat: ProxyRateLimitError keeps the legacy 'HTTPException' string while vendor RateLimitError keeps the historical 'Provider.ClassName' format - end-to-end through async_post_call_failure_hook with both ProxyRateLimitError and vendor RateLimitError, asserting both new labels populate and exception_class stays back-compat Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(prometheus): tolerate missing fastapi in lazy ProxyRateLimitError import Address greptile feedback: - async_post_call_failure_hook docstring: drop the stale labelnames listing and reference PrometheusMetricLabels.litellm_proxy_failed_requests_metric as the source of truth so the doc cannot drift from the actual labelset. - _get_exception_class_name: guard the lazy ProxyRateLimitError import with ImportError so router-side fallback callsites don't blow up in non-proxy installs that don't have fastapi (a transitive dep of proxy.common_utils.proxy_rate_limit_error). Behavior is unchanged when fastapi is available. Also fix the existing enterprise callback test that asserted the old labelset on litellm_proxy_failed_requests_metric — it now expects the new rate_limit_category / rate_limit_type labels populated for vendor 429s. --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(bugbot): simplify rate-limit label coercion + guard None detail - prometheus.py _extract_rate_limit_labels: RateLimitError.__init__ already normalizes category/rate_limit_type to plain str, so the getattr(.value) + isinstance dance was dead code. Reduce to str(value) if not None. - proxy_rate_limit_error.py _coerce_message: short-circuit None to '' instead of falling through to str(None) = 'None', which produced the literal message 'litellm.RateLimitError: None'. * fix(rate-limit): surface unified category/type fields on BudgetExceededError The most common budget cap (virtual-key max_budget enforcement in auth_checks.py) raises litellm.BudgetExceededError, a bare Exception subclass that bypassed the unified rate-limit error class introduced by PR #27687. Custom callbacks reading StandardLoggingPayload.error_information saw category=None and rate_limit_type=None for these 429s, missing the most common budget case (team / org / end-user budgets all hit the same code path). Surface the fields off BudgetExceededError as plain attributes: - category = RateLimitErrorCategory.LITELLM_RATE_LIMIT - rate_limit_type = RateLimitType.BUDGET - llm_provider = "" (or caller-supplied) Switch get_error_information and _extract_rate_limit_labels from isinstance(RateLimitError) gating to duck-typed attribute reads, guarded by membership in the rate-limit enums so unrelated third-party exceptions exposing a .category attribute can't leak garbage values into the payload. This is strictly additive: BudgetExceededError keeps its bare-Exception base class, so `except BudgetExceededError:` handlers keep firing and `except RateLimitError:` does not start catching budget errors. * fix(rate-limit): validate enum membership at duck-typed read sites + enrich BudgetExceededError llm_provider Two follow-ups uncovered during the second QA pass on PR #27687: 1. Guard third-party `.category` / `.rate_limit_type` attribute leakage. The duck-typed read in `get_error_information` and `_extract_rate_limit_labels` would forward any string attribute named `category` / `rate_limit_type` on an unrelated third-party exception into the StandardLoggingPayload and Prometheus labels — silently mislabeling custom-callback payloads and blowing out Prometheus label cardinality. Add `validate_rate_limit_category` / `validate_rate_limit_type` helpers that gate on the documented enum value sets; non-matching values are dropped to None. 2. Enrich BudgetExceededError.llm_provider from request_data. Budget checks live in tenant-scoped helpers (key / team / org / tag / end-user / project) that don't see the request model, so the BudgetExceededError they raise carried llm_provider="" — leaving custom-metrics consumers without provider attribution for the most common 429 case. Resolve it once at the central UserAPIKeyAuthExceptionHandler seam, before post_call_failure_hook fires, so the StandardLoggingPayload the callback sees has the same provider attribution as RPM/TPM 429s. Regression tests pin both: 4 leakage tests + 4 enrichment tests. The leakage tests would fail under the pre-validation version of either read site; the enrichment tests would fail if the handler skipped the resolver call. * fix(rate-limit): resolve router model_name aliases to real provider (#27914) * fix(rate-limit): resolve router model_name aliases to real provider For nearly every real LiteLLM proxy deployment the request model is a router model_name alias (e.g. 'tpm-locked' -> litellm_params.model: openai/gpt-4o-mini), and 'litellm.get_llm_provider' doesn't know about router aliases — it raises 'LLMProviderNotProvidedError'. The resolver then fell through to the defensive 'litellm_proxy' fallback, so the 'llm_provider' field this PR adds was effectively always 'litellm_proxy' in the field, defeating its purpose for the most common proxy configuration. Add a router-alias fallback step: when 'get_llm_provider' raises, scan the active 'llm_router.model_list' for a deployment whose 'model_name' matches the request model and resolve from its 'litellm_params.model' instead. If multiple deployments share the same alias (load-balancing case) the first one wins — every deployment under one alias should agree on provider in any sensible config, and 'first' is deterministic so the Prometheus label stays stable. Defensive throughout: an uninitialized router, a malformed deployment, a 'litellm_params.model' that itself fails 'get_llm_provider' — every branch falls through to the existing 'litellm_proxy' fallback rather than letting a secondary exception escape and mask the rate-limit error we're trying to surface. Tests: - test_router_alias_resolves_to_underlying_provider: alias 'tpm-locked' -> 'openai/gpt-4o-mini' produces provider='openai', model='gpt-4o-mini'. - test_router_alias_with_multiple_deployments_uses_first. - test_router_alias_unknown_falls_back. - test_router_alias_with_malformed_deployment_falls_back. - Existing fallback test updated to also stub 'litellm.proxy.proxy_server.llm_router' so it exercises the full 'no resolution anywhere' path. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(rate-limit): harden router alias resolver + test isolation - Wrap _resolve_provider_from_router_alias loop in top-level try/except so a non-iterable model_list / unexpected deployment shape can't escape and mask the 429 with a 500. - Type-check litellm_params before .get() to handle non-dict truthy values. - Patch llm_router=None in the parametrized fallback test so a router left by another test in the session can't redirect the unknown-model path. --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(bugbot): preserve "BudgetExceededError" Prometheus label Adding llm_provider to BudgetExceededError (so callbacks get provider attribution from StandardLoggingPayload) made the provider-prefix step in _get_exception_class_name silently flip the label from "BudgetExceededError" to e.g. "Openai.BudgetExceededError", breaking dashboards keyed on the historical value. Short-circuit BudgetExceededError in _get_exception_class_name the same way ProxyRateLimitError already is. Provider/category attribution still lands on the new rate_limit_category / rate_limit_type labels. * test: fix invalid 'rpm' rate_limit_type in v3 limiter test mocks The v3 rate limiter only emits 'requests', 'tokens', or 'max_parallel_requests'. Using 'rpm' caused map_v3_rate_limit_type to return None, leaving the expected RateLimitType.REQUESTS untested. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(bugbot): hoist provider resolver + opt-in prom rate-limit labels - dynamic_rate_limiter.py: hoist resolve_llm_provider_for_rate_limit above the TPM/RPM if/elif so the lookup runs once per request, matching the pattern in dynamic_rate_limiter_v3.py. - prometheus.py: gate the new rate_limit_category / rate_limit_type labels on litellm_proxy_failed_requests_metric behind litellm.prometheus_emit_rate_limit_labels (default False). Mirrors the existing prometheus_emit_stream_label opt-in. Preserves the metric's pre-unification label set so existing dashboards / recording rules keep matching after upgrade; operators can enable the new labels once downstream consumers include them. - Tests updated: default-off back-compat case, opt-in path enables the flag before asserting label presence. * fix: stabilize prometheus label sets and drop redundant model normalization - Cache PrometheusLogger.get_labels_for_metric per metric_name so that the label set used to construct counters at __init__ time stays in sync with the label set used at increment time, even if module-level toggles like prometheus_emit_rate_limit_labels or prometheus_emit_stream_label are flipped at runtime. Without this, toggling these flags after the logger was created would cause ValueError from prometheus_client because the runtime labels would not match the counter's declared labelnames. - Drop redundant 'model or ""' guard in ProxyRateLimitError.__init__ where model is already normalized one step earlier. Co-authored-by: Yassin Kortam <yassin@berri.ai> * perf(dynamic_rate_limiter): only resolve provider when rate limit hit Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(prometheus): clear cached metric labels after toggling rate-limit flag The PrometheusLogger caches each metric's label set at construction time so that labels used at counter.labels(...) time stay consistent with the labels the metric was registered with. The enterprise async_post_call_failure_hook test toggles litellm.prometheus_emit_rate_limit_labels = True AFTER the fixture has already built the logger, so without invalidating the cache the rate_limit_category / rate_limit_type labels never reach the mocked counter and the assert_called_once_with check fails. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test: fix CI failures from prom label cache + flaky time-window assertion PrometheusLogger.get_labels_for_metric now caches the per-metric label set at first read so the labels passed to counter.labels(...) stay in lock step with the labels the counter was registered with. This broke two existing test patterns: - test_prometheus_labels.py: tests bind the real method onto a MagicMock, but MagicMock auto-creates a Mock for _cached_metric_labels whose .get(...) returns a truthy Mock — treated as a populated cache and returned as the label set, producing empty filtered labels and KeyError on labels["requested_model"] / ["route"]. Seed real {} containers for _cached_metric_labels and label_filters before binding. - test_prometheus_logging_callbacks.py::test_set_team_budget_metrics_with_custom_labels: the fixture builds the logger before the test monkeypatches litellm.custom_prometheus_metadata_labels, so the cached label set never picks up the new metadata labels. Clear the cache after the monkeypatch (same pattern already used for the rate-limit toggle in test_async_post_call_failure_hook). UI: view_logs/index.test.tsx "Last Minute" window assertion is off by one at the minute boundary. start_date is floored to the minute, so the dropped sub-minute fraction can push the truncated-seconds diff up to (minMinutes+1)60 exactly when the click lands near a minute rollover. Switch the upper bound to toBeLessThanOrEqual. feat(otel-v2): surface rate_limit_category + rate_limit_type on failed LLM-call spans PR #28909 introduced the typed v2 OTel engine that builds spans from StandardLoggingPayload, with SpanError carrying error_type + message and the genai mapper stamping error.type onto every failed LLM-call span. This PR's earlier commits added error_rate_limit_category and error_rate_limit_type to the same StandardLoggingPayload.error_information the v2 engine reads — but neither field reached a span attribute, so v2 OTel traces stayed opaque about why a 429 fired (vendor vs litellm, RPM vs TPM vs concurrent vs budget vs max_iterations) even after the custom-callback and prometheus surfaces gained that decomposition. Three coupled changes: 1. semconv.py: add LiteLLM.ERROR_RATE_LIMIT_CATEGORY / LiteLLM.ERROR_RATE_LIMIT_TYPE under the litellm.* vendor namespace (no GenAI semconv equivalent exists for who-rate-limited / which-dimension). 2. payloads.py: extend SpanError with rate_limit_category + rate_limit_type, populated by _parse_error() from the same error_information.error_rate_limit_* fields the custom-callback channel and prometheus rate_limit_category / rate_limit_type labels read. Single source of truth across all three observability surfaces. 3. mappers/genai.py: stamp the two attributes on the LLM-call span when present. drop_none guarantees they stay absent (not 'None') for non-rate-limit failures so trace consumers can read them unconditionally. Three regression tests in test_otel_v2_emitter.py pin: a vendor / litellm-internal RateLimitError lands category=litellm_rate_limit + rate_limit_type=requests on the span; a BudgetExceededError lands rate_limit_type=budget; a non-rate-limit failure (BadRequestError) keeps the rate_limit_* attributes absent. Mutation-tested against reverting either the SpanError extension or the _parse_error read site — both new tests fail under either mutation. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test: align prometheus user-budget + logs quick-select tests with merged code The merge into this branch left two test patterns out of step with the code they exercise. test_set_user_budget_metrics_includes_user_email_and_alias_labels_when_opted_in flipped litellm.prometheus_user_budget_label_include_email_alias after the fixture had already built the PrometheusLogger. get_labels_for_metric now snapshots each metric's label set at construction time, so the runtime flip no longer reached the cached labels. Enable the flag before constructing the logger, matching how the proxy applies config at startup. view_logs/index.test.tsx referenced uiSpendLogsCall and moment without importing them, and the merged index.tsx now fetches through useLogFilterLogic (the hook the file stubs out) rather than calling uiSpendLogsCall directly. Add the imports and restore the real hook for the Quick Select window assertions so the call is actually observed. * refactor(otel/v2): drop rate-limit decomposition from the LLM-call span Proxy-side rate limits (litellm_rate_limit, budget, max_iterations) are rejected at the gate before any upstream call, so async_post_call_failure_hook tags the synthetic failure log with LITELLM_LOGGING_NO_UPSTREAM_LLM_CALL and the v2 OTel logger never opens an LLM-call span for them; the litellm.error.rate_limit_category / litellm.error.rate_limit_type attributes were dead for exactly the cases they were meant to surface. The only failure that does open an LLM-call span carrying a RateLimitError is a vendor 429, where rate_limit_type is always None and the category just restates error.type=RateLimitError. The decomposition still reaches downstream consumers through StandardLoggingPayload.error_information.error_rate_limit_* and the prometheus rate_limit_category / rate_limit_type labels, both unchanged. Removes the SpanError fields, the _parse_error reads, the genai mapper attributes, the semconv keys, and the three span tests that asserted a scenario that never reaches the mapper in production. * fix(batch_rate_limiter): map max_parallel_requests to concurrent_requests * refactor(prometheus): drop transitive fastapi import from _get_exception_class_name Read the legacy exception_class label from a prometheus_exception_class_name marker on ProxyRateLimitError instead of importing the proxy module, keeping the integrations layer free of a transitive fastapi dependency. * chore(ui): sync schema.d.ts with unified rate-limit error spec The ProxyRateLimitError docstring flows into the proxy OpenAPI spec's 429 response description, so the generated dashboard types were out of sync. Regenerated via npm run gen:api (Check UI API Types Sync). --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> Co-authored-by: Yassin Kortam <yassin@berri.ai>	2026-06-06 17:50:29 -07:00
ryan-crabbe-berri	f31d059aa3	feat(ui): add budget duration to edit team member form (#29717 ) * feat(ui): add budget duration to edit team member form Editing a team member created a member budget with no duration, so the budget never reset. This threads a budget reset period through the edit flow end to end and reuses the shared duration dropdown so the options stay in sync with the rest of the UI. Resolves LIT-2651 * fix(proxy): validate member budget_duration and persist clears Reject budget_duration values that can't be parsed, are non-positive, or overflow date math before any write, so a bad value can't be persisted and later crash the budget reset job. Clearing the budget duration in the edit-member form now sends null and clears the column end to end, so the dropdown's clear control reflects a real change instead of being a no-op * chore(ui): regenerate schema.d.ts for member budget_duration Adds budget_duration to TeamMemberUpdateRequest/Response in the generated dashboard types so the Check UI API Types Sync gate passes	2026-06-06 17:24:55 -07:00
Mateo Wang	aeb55e7a11	fix(mcp): highlight MCP cards red when the logged-in user is missing per-user env vars (#29856 ) * fix(mcp): flag missing per-user env vars on the card for every accessible server The dashboard MCP card grid lists servers via the registry-backed manager (get_all_mcp_servers_unfiltered for admins in view_all mode, the allowed-context aggregation otherwise), but the per-user env-var status endpoint that drives the red "user fields missing" highlight resolved servers through the much narrower get_all_mcp_servers_for_user, which only returns servers explicitly granted on the calling key. An admin's dashboard session key carries no per-server MCP grant, so the status feed came back empty and the card never turned red even when the logged-in user had not filled in their required variables. Both surfaces now share a single _resolve_accessible_mcp_servers helper, so the status feed is computed over exactly the cards the user sees. The helper returns servers unredacted; the status endpoint needs the raw env_vars and still only ever reports is_set booleans, never the stored secret values. * test(mcp): drop dead get_all_mcp_servers_for_user patch from view_all regression test The bulk status endpoint resolves servers through _resolve_accessible_mcp_servers now, so the old get_all_mcp_servers_for_user patch in the admin view_all regression test is never hit. Removing it keeps the test honest about which code path it exercises.	2026-06-06 16:51:25 -07:00
Mateo Wang	d61f7747c0	feat(bedrock): forward strict and additionalProperties to Converse toolSpec (#29814 ) * feat(bedrock): forward strict and additionalProperties to Converse toolSpec Bedrock Converse supports strict in toolSpec since 2026-02, but _bedrock_tools_pt only whitelisted type/properties/required/name/description, so strict: true was silently dropped and Claude-on-Bedrock ignored enum constraints that GPT and direct-Anthropic honored. Forward strict from the OpenAI function and additionalProperties from the schema (Bedrock requires the latter alongside strict), passing each only when present. https://claude.ai/code/session_01WQjWd8NfUB3vxERwudbHkv * fix(bedrock): only forward strict tool schemas to Claude on Converse Nova, Llama and GPT-OSS on Bedrock reject the strict field (BedrockException 'This model doesn't support the strict field'), and the GPT-OSS request-body test asserts strict/additionalProperties are stripped. Forwarding them to every model broke the llm_translation suite, so gate the forwarding on the anthropic base model since only Claude honours strict tool schemas on Bedrock.	2026-06-06 16:28:18 -07:00
milan-berri	273855b4e2	fix(responses-bridge): map system-only chat request to system input item (#29817 ) System-only chat requests mapped the system message to instructions and left input=[], which OpenAI's Responses API rejects (it also rejects input=""). When no other messages are present, carry the system message as a role:"system" input item (single copy, correct role) instead of leaving input empty. Mirrors the existing handling of non-string system content. Fixes Open WebUI new-conversation failures on mode:responses Codex models. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-06 16:11:54 -07:00
Yassin Kortam	68d67212cd	fix: 400 on Anthropic context overflow; seed identity on failed auth (#29848 )	2026-06-06 14:57:41 -07:00
Mateo Wang	33c363d4d4	Extend the record/replay proxy to chat, embeddings, moderations, rerank, and Anthropic (#29847 ) * test(ci): extend record/replay proxy to chat, embeddings, moderations, rerank, anthropic The record/replay proxy that took the gpt-image-1 spend E2E off the live OpenAI path now fronts every provider, so the other real-provider E2Es stop paying for and depending on live calls each commit. It keys per upstream and selects a non-OpenAI provider by a /__recorder_upstream/<host>/ path prefix carried on the model's api_base, since some litellm handlers (cohere rerank) drop custom request headers. Wired into build_and_test (chat, embeddings, moderations, image), the otel job (cohere rerank), and the anthropic-messages job via a reusable start_openai_record_replay_proxy command. Dropped the time.time()/uuid prompt cache-busters in the build_and_test chat tests, whose config has the response cache off, so identical requests are recordable. The image spend test now asserts a repeat call still bills spend, failing loudly if the proxy response cache is ever turned on. Responses, the anthropic passthrough, bedrock, and fake-endpoint tests are left live: their lifecycles, api_base assertions, providers, or fake targets make a stateless body-keyed cache either break them or add nothing. * docs(ci): note the recorder command's OpenAI default upstream and prefix override Addresses a review note: the shared start_openai_record_replay_proxy command defaults the upstream to OpenAI, so a non-OpenAI model must carry the /__recorder_upstream/<host>/ prefix on its api_base. Document that in the command description so a future caller does not assume the default follows the provider.	2026-06-06 14:33:42 -07:00
Shivam Rawat	fdade8a84e	Title: fix(proxy): resolve vector store file list credentials from team deployments (#29739 ) * fix(proxy): resolve vector store file list credentials from team deployments GET /v1/vector_stores/{id}/files now uses the same router credential routing as POST, including JWT team model hints and wildcard model selectors, so list requests no longer call OpenAI with Bearer None. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(proxy): authorize model hints and fix credential routing for vector store file list Resolves three review findings on the vector store file list path. Authorize user-controlled model hints (?model= query param and the x-litellm-model header) against the key's and team's allowed models via can_key_call_model / _can_object_call_model before any deployment credentials are resolved, closing a model access bypass where a normal key could file-list using a restricted deployment's provider credentials. Run the managed vector store registry resolution before the model routing hint so the managed store sets the routing model first; the hint resolver then selects credentials matching that model instead of a team fallback deployment, avoiding a credential/model mismatch across deployments. Skip team-fallback deployments whose provider cannot be determined instead of treating them as OpenAI, so a deployment without an explicit custom_llm_provider or "openai/" prefix no longer has its credentials injected. * fix(proxy): enforce vector store file model auth Ensure vector store file listing routes authorize explicit and inferred model routing before resolving deployment credentials. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(proxy): type guard vector store model hints Keep vector store model hint authorization typed to string-only values so static checks pass. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-06 12:36:05 -07:00
Shivam Rawat	1fbb78d2a4	Title: Fix managed batch cancel credential resolution (#29734 ) * Fix managed batch cancel credential resolution Decode unified batch IDs before cancel routing and resolve litellm_credential_name to api_key in Router._acancel_batch so JWT team-scoped deployments cancel with the same credentials used at create time Co-authored-by: Cursor <cursoragent@cursor.com> * fix batch cancellation credential cleanup Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-06 12:35:18 -07:00
Mateo Wang	51769a8ede	feat(fal_ai): add Nano Banana / Gemini 2.5 Flash Image generation support (#29798 ) * feat(fal_ai): add Nano Banana / Gemini 2.5 Flash Image generation support Adds a FalAINanoBananaConfig for fal.ai's Nano Banana models, exposed under both fal-ai/nano-banana and fal-ai/gemini-25-flash-image (identical schema). This is the migration path for fal-ai/imagen4, which fal deprecates on 2026-06-30. The config derives the request endpoint from the model name so both aliases route correctly, maps OpenAI image params to the fal schema (n -> num_images, size -> nearest supported aspect_ratio, response_format ignored since the model returns URLs), and reuses the base fal response parser. Pricing is registered at 0.039 per image in the cost map and backup. * fix(fal_ai): tighten nano-banana routing and guard mapped params Match the specific gemini-25-flash-image / gemini-2.5-flash-image aliases instead of any model containing gemini so future fal.ai Gemini-branded models aren't silently misrouted to the nano-banana config. Guard the param mapping on the fal-side keys (num_images, aspect_ratio) so a pre-set mapped value is respected and an OpenAI key is never forwarded unmapped. * fix(fal_ai): drop non-existent gemini-2.5-flash-image routing alias fal.ai only serves the dotted-free fal-ai/gemini-25-flash-image and fal-ai/nano-banana endpoints. Routing the dotted gemini-2.5-flash-image alias built a https://fal.run/fal-ai/gemini-2.5-flash-image URL that fal.ai 404s and had no pricing entry, so spend tracking silently fell to zero. Match only the two real endpoint slugs.	2026-06-06 11:16:44 -07:00
Mateo Wang	b3297fc2ea	feat(proxy): hot-reload .env in dev when running with --reload (#29783 ) * feat(proxy): hot-reload .env in dev when running with --reload The --reload watcher already restarts the worker on .py and --config YAML edits, but .env was unwatched, so changing a key there did nothing until a manual restart. Add .env to the uvicorn reload_includes (and to the StatReload monkeypatch, which ignores reload_includes) so an edit triggers a worker restart. A reloaded worker is a fresh process that inherits the reloader's environment, so load_dotenv(override=False) would keep serving the stale inherited value for any key already in the environment. The CLI now exports LITELLM_DEV_ENV_HOT_RELOAD when --reload is set, and litellm/__init__.py reads it to load .env with override=True only on that dev path, leaving normal startup precedence untouched. feat(proxy): warn that --reload makes .env override shell env vars When --reload is active, worker processes re-read .env with override=True, so .env values win over shell-exported environment variables. Surface this dotenv precedence change with a startup warning so a developer who relies on a shell-exported override is not silently surprised. * fix(proxy): type reload helper paths as Optional[str] to satisfy mypy * fix(proxy): watch the cwd .env in both reload backends for parity WatchFiles only watches cwd (and the --config dir) for .env, while the StatReload fallback used find_dotenv(usecwd=True), which walks up to a parent-dir .env that WatchFiles never sees. Point StatReload at the same cwd .env so the two reload backends react to the same file.	2026-06-06 09:39:21 -07:00
Mateo Wang	aa7845dc5e	test(ci): make the image-gen record/replay proxy report cache mode and per-request HIT/MISS (#29802 ) The recorder could come up pointed at a missing or unreachable cassette redis and silently forward every request live; the health check still passed and the process logged nothing, so a CI run looked identical whether it replayed from the cassette or paid OpenAI for a fresh call every commit. There was no way to tell from the logs whether the 24h caching was actually happening. It now announces its mode at startup (REPLAY when the cassette redis is reachable, PASSTHROUGH when CASSETTE_REDIS_URL is unset, DEGRADED when it is set but the redis is unreachable) and logs a HIT/MISS line per request. _cache_set returns whether the write landed so a mid-run redis failure surfaces as a warning instead of masquerading as a successful record. Adds unit tests covering the three startup modes and the HIT/MISS/not-recorded request paths; both new behaviors were mutation-checked.	2026-06-06 09:36:06 -07:00
tin-berri	22186f457a	fix(ui): persist Tools-tab MCP OAuth token to DB (#29809 )	2026-06-05 22:29:56 -07:00
Mateo Wang	4ec4ab99d0	feat(mcp): per-server env vars with global + per-user scopes (#28917 )	2026-06-05 20:15:11 -07:00
yuneng-jiang	53cf3d8416	fix(proxy): drop deleted team BYOK model name from team.models (#29820 ) Deleting a team-scoped BYOK model left its public name in team.models, so /models with a team key kept listing the now-deleted "ghost" model. delete_model stripped team.models using only litellm_modeltable alias lookups, but models added via /model/new with a team_id never create an alias row; their public name lives only in team.models and model_info.team_public_model_name, so it was never removed. The team cache was also left stale because the delete path skipped _refresh_cached_team. The cleanup now keys off team_public_model_name (falling back to alias keys), runs after the deployment row is deleted, and strips a public name only when no remaining team deployment still backs it, so a load-balanced replica is not revoked and concurrent deletes cannot leave a ghost. The updated team row is refreshed in cache so /models reflects the change immediately	2026-06-05 18:35:50 -07:00
milan-berri	b7f47a3b52	fix(jwt): use resolved DB user_id for spend on legacy email match (#29217 ) * fix(jwt): attribute spend to resolved DB user_id on email/sso fuzzy match When user_id_upsert is enabled with JWT auth and a pre-migration user row exists whose user_email matches the JWT email but whose user_id is a UUID, get_user_object resolves the legacy row via fuzzy lookup, but the JWT-claim user_id (the email) still flowed into team-membership lookup, JWTAuthBuilderResult.user_id, UserAPIKeyAuth and the spend tables. Spend was orphaned under a phantom email id; /user/info and the Usage page showed $0 for the legacy user (GH #26789). Treat the resolved user_object as the source of truth: add _canonical_user_id_from_db, rebind inside get_objects, and return effective_user_id so auth_builder unpacks it without adding statements. Fixes #26789 Co-authored-by: Cursor <cursoragent@cursor.com> * fix(jwt): log user_id rebind at DEBUG to avoid email PII in INFO streams Greptile review on #29217: rebinding often logs JWT email claims at INFO. Co-authored-by: Cursor <cursoragent@cursor.com> * test(jwt): update passthrough allowlist mock for 5-tuple get_objects Staging #29256 added a test that still mocked get_objects with a 4-tuple; our PR expanded the return to 5 values (effective_user_id). Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-05 15:59:41 -07:00
Sameer Kankute	95e3d136e1	test(google): add google-genai SDK proxy integration tests (#29781 ) * test(google): add google-genai SDK proxy integration tests for Gemini and Vertex Pin google-genai in the CI dependency group and exercise streaming/non-streaming generate_content through the LiteLLM proxy in the existing unified_google_tests suite. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(test): address Greptile review for google-genai proxy SDK tests Restore GOOGLE_APPLICATION_CREDENTIALS after the module proxy fixture tears down, initialize temp-file tracking on the proxy SDK base class, and skip litellm reload for proxy_genai_sdk tests so the module-scoped proxy server stays consistent. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(test): only load Vertex credentials when keys exist for proxy SDK tests Avoid writing empty GOOGLE_APPLICATION_CREDENTIALS temp files so Vertex tests skip cleanly without credentials, use a session-scoped proxy fixture, and clean up per-test credential temp files. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(test): scope google-genai pin to unified_google_tests only Remove google-genai from the ci dependency group and pin it in tests/unified_google_tests/requirements.txt for local test installs. Co-authored-by: Cursor <cursoragent@cursor.com> * test(google): tie litellm reload skip to proxy fixture dependency Replace the name-based reload guard with a check on whether the test requests the google_genai_proxy_url fixture, so the skip stays correct if the proxy SDK tests are renamed. * fix(test): stop DatabaseURLSettings tests leaking DATABASE_URL into os.environ The autouse env scrubber relied on monkeypatch.delenv, but apply_to_env writes DATABASE_URL straight into os.environ, which monkeypatch never tracks and therefore never undoes. The synthesized writer.example.com URL leaked past the last test in this module and into proxy-infra tests that read DATABASE_URL to decide whether to hit a real database, e.g. test_deprecated_key_grace_period_cache_hit_path, turning an intended skip into a ConnectError. Snapshot and restore the managed vars directly so the original environment is reinstated regardless of how it was mutated. * test(google): drop redundant per-test vertex credential setup The session-scoped google_genai_proxy_url fixture already configures GOOGLE_APPLICATION_CREDENTIALS before the proxy starts, and _require_proxy_sdk skips when credentials are missing, so the per-test _setup_vertex_credentials_if_needed helper and its temp-file tracking never did any work. Remove it to keep the ABC self-contained. * test(google): declare model_config contract on proxy SDK ABC _skip_reason_if_credentials_missing reads self.model_config to pick the provider, but that property was only declared on the sibling BaseGoogleGenAITest. Make the dependency explicit by adding model_config as an abstract property on BaseGoogleGenAIProxySDKTest so the ABC is self-contained and a standalone subclass fails fast instead of hitting an AttributeError. * test(google): narrow streaming error catch to Exception Catching BaseException in the streaming assertion swallowed KeyboardInterrupt and SystemExit, turning a Ctrl-C into a test failure message instead of letting pytest interrupt cleanly. Only genuine runtime errors should be recorded as stream failures, so catch Exception. * test(google): initialize proxy on the same loop that serves it The proxy was initialized via asyncio.run() on the main thread, which creates and tears down a throwaway event loop, while requests were served on a separate loop in the worker thread. Any asyncio primitive bound to the init loop would be unusable once serving started. Run initialize() on the worker thread's loop right before server.serve() so setup and request handling share a single event loop. * test(google): drop redundant google-genai requirements pin google-genai>=1.37.0,<2.0 is already declared in the proxy-runtime extra, which the google_generate_content_endpoint_testing CI job installs via uv sync --all-extras. The standalone tests/unified_google_tests/requirements.txt duplicated that pin with a narrower ==1.37.0 specifier and was never installed by CI, so it added a second source of truth without changing what gets installed. Drop it and rely on the proxy-runtime extra. * chore: revert incidental uv.lock exclude-newer bump The google-genai ci pin was added and then dropped (it is already provided by the proxy-runtime group), but each uv lock recomputed the relative exclude-newer span, leaving only a timestamp bump in uv.lock. Restore it to the base value so this test-only PR carries no lockfile change. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com>	2026-06-05 21:05:32 +00:00
Sameer Kankute	d671a09c20	Litellm oss staging 050626 (#29774 ) * Mark xAI models retiring on 2026-05-15 (#28788) Per https://docs.x.ai/developers/migration/may-15-retirement, xAI is retiring the following slugs on 2026-05-15 (auto-redirect to grok-4.3 with various reasoning efforts; callers continuing to use the old slugs will be billed at grok-4.3 pricing): grok-4-1-fast-reasoning{,-latest} -> grok-4.3 (low effort) grok-4-1-fast-non-reasoning{,-latest} -> grok-4.3 (none) grok-4-fast-reasoning -> grok-4.3 (low effort) grok-4-fast-non-reasoning -> grok-4.3 (none) grok-4-0709 -> grok-4.3 (low effort) grok-code-fast-1{,-0825} -> grok-build-0.1 grok-3 -> grok-4.3 (none) Only the direct xai/ slugs are tagged; third-party hosts (azure_ai, oci, vercel_ai_gateway, perplexity/xai) run their own schedules. The grok-3 retirement list explicitly names only the base grok-3 slug — the -mini / -fast / -beta / -latest variants are not listed, so they remain untouched. * feat(moonshot): advertise json_schema response support on live models (#29683) litellm.responses() already routes Moonshot through the responses->chat-completions bridge, and Moonshot honors response_format json_schema on chat completions. The cost-map entries left supports_response_schema unset, so discovery layers that gate on that flag dropped Moonshot from structured-output / responses listings even though the capability works end to end. Set supports_response_schema on the nine models currently live on api.moonshot.ai: kimi-k2.5, kimi-k2.6, the moonshot-v1 8k/32k/128k text and vision-preview variants, and moonshot-v1-auto. Verified against the live API that each honors json_schema and that litellm.responses() returns schema-valid structured output through the bridge. * chore(moonshot): mark models retired from api.moonshot.ai as deprecated (#29685) Thirteen Moonshot/Kimi models in the cost map no longer resolve on api.moonshot.ai (all return 404). Stamp each with its deprecation_date from platform.kimi.ai/docs/models rather than deleting the entries, so historical cost calculation keeps resolving the names while tooling can surface the retirement. Dates: kimi-thinking-preview 2025-11-11; kimi-latest and its 8k/32k/128k context variants 2026-01-28; the kimi-k2 preview/turbo/thinking series 2026-05-25; the moonshot-v1 -0430 snapshots use their own 2024-04-30 snapshot date (Moonshot publishes no discontinuation date for them). * fix(moonshot): drop temperature for reasoning models (kimi-k2.5/k2.6) (#29687) Kimi reasoning models reject every temperature except 1; a request with temperature=0.2 returns "invalid temperature: only 1 is allowed for this model". litellm only clamped temperature into [0.3, 1], so any value below 1 still 400'd. Drop the temperature param entirely for reasoning models (gated on supports_reasoning, the same signal transform_request already uses) so the model default is used; the non-reasoning moonshot-v1 models keep the existing clamp. Co-authored-by: Sameer Kankute <sameer@berri.ai> * feat(mcp): add per-server timeout configuration (#29672) * feat(mcp): add per-server timeout configuration * fix(mcp): address timeout field review comments - use is not None guard instead of or for 0.0 edge case - copy timeout in both LiteLLM_MCPServerTable constructions (health check path + _build_mcp_server_table) - add timeout Float? column to all three schema.prisma files - extend round-trip test to cover _build_mcp_server_table direction - add test for zero timeout not treated as falsy * fix(mcp): forward timeout in _build_temporary_mcp_server_record * fix(mcp): return 504 instead of 500 when per-server timeout fires * test(mcp): add 504 timeout regression test; fix black formatting * Add jp. Bedrock cross-region inference profile for claude-opus-4-7 (#28567) * fix(thinking): handle None thinking param in is_thinking_enabled (#28598) Squash-merged by litellm-agent from Terrajlz's PR. * feat(helm): support tpl rendering in podAnnotations (#28609) Squash-merged by litellm-agent from devauxbr's PR. * Forward custom_llm_provider through the Responses API bridge (Fixes #28505) (#28575) * Forward custom_llm_provider through the Responses API bridge (Fixes #28505) When a Chat Completions request to a GPT-5.4+ model contains both `tools` and `reasoning_effort`, `completion()` auto-routes through `responses_api_bridge`. The bridge handler called `litellm.responses()` / `litellm.aresponses()` without forwarding the already-resolved `custom_llm_provider`, so the downstream call re-invoked `get_llm_provider()` with `custom_llm_provider=None` and stripped a second provider prefix from a `provider/provider/model` deployment string. For a deployment configured as `openai/openai/openai/gpt-5.5`, the bridge flow sent `openai/gpt-5.5` to the upstream API instead of the correct `openai/openai/gpt-5.5`. Upstream APIs that enforce model-name allow-lists rejected this as `key_model_access_denied`. Fix: pass the locally-resolved `custom_llm_provider` into both the sync `responses()` and async `aresponses()` calls so the downstream `_resolve_model_provider_for_responses` sees an explicit provider and skips the second prefix-strip. New regression test `tests/test_litellm/completion_extras/test_responses_bridge_provider_propagation.py` pins both call sites: each must forward `custom_llm_provider`. * fix(28505): set custom_llm_provider on request_data instead of as duplicate kwarg Greptile flagged that the previous patch passed custom_llm_provider as an explicit kwarg to responses()/aresponses() while request_data already carried it via the spread of sanitized_litellm_params, which would raise TypeError: got multiple values for keyword argument on every real bridge call. Switches to assigning request_data['custom_llm_provider'] before the call so the resolved provider wins over whatever sanitized_litellm_params spread in, without duplicating the kwarg. Updates the regression test to seed request_data with a sentinel custom_llm_provider so it actually exercises the overwrite path (the previous test mocked transform_request with a minimal dict and never hit the conflict). * chore: trigger shin-agent re-eval on retargeted staging base * chore: trigger shin-agent re-eval against updated Greptile state * Add jp. Bedrock cross-region inference profile for claude-opus-4-7 AWS Bedrock documents jp.anthropic.claude-opus-4-7 alongside the existing us./eu./au./global. profiles for Claude Opus 4.7 (ap-northeast-1 Tokyo / ap-northeast-3 Osaka), but the entry is missing from model_prices_and_context_window.json. Tokyo-region users currently get an "unknown model" error when routing through the JP geo profile. Adds the entry to both the canonical file and the bundled backup, mirroring the recent pattern for sonnet-4-6 (#27831). Pricing matches the other regional profiles (10% premium over base/global). Regression test pins all six documented profiles (base, global, us, eu, au, jp) and asserts pricing parity between jp. and au. variants. Source: https://docs.aws.amazon.com/bedrock/latest/userguide/model-card-anthropic-claude-opus-4-7.html --------- Co-authored-by: Terrajlz <info@jouleselectrictech.com> Co-authored-by: Bruno Devaux <devaux.br@gmail.com> Co-authored-by: Sameer Kankute <sameer@berri.ai> * feat(soniox): add soniox audio transcription integration (#29508) * feat(openmeter): add OPENMETER_TRUST_REQUEST_USER to prevent forged attribution (#29650) The OpenMeter callback resolves the CloudEvent subject from kwargs["user"] first, then falls back to the key-bound user_api_key_user_id. For multi-tenant proxy deployments, a client can set `"user": "..."` in the request body and cause their usage to be attributed to that arbitrary string — a billing-attribution forgery risk. Adds OPENMETER_TRUST_REQUEST_USER env var (default "true" for backward compatibility). When set to "false", the request-supplied `user` field is ignored and the subject is resolved solely from user_api_key_user_id. Matches the existing env-var-driven config pattern in this file (OPENMETER_API_KEY, OPENMETER_API_ENDPOINT, OPENMETER_EVENT_TYPE). * feat(search): add you_com as a search provider (#28370) * feat(search): add you_com as a search provider Registers You.com Search API as a first-class `search_provider` in the `search_tools` registry, alongside Tavily, Exa, Perplexity, etc. - New adapter: litellm/llms/you_com/search/transformation.py - POSTs to https://ydc-index.io/v1/search - Auth: X-API-Key from YOUCOM_API_KEY (or explicit api_key) - Maps Perplexity unified spec: max_results -> count, search_domain_filter -> include_domains, country -> country - Flattens results.web + results.news into a single SearchResult list; snippet prefers snippets[0], falls back to description; page_age -> date - Registry: SearchProviders.YOU_COM in litellm/types/utils.py and wired into ProviderConfigManager.get_provider_search_config() - Pricing entry: model_prices_and_context_window.json (placeholder $0.0; happy to adjust to maintainers' preferred public number) - Docs: example router config snippet and example proxy yaml updated - Tests: tests/search_tests/test_you_com_search.py - 5 mocked tests (payload shape, domain filter mapping, snippet fallback, news flattening, missing-api-key error) Refs upstream expansion signal: #15942 * review fixups: normalize api_base, lowercase country, scope env-var to test Addresses Greptile inline review comments on #28370: - get_complete_url: strip trailing slashes from api_base before the endswith("/v1/search") check, so a custom base like ".../v1/search/" doesn't become ".../v1/search/v1/search". - transform_search_request: .lower() country before sending, matching Tavily's convention so callers using the unified spec form ("US") get consistent behavior across providers. - Tests: replace direct os.environ writes with an autouse monkeypatch fixture so YOUCOM_API_KEY is set per-test and removed afterwards. The missing-key test now uses monkeypatch.delenv. New test asserts the trailing-slash normalization above. Reverts the ARCHITECTURE.md / example yaml edits per the reviewer note that documentation changes belong in the litellm-docs repo. * support keyless free tier (api.you.com/v1/agents/search) as default You.com offers an IP-throttled keyless endpoint that returns the same response shape as the keyed one (~100 queries/day, no signup). This is a significant onboarding lever - mirrors the keyless DuckDuckGo/SearXNG providers already in the search_tools registry. Behavior: - YOUCOM_API_KEY set -> keyed: POST https://ydc-index.io/v1/search (X-API-Key header) - no key -> free: POST https://api.you.com/v1/agents/search (no auth) - YOUCOM_API_BASE override -> honored as-is Tests: - New: test_you_com_search_keyless_free_tier - asserts URL + absence of X-API-Key when no key is configured. - New: test_you_com_search_validate_environment_keyless - asserts the config no longer raises when the key is absent. - Removed: test_you_com_search_raises_without_api_key (the precondition no longer holds). - Existing payload/domain-filter/etc tests still cover keyed mode via the autouse YOUCOM_API_KEY fixture. Verified both endpoints accept POST + return identical JSON shape: results.web[] / results.news[] with title, url, snippets, description, page_age. * register you_com in provider_endpoints_support.json Adding `litellm/llms/you_com/` requires a corresponding entry in provider_endpoints_support.json or the code-quality/check_provider_folders_documented CI check fails. Follows the compact tavily/serper pattern - endpoints: { search: true }. Local run of the check now reports "All 114 provider folders are documented". * move tests under tests/test_litellm/llms/ so CI exercises them The litellm CI workflows scope unit tests to `tests/test_litellm/...` (see test-unit-llm-providers.yml: `tests/test_litellm/llms` path), so tests living under `tests/search_tests/` are never run in CI - which is why codecov reports 0% patch coverage for the new adapter even though the unit tests exist and pass locally. Move test_you_com_search.py into `tests/test_litellm/llms/you_com/` so the test-unit-llm-providers job picks it up. 7/7 tests still pass at the new location. (Sibling search-only providers - tavily, exa_ai, brave, etc. - still live only in `tests/search_tests/` and would benefit from the same move, but that is out of scope for this PR.) * fix(you_com): pin Accept-Encoding: identity to dodge keyless gzip bug The keyless free-tier endpoint (api.you.com/v1/agents/search) advertises Content-Encoding: gzip but returns a body that httpx's decoder rejects with `zlib.error: Error -3 while decompressing data: incorrect header check`, surfacing as litellm.APIConnectionError in user code. curl works because it doesn't request compression by default. Pin Accept-Encoding: identity in validate_environment so the upstream server skips compression entirely. Harmless on the keyed endpoint (ydc-index.io/v1/search) which negotiates content-encoding correctly. The header uses setdefault so a caller-supplied Accept-Encoding still takes precedence. (Server-side bug has been flagged to the You.com team separately - once fixed there, this workaround can be removed.) New unit test: test_you_com_search_pins_identity_accept_encoding. --------- Co-authored-by: Sameer Kankute <sameer@berri.ai> * docs: fix README typo (#29419) Correct clear spelling mistakes in documentation without changing behavior. Confidence: high Scope-risk: narrow Tested: git diff --check; uvx codespell on changed files Not-tested: Full docs build not run; text-only changes * Fix(langfuse): pass httpx_client to Langfuse in langfuse_prompt_management to respect SSL_VERIFY (#29480) * fix(langfuse): pass ssl_verify to Langfuse httpx client * fix_langfuse_ * add unit tests * addressed comments --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * feat(models): add minimax/MiniMax-M3 to model cost map (#29412) Add MiniMax's new flagship MiniMax-M3 to the native minimax provider: 512K context, 128K max output, native multimodal (supports_vision), reasoning, prompt caching. Pricing (USD/M tokens): input 0.6 / output 2.4 / cache read 0.12. M3 has no active prompt-cache-write tier, so cache_creation_input_token_cost is omitted. Updated both the root model_prices_and_context_window.json (remote source) and the bundled litellm/model_prices_and_context_window_backup.json (local fallback), keeping them in sync. * fix(logging): handle ResponseCompletedEvent in anthropic_messages streaming spend log (#29394) * fix(logging): handle ResponseCompletedEvent in anthropic_messages streaming spend log * fix(logging): extend terminal event handling to ResponseIncompleteEvent and ResponseFailedEvent; fix return type annotation * feat(provider): Add Neosantara provider as OpenAI Compatible (#29646) * Add Neosantara provider * Register Neosantara provider enum * Address Neosantara provider review feedback * Add Neosantara packaged endpoint support --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * fix: address greptile and veria review feedback - langfuse: guard httpx_client injection behind version check (>= 2.7.3) - soniox: propagate audio_transcription_duration in _hidden_params for spend tracking - soniox: give SONIOX_API_BASE env var priority over caller-supplied api_base - mcp: replace CancelledError catch with asyncio.wait_for + TimeoutError * chore(mcp): add migration for per-server timeout column * fix(test): add tool_use_system_prompt_tokens to model prices schema validator * fix: mcp timeout test uses real asyncio.wait_for timeout; you_com get_complete_url respects resolved api_key * fix: forward resolved api_key into you_com endpoint selection and apply timeout to soniox polling GETs The search flow resolves api_key in validate_environment but never passed it into get_complete_url, so a programmatic api_key (with no YOUCOM_API_KEY in the env) set the X-API-Key header yet still selected the keyless free-tier endpoint. Forward api_key through both the search entrypoint and the http handler so the keyed endpoint is chosen. HTTPHandler.get/AsyncHTTPHandler.get had no timeout parameter, so the Soniox poll and transcript-fetch GETs silently used the client global default instead of the caller timeout. Add a per-request timeout to get() and forward the configured timeout from the Soniox handler. * fix(soniox): price stt-async-v4 per second so transcriptions are billed The handler stores audio_transcription_duration in _hidden_params, but the model carried only token cost fields and the response has no token usage, so the transcription cost path fell through to cost_per_second and returned $0. An authenticated caller could transcribe Soniox audio without decrementing their budget. Switch the entry to output_cost_per_second at Soniox's published $0.10/hour async rate so the stored duration produces a real charge. * fix(langfuse): use a dedicated httpx client for the SDK injection The httpx_client handed to the Langfuse SDK came from _get_httpx_client(), which returns LiteLLM's globally cached HTTPHandler. If Langfuse closed that client on teardown it would invalidate the shared client used by every other LiteLLM HTTP call. Build a dedicated httpx.Client instead, still resolving SSL verification and client certificate from LiteLLM's configuration. * fix(soniox): prefer caller-supplied api_base over SONIOX_API_BASE env var * fix(cohere): support max_completion_tokens on cohere v2 chat (default route) (#29779) * fix(cohere): support max_completion_tokens on cohere v2 chat The default cohere_chat route resolves to CohereV2ChatConfig, which did not list or map max_completion_tokens, so get_optional_params raised UnsupportedParamsError for the standard OpenAI parameter (the modern replacement for the deprecated max_tokens). The v1 config already maps it to cohere's max_tokens; mirror that in v2 and add v2 regression tests. * fix(cohere): make max_completion_tokens take precedence over max_tokens on v2 When both max_tokens and max_completion_tokens are supplied, prefer max_completion_tokens explicitly rather than relying on dict iteration order, and cover both orderings with a regression test. --------- Co-authored-by: Daniel Yudelevich <4537920+yudelevi@users.noreply.github.com> Co-authored-by: hectorc98 <hector.chamorroalvarez@adyen.com> Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com> Co-authored-by: Terrajlz <info@jouleselectrictech.com> Co-authored-by: Bruno Devaux <devaux.br@gmail.com> Co-authored-by: Dan Lemon <dan@danlemon.com> Co-authored-by: Saswat <saswatds@users.noreply.github.com> Co-authored-by: Brian Sparker <brainsparker@users.noreply.github.com> Co-authored-by: Zhao73 <156770117+Zhao73@users.noreply.github.com> Co-authored-by: Urain Ahmad Shah <60431964+urainshah@users.noreply.github.com> Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: kape <168134658+kapelame@users.noreply.github.com> Co-authored-by: danisalvaa <159898202+danisalvaa@users.noreply.github.com> Co-authored-by: Just R <remixingmagelang@gmail.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> Co-authored-by: abhay23-AI <abhaytrivedi22@gmail.com>	2026-06-05 13:51:51 -07:00
Mateo Wang	84247d954d	test(ci): record/replay OpenAI image gen so the spend E2E isn't outage-bound (#29787 ) * test(ci): record/replay OpenAI image gen so the spend E2E isn't outage-bound The dockerized spend test test_key_info_spend_values_image_generation curls the proxy for a gpt-image-1 image, which wildcard-routes to real api.openai.com on every commit; an OpenAI outage then reddens unrelated PRs and each run pays for an image. Add an in-repo record/replay reverse proxy (tests/_openai_record_replay_proxy.py) that sits between the proxy and OpenAI. The first run, and the first after the recording lapses, records live; subsequent runs replay from the shared Redis cassette store. The proxy keeps its real separate-process HTTP topology; only the image model's api_base is pointed at the recorder in CI via IMAGE_GEN_RECORDER_BASE_URL, which is unset elsewhere so it falls back to api.openai.com. Recordings lapse 24h after write and are never refreshed on read, matching the VCR persister contract, so provider drift is still caught. Replayed responses drop upstream framing/server headers (content-length, transfer-encoding, content-encoding, date, server) so the re-serving layer recomputes them, honoring the Bedrock content-length lesson. * test(ci): close recorder http client on app shutdown Add a Starlette lifespan that closes the self-created httpx.AsyncClient on teardown, and leave caller-injected clients untouched so reuse across create_app calls is not broken. Covers the unclosed-client ResourceWarning raised in review.	2026-06-05 10:27:23 -07:00
Mateo Wang	939cff0455	test(vcr): stop refreshing cassette TTL on read so cassettes lapse after 24h (#29784 ) The Redis cassette persister slid the 24h TTL forward on every successful read, so any cassette replayed at least once per day never expired. With CI running more than once a day that means a recorded response is replayed forever and the suite never re-hits the provider, so a changed request or response contract goes undetected indefinitely. Drop the refresh-on-read. The TTL now counts down from the last write, so a cassette lapses 24h after it was recorded and the next run past that point re-records live and catches provider drift. Per-commit runs in between still replay from cache; only the one boundary-crossing run goes live.	2026-06-05 10:22:41 -07:00
Sameer Kankute	074455c138	fix(auth): expand all-team-models sentinel in can_key_call_model for batch validation (#29746 ) * fix(auth): expand all-team-models sentinel in can_key_call_model Keys with models=["all-team-models"] were denied during batch JSONL model validation because can_key_call_model matched the literal string against the model name. Add _resolve_key_models_for_auth_check to expand the sentinel to team_models before the check, consistent with get_key_models in model_checks.py and the completion-route bypass. Co-authored-by: Cursor <cursoragent@cursor.com> * docs(auth): document empty team_models unrestricted access behavior; add regression test Adds a docstring note to _resolve_key_models_for_auth_check explaining that when team_models is empty, all-team-models resolves to [] which is treated as unrestricted access (consistent with get_key_models behavior on other auth paths). Adds a test to lock in this behavior. * fix(auth): deny all-team-models access when key has no team_id A key configured with models=["all-team-models"] but no team_id could previously resolve to an empty allowlist, which _check_model_access_helper treats as unrestricted access. Now the sentinel is only expanded when team_id is set; otherwise the unresolved sentinel stays in the model list and causes a deny (no real model name matches it). Same fix applied to get_key_models in model_checks.py for consistency across batch and non-batch auth paths. * style: black format model_checks.py * Fix batch all-team-models auth * style: black format batch_rate_limiter.py * fix(test): add tool_use_system_prompt_tokens to model prices schema validator * fix(batch): catch get_team_object errors to avoid 404 escaping batch auth * fix(batch): apply per-member model scope check after team auth in batch validation * Fail closed on batch team auth fetch errors * test(batch): cover team_object grant and member-scope denial in batch auth --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>	2026-06-05 09:04:45 -07:00
Sameer Kankute	89f177b7b6	fix(galileo): use ingest traces API and standard logging payload (#29651 ) * fix(galileo): use ingest traces API and standard logging payload Switch hosted Galileo logging to /ingest/traces with nested trace/span payloads, read metrics from standard_logging_object, and include cost and total tokens on trace metrics. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): route username/password auth to v2 traces ingest Hosted Galileo no longer serves /observe/ingest; JWT login should post the same trace payload to /v2/projects/{project_id}/traces. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): address Greptile review on logging and timestamps Use debug-level logs for per-request Galileo callback messages and fall back to start_time/end_time when standard_logging_object omits startTime/endTime. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(galileo): add Galileo to proxy UI callback configuration Expose Galileo in the admin callback selector and config APIs so credentials can be configured through the dashboard instead of YAML only. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): align response type logging with Langfuse Mirror Langfuse input/output handling for rerank, speech, transcription, realtime, pass-through, and other response types so Galileo ingest no longer skips supported call types. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): redact trace payload in debug logs and format with black Avoid logging prompts and model responses in flush debug output while keeping structural metadata for troubleshooting. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): stop logging full trace payload in debug output Log only flush URL and trace count so prompts and model responses are not written to application logs when debug logging is enabled. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix Galileo token totals and prompt messages --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-05 09:03:17 -07:00

1 2 3 4 5 ...

9972 Commits