litellm

Author	SHA1	Message	Date
Yassin Kortam	5e2db7eee4	feat(litellm): add models and repository layers (#29686 )	2026-06-06 20:59:33 -07:00
Mateo Wang	118176f21a	refactor(bedrock): build Converse toolSpec via a BedrockToolSpec dict subclass (#29869 )	2026-06-06 20:34:31 -07:00
yuneng-jiang	3448bf79f8	fix(ui): default guardrails page to first tab for admins, not submitted (#29872 ) The Guardrails page hardcoded defaultActiveKey="submitted", so admins landed on the "Submitted Guardrails" tab (the last of their four tabs) instead of the primary view. The original intent was for non-admins, whose only tab is Submitted Guardrails, to default there; admins should open on their first tab. Make the default role-aware: admins default to the first tab (Guardrail Garden), non-admins keep Submitted Guardrails.	2026-06-07 01:10:17 +00:00
Mateo Wang	13924fa1d6	feat: standardize rate limit errors with category, rate_limit_type, model, and llm_provider fields (#27687 ) * feat(exceptions): add RateLimitErrorCategory + headers/detail fields on RateLimitError LiteLLM previously surfaced rate-limit conditions through several unrelated error classes (RateLimitError, FastAPI HTTPException(429), BaseLLMException). This commit adds the data model needed to consolidate them under a single class: * RateLimitErrorCategory enum exposing four categorical values (vendor_rate_limit, vendor_batch_rate_limit, litellm_rate_limit, litellm_batch_rate_limit) so callers can switch on the rate-limit source. * New optional fields on RateLimitError: - category (defaults to vendor_rate_limit, preserving today's behavior for every existing call site in exception_mapping_utils); - headers (preserves retry-after / rate_limit_type / reset_at across the proxy boundary instead of dropping them on the floor); - detail (mirrors FastAPI HTTPException.detail so the same instance can be serialized through both paths). litellm.RateLimitErrorCategory is re-exported at the package root to match the existing exception-export pattern. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * feat(proxy): add ProxyRateLimitError unifying RateLimitError + HTTPException Adds a single proxy-side error class that subclasses BOTH litellm.exceptions.RateLimitError AND fastapi.HTTPException via cooperative multiple inheritance. Why both bases: * Subclassing RateLimitError lets user code catch every rate-limit source with one 'except RateLimitError' and switch on the new .category field. * Subclassing HTTPException keeps every existing FastAPI plumbing path (the isinstance(e, HTTPException) branches in proxy_server.py route handlers, FastAPI's own dispatcher, and tests asserting pytest.raises(HTTPException)) working without modification, and preserves retry-after / rate_limit_type / reset_at headers on the wire. The class declaration order is (HTTPException, RateLimitError) so the MRO puts HTTPException's no-super-call __init__ ahead of openai's cooperative __init__ chain — preventing openai.APIError.super().__init__(message) from landing in HTTPException.__init__(status_code=message). LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * refactor(proxy/hooks): raise ProxyRateLimitError from budget + iteration limiters Replaces three bare HTTPException(status_code=429, ...) call sites with ProxyRateLimitError, which is both a RateLimitError (catchable by category) and an HTTPException (preserves existing FastAPI serialization). Drops the now-unused HTTPException import in the iteration / per-session limiters. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * refactor(proxy/hooks): raise ProxyRateLimitError from parallel-request limiters Replaces HTTPException(status_code=429, ...) call sites in the v1 and v3 parallel-request limiters (key/team/user/model/customer rate limits) with ProxyRateLimitError. Updates the raise_rate_limit_error helper's return type annotation accordingly. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * refactor(proxy/hooks): raise ProxyRateLimitError from dynamic rate limiters Replaces HTTPException(status_code=429, ...) call sites in the v1 and v3 dynamic rate limiters (project-level TPM/RPM allocation, model-saturation checks, priority-based limits, fail-closed guards) with ProxyRateLimitError. The v3 limiter still imports HTTPException for an unrelated bare 'except HTTPException:' branch. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * refactor(proxy/hooks): raise ProxyRateLimitError from batch rate limiter Replaces HTTPException(status_code=429, ...) in batch_rate_limiter._raise_rate_limit_error with ProxyRateLimitError tagged as RateLimitErrorCategory.LITELLM_BATCH_RATE_LIMIT so users can distinguish batch-level throttling (which counts requests/tokens across an uploaded batch input file before submission) from the generic key/team/user RPM/TPM limiter. The HTTPException import is retained because the same module raises HTTPException for unrelated 403/IO error paths. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(rate-limit): pin down unified rate-limit error contract Adds a dedicated test module covering the new RateLimitErrorCategory enum, RateLimitError.category default + override behavior, ProxyRateLimitError's dual nature (RateLimitError + HTTPException), and a parametrized regression guard that asserts every proxy hook module imports the unified class. The regression guard catches the failure mode the refactor is designed to prevent: someone re-introducing a bare HTTPException(status_code=429, ...) in one of the hook modules instead of going through ProxyRateLimitError. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * feat(logging): expose rate-limit category via StandardLoggingPayload Adds an optional 'error_rate_limit_category' field to StandardLoggingPayloadErrorInformation, populated from the unified RateLimitError.category attribute (introduced in the previous commits on this branch). Why: the .category attribute is reachable off the raw exception today via getattr(e, 'category', None), but the structured contract that downstream custom callbacks / loggers / spend log writers consume is the StandardLoggingPayload. Without this field, a user building custom rate-limit metrics on top of callback data has to special-case the raw exception object — which defeats the purpose of the StandardLoggingPayload abstraction. The field is None for non-rate-limit exceptions (so consumers can read it unconditionally without isinstance checks) and is one of the RateLimitErrorCategory string values otherwise. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(rate-limit): assert StandardLoggingPayload carries the category Five tests covering: vendor default, explicit litellm_rate_limit and litellm_batch_rate_limit values, None for non-rate-limit exceptions, and None when no exception is provided. Pins down the contract that custom callbacks can read 'error_information.error_rate_limit_category' off the StandardLoggingPayload to drive custom rate-limit metrics without ever reaching for the raw exception. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(types): silence mypy [misc] on intentional dual-base attr overlap mypy emits two [misc] errors on the ProxyRateLimitError class line because its two bases declare overlapping attributes with related-but-not-identical annotations: * status_code: int on starlette HTTPException vs. Literal[429] on openai's RateLimitError (every openai status-error subclass narrows it the same way and silences pyright with the same convention). * headers: Mapping[str, str] \| None on HTTPException vs. our Optional[ Dict[str, str]] (the proxy hooks always carry a stringified dict). Both narrowings are intentional and enforced at construction time. Add a type: ignore[misc] with an inline explanation rather than relax the annotations on the parent or change the wire-format guarantees. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(rate-limit): add direct hook-invocation tests to lift patch coverage Adds six end-to-end tests that drive each refactored hook past its limit and assert the unified ProxyRateLimitError is raised with the correct category and dual-base shape. Complements the import-shape-only parametrized guard above by actually executing the new 'raise ProxyRateLimitError(...)' lines so codecov's patch coverage sees them as hit. Hooks covered (one test each): * parallel_request_limiter v1 — direct call to raise_rate_limit_error() * parallel_request_limiter v3 — direct call to _handle_rate_limit_error with a fabricated OVER_LIMIT response * max_iterations_limiter — full async_pre_call_hook with mocked agent registry, second call exceeds budget=1 * max_budget_limiter — async_pre_call_hook with mocked get_current_spend * dynamic_rate_limiter v1 — async_pre_call_hook with mocked check_available_usage forcing available_tpm == 0 * batch_rate_limiter — direct _raise_rate_limit_error call, asserts category is the batch-specific LITELLM_BATCH_RATE_LIMIT (not the generic LITELLM_RATE_LIMIT) LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix: guard rate_limit_category extraction with isinstance check * test(rate-limit): cover remaining hook raise sites for codecov Adds five more direct hook-invocation tests so every PR-touched line in the proxy hooks is exercised by tests in tests/test_litellm/, which codecov measures: * parallel_request_limiter v1 — check_key_in_limits inline raise (the second raise site, separate from the raise_rate_limit_error helper covered earlier) * dynamic_rate_limiter v1 — RPM raise branch (TPM branch was already covered) * dynamic_rate_limiter v3 — parametrized over all three raise sites: model_saturation_check, priority_model, and the fail-closed fallback for an unrecognized descriptor_key * max_budget_per_session_limiter — full async_pre_call_hook with a mocked agent registry and over-budget cached spend All 42 tests in test_rate_limit_error_unification.py now pass and together exercise every changed import + raise line across the eight refactored proxy hooks. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix: use computed error_message in ProxyRateLimitError detail * fix(parallel-request-limiter): drop None from detail; annotate raise_rate_limit_error as NoReturn The v1 ' raise_rate_limit_error' helper built an unused 'error_message' variable and then assembled the actual ' detail' via an f-string that interpolated 'additional_details' verbatim — producing 'Max parallel request limit reached None' when invoked without arguments (flagged by code review). Fix the helper to: - use the constructed 'error_message' as the detail - annotate the helper as NoReturn since it always raises - drop the redundant 'raise'/'return' at the two call sites Add two regression tests covering both the with- and without- additional_details paths. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(proxy/hooks): drop literal 'None' from raise_rate_limit_error detail The v1 parallel_request_limiter's raise_rate_limit_error helper has a long-standing bug: it computes a None-guarded 'error_message' string but then ignores it and emits an f-string that interpolates the raw 'additional_details' arg. Callers that pass no argument get 'Max parallel request limit reached None' as the user-facing detail. This commit: * wires error_message into the detail kwarg so the None-guard actually applies and operators see a clean message; * changes the return-type annotation from ProxyRateLimitError to NoReturn (the function always raises) so type-checkers know callers after this invocation are unreachable. Greptile P1 + P2 review feedback on PR #27687. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(types): demote TypedDict floating string to a # comment A string literal placed after a field declaration in a TypedDict body is not a per-field docstring — it's an orphaned string expression Python discards. Tools like mypy / pyright that inspect TypedDict fields won't surface that text either. Move the documentation for error_rate_limit_category to a real comment so the intent is visible to readers and type-checker tooling without the misleading docstring framing. Greptile P2 review feedback on PR #27687. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * security(exceptions): do not auto-copy vendor response headers to e.headers A vendor 429 response can set arbitrary headers (Set-Cookie, CORS overrides, …). Previously, when RateLimitError was constructed with only a 'response=' (no explicit 'headers=' kwarg), self.headers fell back to a copy of response.headers. If a downstream proxy serializer ever forwarded e.headers to the client, a malicious upstream could inject browser-interpreted headers for the proxy origin. Drop the fallback. Only headers passed explicitly via the headers= kwarg make it onto self.headers (proxy hooks pass retry-after etc. — they control what's surfaced). Vendor response headers stay reachable on e.response.headers for callers that explicitly want them. Today's proxy_server.py route handlers don't actually forward e.headers on the wire (they construct ProxyException without passing headers), so no current behavior changes — this is a defensive narrowing so the fallback can never be turned into a vector when someone wires e.headers through later. Veria-AI security review feedback on PR #27687. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(rate-limit): regression guards for review-pass fixes Pins down the three review-pass fixes: * test_parallel_request_limiter_v1_helper_no_additional_details — calls raise_rate_limit_error() with no args and asserts the detail does NOT contain the literal string 'None'. Pre-fix, callers got 'Max parallel request limit reached None'. * test_rate_limit_error_does_not_auto_copy_response_headers — passes a vendor httpx.Response with a Set-Cookie header to RateLimitError WITHOUT an explicit headers= kwarg, asserts self.headers stays None (no leak), then re-checks that an explicit headers= kwarg DOES populate self.headers. Vendor headers remain reachable on e.response.headers for callers that explicitly want them. * The existing v1-helper test now also asserts the additional_details string makes it through to the detail. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * feat(rate-limit): add orthogonal RateLimitType (requests/tokens/concurrent_requests/budget/max_iterations) trho's last ask in the LIT-2968 thread: distinguish rate-limit failures by the dimension that was exceeded, not just by who rate-limited (vendor vs. litellm). Adds: - RateLimitType str-enum exposed at `litellm.RateLimitType` with values requests / tokens / concurrent_requests / budget / max_iterations. - `rate_limit_type` kwarg on litellm.RateLimitError + ProxyRateLimitError; None default so existing callers (vendor-429 path in exception_mapping_utils) remain a no-op. - StandardLoggingPayloadErrorInformation.error_rate_limit_type so custom callbacks can split rate-limit failures by cause without parsing free-text error messages. Mirror to error_rate_limit_category extraction in get_error_information(); single isinstance(RateLimitError) check covers both. - map_v3_rate_limit_type() helper to collapse the v3 limiter's internal labels ("requests", "tokens", "max_parallel_requests") onto the public enum so the v3 limiter and dynamic_rate_limiter_v3 share one mapping. Defensive None on unknown values rather than silently picking a wrong dimension. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * feat(proxy/hooks): wire rate_limit_type onto every limiter raise site Each refactored proxy hook now populates rate_limit_type with the dimension that actually tripped the limit, so downstream consumers (custom callbacks, prometheus exporters via the StandardLoggingPayload) can split key/team/user rate-limit failures by cause: - parallel_request_limiter (v1): detect dimension from current vs. limit in the post-cache branch (concurrent_requests > tokens > requests, matches the boolean condition order). Base case (current is None, one limit set to 0) picks the most-specific zero. raise_rate_limit_error() helper accepts an explicit rate_limit_type kwarg with CONCURRENT_REQUESTS default (matches every existing internal call site, including the global-limit branch). - parallel_request_limiter (v3): forward status["rate_limit_type"] through map_v3_rate_limit_type() so "max_parallel_requests" → CONCURRENT_REQUESTS for the public field while the raw v3 jargon stays on the HTTP header for wire-format backward compat. - dynamic_rate_limiter (v1): TPM-zero → TOKENS, RPM-zero → REQUESTS. Pass data["model"] through so callbacks see the model that hit the limit (addresses the secondary "provider missing" complaint in the original Slack thread, partially — the model is what dashboards typically split on). - dynamic_rate_limiter (v3): forward status["rate_limit_type"] via map_v3_rate_limit_type() at every raise site (model_saturation_check, priority_model, fail-closed unknown-descriptor guard). Also pass model. - batch_rate_limiter: limit_type is hard-typed "requests"\|"tokens" — map directly without going through the helper's None branch. - max_budget_limiter, max_budget_per_session_limiter: BUDGET. - max_iterations_limiter: MAX_ITERATIONS. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(rate-limit): cover RateLimitType enum, hook wiring, and StandardLoggingPayload propagation 27 new tests across five new test classes: - TestRateLimitType: enum exposed at litellm.RateLimitType, all five values defined, RateLimitError default is None (vendor 429 path makes no claim about which dimension), accepts both string and enum forms with str-coercion guarantee for downstream JSON serializers. - TestProxyRateLimitErrorType: ProxyRateLimitError default is None, accepts string or enum, doesn't break existing callers that pass nothing. - TestMapV3RateLimitType: pins each v3-internal → public-enum mapping (tokens, requests, max_parallel_requests → concurrent_requests, unknown → None) so a future v3 refactor can't silently swap dimensions. - TestStandardLoggingPayloadCarriesType: the new error_rate_limit_type field reaches the structured payload for both ProxyRateLimitError and plain RateLimitError, is None when unspecified, and is None for non-rate-limit exceptions (symmetric with error_rate_limit_category). - TestProxyHooksWireTypeCorrectly: drives the actual raise sites in the v1 parallel_request_limiter helper, the v3 _handle_rate_limit_error (both "tokens" and "max_parallel_requests" paths), and the batch limiter (both tokens and requests paths) — coverage tools see the new rate_limit_type= kwargs as exercised, not just the import shape. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(rate-limit): cover _coerce_message branches and v1 dimension detection Drives the patch coverage on the new orthogonal RateLimitType wiring up to (or close to) 100% on the touched files. ProxyRateLimitError._coerce_message — was 22% covered, now 100%: * nested {error: {message}} dict * nested {message: {message}} dict (alt key) * dict without 'error'/'message' keys → JSON dump fallback * non-JSON-serializable dict value → str() fallback * non-string non-mapping detail (int) → str() coercion v1 parallel_request_limiter dimension detection — was 0% covered, now exercised across 6 parametrized cases: * check_key_in_limits else-branch: current at concurrent / TPM / RPM cap → asserts rate_limit_type is concurrent_requests / tokens / requests. * check_key_in_limits base case (current is None): max_parallel_requests / tpm_limit / rpm_limit set to 0 → asserts the most-specific zero attribution wins per the helper's order. LIT-2968 Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * feat(proxy/hooks): add ProxyHTTPRateLimitError + provider resolver Introduces a small helper layer used by every proxy-side rate-limit hook so that the 429 they raise carries a populated llm_provider / model — instead of an empty exception.llm_provider that downstream loggers (Prometheus failure metric, observability callbacks) read as 'no provider attribution'. ProxyHTTPRateLimitError inherits from both fastapi.HTTPException (so the proxy server still renders it as a 429) and litellm.exceptions.RateLimitError (so isinstance checks and PrometheusLogger._get_exception_class_name pick up llm_provider). We deliberately don't call RateLimitError.__init__ — it constructs an httpx.Response we don't need and would just add failure surface; attribute parity is what downstream consumers care about. resolve_llm_provider_for_rate_limit() wraps litellm.get_llm_provider defensively. Internal limiter hooks fire from async_pre_call_hook — well before get_llm_provider runs anywhere else in the request lifecycle — so we have to call it ourselves at raise time. If the model is missing or unparseable (alias, router-only model) we fall back to llm_provider='litellm_proxy' rather than letting a second exception leak out and break the request path. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(proxy/hooks): populate llm_provider on parallel-request 429s Both v1 and v3 parallel-request limiters fired bare HTTPException(429) from inside async_pre_call_hook. The downstream Prometheus failure metric reads exception.llm_provider via _get_exception_class_name — the empty value showed up as exception_class='HTTPException' and left model_id='None' on the time series. Threads requested_model through every raise site in: * parallel_request_limiter.py: - check_key_in_limits (the per-key/per-model/per-user/per-team/ per-customer over-limit path) - raise_rate_limit_error (zero-limit + global_max_parallel_requests paths) — now takes an optional requested_model kwarg * parallel_request_limiter_v3.py: - _handle_rate_limit_error (the OVER_LIMIT translator), called from both the should_rate_limit pre-check and the TPM reservation path Resolved via resolve_llm_provider_for_rate_limit so unknown / missing models silently fall back to llm_provider='litellm_proxy' instead of breaking the request path with a second exception. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(proxy/hooks): populate llm_provider on dynamic-rate-limit 429s Same plumbing change as the parallel limiters, applied to both dynamic_rate_limiter (v1) and dynamic_rate_limiter_v3: * v1: TPM-zero and RPM-zero paths in async_pre_call_hook now resolve data['model'] -> (model, llm_provider) once and pass it into both raises. * v3: All three raise sites in _check_rate_limits — the model_saturation_check enforced raise, the priority_model enforced raise, and the fail-closed unknown-descriptor branch — now attribute the 429 to the actual provider. Falls back to llm_provider='litellm_proxy' when the model can't be resolved. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(proxy/hooks): populate llm_provider on batch-rate-limit 429s batch_rate_limiter._raise_rate_limit_error now takes a requested_model kwarg threaded from data['model'] in _check_and_increment_batch_counters. The batch-creation 429 is what gets raised when the input file's tokens/requests count would push the per-key TPM/RPM window over its limit. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(proxy/hooks): populate llm_provider on budget/iterations 429s Final batch of internal raise sites — the user/session-budget and max-iterations hooks. Same pattern: resolve data['model'] once at raise time, attach to ProxyHTTPRateLimitError so Prometheus and observability callbacks can attribute the 429. Hooks updated: * max_budget_limiter (per-user max_budget exceeded) * max_iterations_limiter (per-session agent iteration cap) * max_budget_per_session_limiter (per-session dollar cap) All three fall back to llm_provider='litellm_proxy' when data['model'] is missing or unparseable. Drops the now-unused HTTPException import from each module. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(proxy/hooks): pin provider field on internal rate-limit 429s Regression coverage for the 'provider field missing' bug across every proxy-side rate-limit hook + the helper layer: * ProxyHTTPRateLimitError class shape (HTTPException + RateLimitError, dict-detail stringification, None-provider normalization). * resolve_llm_provider_for_rate_limit happy paths (gpt-4o-mini, anthropic/..., bedrock/...) plus all three fallback branches (None, '', unknown name) plus a 'get_llm_provider raises' case that asserts we swallow the secondary exception. * For each limiter (parallel v1/v3, dynamic v1/v3, batch, max_budget, max_iterations, max_budget_per_session): assert the raised exception is a RateLimitError carrying the resolved model + llm_provider, and a sibling test that asserts the fallback path returns 'litellm_proxy' without leaking a second exception. * Two PrometheusLogger._get_exception_class_name pins so the Prometheus failure metric label flips from 'HTTPException' to 'Openai.ProxyHTTPRateLimitError' (or 'Litellm_proxy.' on fallback) — that's what dashboards consume. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> perf(proxy/hooks): defer provider resolution to over-limit branches * fix: use error_message in raise_rate_limit_error to avoid literal 'None' in detail * Consolidate rate_limiter_utils imports in dynamic_rate_limiter * fix(proxy): set num_retries/max_retries on ProxyHTTPRateLimitError ProxyHTTPRateLimitError inherits from RateLimitError but did not call RateLimitError.__init__, so num_retries/max_retries were never set. When Starlette's HTTPException lacks __str__, MRO falls through to RateLimitError.__str__, which unconditionally reads these attributes and raises AttributeError during logging/traceback formatting. Initialize them to None defensively. * fix(mypy): silence base-class status_code conflict on ProxyHTTPRateLimitError HTTPException declares 'status_code: int' while openai.RateLimitError (via APIStatusError) declares 'status_code: Literal[429] = 429'. Mypy flags the multi-base override as [misc] in CI lint. The runtime semantics are fine (we set self.status_code in __init__), so silence the class-level annotation conflict with a targeted ignore. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix: annotate batch limiter _raise_rate_limit_error as NoReturn * feat(prometheus): rate-limit category/type labels + exception_class back-compat (follow-up to #27687) (#27706) * feat(prometheus): add rate_limit_category and rate_limit_type labels Adds two new labels to litellm_proxy_failed_requests_metric so dashboards can split 429s by rate-limit source (vendor vs. litellm-internal) and by the dimension that was exceeded (requests/tokens/concurrent_requests/ budget/max_iterations) without parsing free-text error messages. Closes the Prometheus side of LIT-2718. The unified RateLimitError.category and .rate_limit_type fields landed in PR #27687 but were only surfaced on StandardLoggingPayload (custom-callback channel); this exposes them on the metric label set as well. Both labels are populated only when the underlying exception is a litellm.RateLimitError; non-rate-limit failures keep them empty. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * feat(prometheus): populate rate-limit labels + preserve exception_class back-compat Two coupled changes in the Prometheus integration: 1. async_post_call_failure_hook now extracts the new RateLimitError .category / .rate_limit_type fields (added in PR #27687) via a _extract_rate_limit_labels helper and forwards them through UserAPIKeyLabelValues onto litellm_proxy_failed_requests_metric. Empty for non-rate-limit failures. 2. _get_exception_class_name special-cases ProxyRateLimitError and keeps emitting 'HTTPException' for the exception_class label. Without this shim, ProxyRateLimitError (which multi-inherits from HTTPException + RateLimitError) would silently flip the label from 'HTTPException' (the historical value for proxy-side 429s) to 'ProxyRateLimitError', breaking existing dashboards / alerts that key off exception_class='HTTPException'. Distinguishing vendor vs. litellm 429s is now the job of the new rate_limit_category label. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(prometheus): cover rate-limit labels and exception_class back-compat Adds 19 tests across: - enum / label-list registration - _extract_rate_limit_labels for vendor RateLimitError, ProxyRateLimitError, non-rate-limit and None inputs (incl. parametrized over every RateLimitErrorCategory x RateLimitType combo) - _get_exception_class_name back-compat: ProxyRateLimitError keeps the legacy 'HTTPException' string while vendor RateLimitError keeps the historical 'Provider.ClassName' format - end-to-end through async_post_call_failure_hook with both ProxyRateLimitError and vendor RateLimitError, asserting both new labels populate and exception_class stays back-compat Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(prometheus): tolerate missing fastapi in lazy ProxyRateLimitError import Address greptile feedback: - async_post_call_failure_hook docstring: drop the stale labelnames listing and reference PrometheusMetricLabels.litellm_proxy_failed_requests_metric as the source of truth so the doc cannot drift from the actual labelset. - _get_exception_class_name: guard the lazy ProxyRateLimitError import with ImportError so router-side fallback callsites don't blow up in non-proxy installs that don't have fastapi (a transitive dep of proxy.common_utils.proxy_rate_limit_error). Behavior is unchanged when fastapi is available. Also fix the existing enterprise callback test that asserted the old labelset on litellm_proxy_failed_requests_metric — it now expects the new rate_limit_category / rate_limit_type labels populated for vendor 429s. --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(bugbot): simplify rate-limit label coercion + guard None detail - prometheus.py _extract_rate_limit_labels: RateLimitError.__init__ already normalizes category/rate_limit_type to plain str, so the getattr(.value) + isinstance dance was dead code. Reduce to str(value) if not None. - proxy_rate_limit_error.py _coerce_message: short-circuit None to '' instead of falling through to str(None) = 'None', which produced the literal message 'litellm.RateLimitError: None'. * fix(rate-limit): surface unified category/type fields on BudgetExceededError The most common budget cap (virtual-key max_budget enforcement in auth_checks.py) raises litellm.BudgetExceededError, a bare Exception subclass that bypassed the unified rate-limit error class introduced by PR #27687. Custom callbacks reading StandardLoggingPayload.error_information saw category=None and rate_limit_type=None for these 429s, missing the most common budget case (team / org / end-user budgets all hit the same code path). Surface the fields off BudgetExceededError as plain attributes: - category = RateLimitErrorCategory.LITELLM_RATE_LIMIT - rate_limit_type = RateLimitType.BUDGET - llm_provider = "" (or caller-supplied) Switch get_error_information and _extract_rate_limit_labels from isinstance(RateLimitError) gating to duck-typed attribute reads, guarded by membership in the rate-limit enums so unrelated third-party exceptions exposing a .category attribute can't leak garbage values into the payload. This is strictly additive: BudgetExceededError keeps its bare-Exception base class, so `except BudgetExceededError:` handlers keep firing and `except RateLimitError:` does not start catching budget errors. * fix(rate-limit): validate enum membership at duck-typed read sites + enrich BudgetExceededError llm_provider Two follow-ups uncovered during the second QA pass on PR #27687: 1. Guard third-party `.category` / `.rate_limit_type` attribute leakage. The duck-typed read in `get_error_information` and `_extract_rate_limit_labels` would forward any string attribute named `category` / `rate_limit_type` on an unrelated third-party exception into the StandardLoggingPayload and Prometheus labels — silently mislabeling custom-callback payloads and blowing out Prometheus label cardinality. Add `validate_rate_limit_category` / `validate_rate_limit_type` helpers that gate on the documented enum value sets; non-matching values are dropped to None. 2. Enrich BudgetExceededError.llm_provider from request_data. Budget checks live in tenant-scoped helpers (key / team / org / tag / end-user / project) that don't see the request model, so the BudgetExceededError they raise carried llm_provider="" — leaving custom-metrics consumers without provider attribution for the most common 429 case. Resolve it once at the central UserAPIKeyAuthExceptionHandler seam, before post_call_failure_hook fires, so the StandardLoggingPayload the callback sees has the same provider attribution as RPM/TPM 429s. Regression tests pin both: 4 leakage tests + 4 enrichment tests. The leakage tests would fail under the pre-validation version of either read site; the enrichment tests would fail if the handler skipped the resolver call. * fix(rate-limit): resolve router model_name aliases to real provider (#27914) * fix(rate-limit): resolve router model_name aliases to real provider For nearly every real LiteLLM proxy deployment the request model is a router model_name alias (e.g. 'tpm-locked' -> litellm_params.model: openai/gpt-4o-mini), and 'litellm.get_llm_provider' doesn't know about router aliases — it raises 'LLMProviderNotProvidedError'. The resolver then fell through to the defensive 'litellm_proxy' fallback, so the 'llm_provider' field this PR adds was effectively always 'litellm_proxy' in the field, defeating its purpose for the most common proxy configuration. Add a router-alias fallback step: when 'get_llm_provider' raises, scan the active 'llm_router.model_list' for a deployment whose 'model_name' matches the request model and resolve from its 'litellm_params.model' instead. If multiple deployments share the same alias (load-balancing case) the first one wins — every deployment under one alias should agree on provider in any sensible config, and 'first' is deterministic so the Prometheus label stays stable. Defensive throughout: an uninitialized router, a malformed deployment, a 'litellm_params.model' that itself fails 'get_llm_provider' — every branch falls through to the existing 'litellm_proxy' fallback rather than letting a secondary exception escape and mask the rate-limit error we're trying to surface. Tests: - test_router_alias_resolves_to_underlying_provider: alias 'tpm-locked' -> 'openai/gpt-4o-mini' produces provider='openai', model='gpt-4o-mini'. - test_router_alias_with_multiple_deployments_uses_first. - test_router_alias_unknown_falls_back. - test_router_alias_with_malformed_deployment_falls_back. - Existing fallback test updated to also stub 'litellm.proxy.proxy_server.llm_router' so it exercises the full 'no resolution anywhere' path. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(rate-limit): harden router alias resolver + test isolation - Wrap _resolve_provider_from_router_alias loop in top-level try/except so a non-iterable model_list / unexpected deployment shape can't escape and mask the 429 with a 500. - Type-check litellm_params before .get() to handle non-dict truthy values. - Patch llm_router=None in the parametrized fallback test so a router left by another test in the session can't redirect the unknown-model path. --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(bugbot): preserve "BudgetExceededError" Prometheus label Adding llm_provider to BudgetExceededError (so callbacks get provider attribution from StandardLoggingPayload) made the provider-prefix step in _get_exception_class_name silently flip the label from "BudgetExceededError" to e.g. "Openai.BudgetExceededError", breaking dashboards keyed on the historical value. Short-circuit BudgetExceededError in _get_exception_class_name the same way ProxyRateLimitError already is. Provider/category attribution still lands on the new rate_limit_category / rate_limit_type labels. * test: fix invalid 'rpm' rate_limit_type in v3 limiter test mocks The v3 rate limiter only emits 'requests', 'tokens', or 'max_parallel_requests'. Using 'rpm' caused map_v3_rate_limit_type to return None, leaving the expected RateLimitType.REQUESTS untested. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(bugbot): hoist provider resolver + opt-in prom rate-limit labels - dynamic_rate_limiter.py: hoist resolve_llm_provider_for_rate_limit above the TPM/RPM if/elif so the lookup runs once per request, matching the pattern in dynamic_rate_limiter_v3.py. - prometheus.py: gate the new rate_limit_category / rate_limit_type labels on litellm_proxy_failed_requests_metric behind litellm.prometheus_emit_rate_limit_labels (default False). Mirrors the existing prometheus_emit_stream_label opt-in. Preserves the metric's pre-unification label set so existing dashboards / recording rules keep matching after upgrade; operators can enable the new labels once downstream consumers include them. - Tests updated: default-off back-compat case, opt-in path enables the flag before asserting label presence. * fix: stabilize prometheus label sets and drop redundant model normalization - Cache PrometheusLogger.get_labels_for_metric per metric_name so that the label set used to construct counters at __init__ time stays in sync with the label set used at increment time, even if module-level toggles like prometheus_emit_rate_limit_labels or prometheus_emit_stream_label are flipped at runtime. Without this, toggling these flags after the logger was created would cause ValueError from prometheus_client because the runtime labels would not match the counter's declared labelnames. - Drop redundant 'model or ""' guard in ProxyRateLimitError.__init__ where model is already normalized one step earlier. Co-authored-by: Yassin Kortam <yassin@berri.ai> * perf(dynamic_rate_limiter): only resolve provider when rate limit hit Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(prometheus): clear cached metric labels after toggling rate-limit flag The PrometheusLogger caches each metric's label set at construction time so that labels used at counter.labels(...) time stay consistent with the labels the metric was registered with. The enterprise async_post_call_failure_hook test toggles litellm.prometheus_emit_rate_limit_labels = True AFTER the fixture has already built the logger, so without invalidating the cache the rate_limit_category / rate_limit_type labels never reach the mocked counter and the assert_called_once_with check fails. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test: fix CI failures from prom label cache + flaky time-window assertion PrometheusLogger.get_labels_for_metric now caches the per-metric label set at first read so the labels passed to counter.labels(...) stay in lock step with the labels the counter was registered with. This broke two existing test patterns: - test_prometheus_labels.py: tests bind the real method onto a MagicMock, but MagicMock auto-creates a Mock for _cached_metric_labels whose .get(...) returns a truthy Mock — treated as a populated cache and returned as the label set, producing empty filtered labels and KeyError on labels["requested_model"] / ["route"]. Seed real {} containers for _cached_metric_labels and label_filters before binding. - test_prometheus_logging_callbacks.py::test_set_team_budget_metrics_with_custom_labels: the fixture builds the logger before the test monkeypatches litellm.custom_prometheus_metadata_labels, so the cached label set never picks up the new metadata labels. Clear the cache after the monkeypatch (same pattern already used for the rate-limit toggle in test_async_post_call_failure_hook). UI: view_logs/index.test.tsx "Last Minute" window assertion is off by one at the minute boundary. start_date is floored to the minute, so the dropped sub-minute fraction can push the truncated-seconds diff up to (minMinutes+1)60 exactly when the click lands near a minute rollover. Switch the upper bound to toBeLessThanOrEqual. feat(otel-v2): surface rate_limit_category + rate_limit_type on failed LLM-call spans PR #28909 introduced the typed v2 OTel engine that builds spans from StandardLoggingPayload, with SpanError carrying error_type + message and the genai mapper stamping error.type onto every failed LLM-call span. This PR's earlier commits added error_rate_limit_category and error_rate_limit_type to the same StandardLoggingPayload.error_information the v2 engine reads — but neither field reached a span attribute, so v2 OTel traces stayed opaque about why a 429 fired (vendor vs litellm, RPM vs TPM vs concurrent vs budget vs max_iterations) even after the custom-callback and prometheus surfaces gained that decomposition. Three coupled changes: 1. semconv.py: add LiteLLM.ERROR_RATE_LIMIT_CATEGORY / LiteLLM.ERROR_RATE_LIMIT_TYPE under the litellm.* vendor namespace (no GenAI semconv equivalent exists for who-rate-limited / which-dimension). 2. payloads.py: extend SpanError with rate_limit_category + rate_limit_type, populated by _parse_error() from the same error_information.error_rate_limit_* fields the custom-callback channel and prometheus rate_limit_category / rate_limit_type labels read. Single source of truth across all three observability surfaces. 3. mappers/genai.py: stamp the two attributes on the LLM-call span when present. drop_none guarantees they stay absent (not 'None') for non-rate-limit failures so trace consumers can read them unconditionally. Three regression tests in test_otel_v2_emitter.py pin: a vendor / litellm-internal RateLimitError lands category=litellm_rate_limit + rate_limit_type=requests on the span; a BudgetExceededError lands rate_limit_type=budget; a non-rate-limit failure (BadRequestError) keeps the rate_limit_* attributes absent. Mutation-tested against reverting either the SpanError extension or the _parse_error read site — both new tests fail under either mutation. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test: align prometheus user-budget + logs quick-select tests with merged code The merge into this branch left two test patterns out of step with the code they exercise. test_set_user_budget_metrics_includes_user_email_and_alias_labels_when_opted_in flipped litellm.prometheus_user_budget_label_include_email_alias after the fixture had already built the PrometheusLogger. get_labels_for_metric now snapshots each metric's label set at construction time, so the runtime flip no longer reached the cached labels. Enable the flag before constructing the logger, matching how the proxy applies config at startup. view_logs/index.test.tsx referenced uiSpendLogsCall and moment without importing them, and the merged index.tsx now fetches through useLogFilterLogic (the hook the file stubs out) rather than calling uiSpendLogsCall directly. Add the imports and restore the real hook for the Quick Select window assertions so the call is actually observed. * refactor(otel/v2): drop rate-limit decomposition from the LLM-call span Proxy-side rate limits (litellm_rate_limit, budget, max_iterations) are rejected at the gate before any upstream call, so async_post_call_failure_hook tags the synthetic failure log with LITELLM_LOGGING_NO_UPSTREAM_LLM_CALL and the v2 OTel logger never opens an LLM-call span for them; the litellm.error.rate_limit_category / litellm.error.rate_limit_type attributes were dead for exactly the cases they were meant to surface. The only failure that does open an LLM-call span carrying a RateLimitError is a vendor 429, where rate_limit_type is always None and the category just restates error.type=RateLimitError. The decomposition still reaches downstream consumers through StandardLoggingPayload.error_information.error_rate_limit_* and the prometheus rate_limit_category / rate_limit_type labels, both unchanged. Removes the SpanError fields, the _parse_error reads, the genai mapper attributes, the semconv keys, and the three span tests that asserted a scenario that never reaches the mapper in production. * fix(batch_rate_limiter): map max_parallel_requests to concurrent_requests * refactor(prometheus): drop transitive fastapi import from _get_exception_class_name Read the legacy exception_class label from a prometheus_exception_class_name marker on ProxyRateLimitError instead of importing the proxy module, keeping the integrations layer free of a transitive fastapi dependency. * chore(ui): sync schema.d.ts with unified rate-limit error spec The ProxyRateLimitError docstring flows into the proxy OpenAPI spec's 429 response description, so the generated dashboard types were out of sync. Regenerated via npm run gen:api (Check UI API Types Sync). --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> Co-authored-by: Yassin Kortam <yassin@berri.ai>	2026-06-06 17:50:29 -07:00
yuneng-jiang	7bfce053a9	fix(ui): make workflow runs page fill full width (#29868 ) The Workflow Runs page rendered its table at roughly a quarter of the available width. Its root container is a flex child of the dashboard content row but set only padding, min-height and background, so with no width it shrank to the table's natural content size. Sibling pages (logs, memory) fill the area with a full-width root; mirror that by setting width 100% on the container. Fixes LIT-3636	2026-06-06 17:41:36 -07:00
ryan-crabbe-berri	f31d059aa3	feat(ui): add budget duration to edit team member form (#29717 ) * feat(ui): add budget duration to edit team member form Editing a team member created a member budget with no duration, so the budget never reset. This threads a budget reset period through the edit flow end to end and reuses the shared duration dropdown so the options stay in sync with the rest of the UI. Resolves LIT-2651 * fix(proxy): validate member budget_duration and persist clears Reject budget_duration values that can't be parsed, are non-positive, or overflow date math before any write, so a bad value can't be persisted and later crash the budget reset job. Clearing the budget duration in the edit-member form now sends null and clears the column end to end, so the dropdown's clear control reflects a real change instead of being a no-op * chore(ui): regenerate schema.d.ts for member budget_duration Adds budget_duration to TeamMemberUpdateRequest/Response in the generated dashboard types so the Check UI API Types Sync gate passes	2026-06-06 17:24:55 -07:00
Mateo Wang	aeb55e7a11	fix(mcp): highlight MCP cards red when the logged-in user is missing per-user env vars (#29856 ) * fix(mcp): flag missing per-user env vars on the card for every accessible server The dashboard MCP card grid lists servers via the registry-backed manager (get_all_mcp_servers_unfiltered for admins in view_all mode, the allowed-context aggregation otherwise), but the per-user env-var status endpoint that drives the red "user fields missing" highlight resolved servers through the much narrower get_all_mcp_servers_for_user, which only returns servers explicitly granted on the calling key. An admin's dashboard session key carries no per-server MCP grant, so the status feed came back empty and the card never turned red even when the logged-in user had not filled in their required variables. Both surfaces now share a single _resolve_accessible_mcp_servers helper, so the status feed is computed over exactly the cards the user sees. The helper returns servers unredacted; the status endpoint needs the raw env_vars and still only ever reports is_set booleans, never the stored secret values. * test(mcp): drop dead get_all_mcp_servers_for_user patch from view_all regression test The bulk status endpoint resolves servers through _resolve_accessible_mcp_servers now, so the old get_all_mcp_servers_for_user patch in the admin view_all regression test is never hit. Removing it keeps the test honest about which code path it exercises.	2026-06-06 16:51:25 -07:00
Mateo Wang	d61f7747c0	feat(bedrock): forward strict and additionalProperties to Converse toolSpec (#29814 ) * feat(bedrock): forward strict and additionalProperties to Converse toolSpec Bedrock Converse supports strict in toolSpec since 2026-02, but _bedrock_tools_pt only whitelisted type/properties/required/name/description, so strict: true was silently dropped and Claude-on-Bedrock ignored enum constraints that GPT and direct-Anthropic honored. Forward strict from the OpenAI function and additionalProperties from the schema (Bedrock requires the latter alongside strict), passing each only when present. https://claude.ai/code/session_01WQjWd8NfUB3vxERwudbHkv * fix(bedrock): only forward strict tool schemas to Claude on Converse Nova, Llama and GPT-OSS on Bedrock reject the strict field (BedrockException 'This model doesn't support the strict field'), and the GPT-OSS request-body test asserts strict/additionalProperties are stripped. Forwarding them to every model broke the llm_translation suite, so gate the forwarding on the anthropic base model since only Claude honours strict tool schemas on Bedrock.	2026-06-06 16:28:18 -07:00
milan-berri	273855b4e2	fix(responses-bridge): map system-only chat request to system input item (#29817 ) System-only chat requests mapped the system message to instructions and left input=[], which OpenAI's Responses API rejects (it also rejects input=""). When no other messages are present, carry the system message as a role:"system" input item (single copy, correct role) instead of leaving input empty. Mirrors the existing handling of non-string system content. Fixes Open WebUI new-conversation failures on mode:responses Codex models. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-06 16:11:54 -07:00
Yassin Kortam	68d67212cd	fix: 400 on Anthropic context overflow; seed identity on failed auth (#29848 )	2026-06-06 14:57:41 -07:00
yuneng-jiang	f1667b9137	chore(deps): bump deps (#29860 ) * bump: version 0.4.73 → 0.4.74 * bump: version 1.88.0 → 1.89.0 * uv lock	2026-06-06 21:44:54 +00:00
Mateo Wang	33c363d4d4	Extend the record/replay proxy to chat, embeddings, moderations, rerank, and Anthropic (#29847 ) * test(ci): extend record/replay proxy to chat, embeddings, moderations, rerank, anthropic The record/replay proxy that took the gpt-image-1 spend E2E off the live OpenAI path now fronts every provider, so the other real-provider E2Es stop paying for and depending on live calls each commit. It keys per upstream and selects a non-OpenAI provider by a /__recorder_upstream/<host>/ path prefix carried on the model's api_base, since some litellm handlers (cohere rerank) drop custom request headers. Wired into build_and_test (chat, embeddings, moderations, image), the otel job (cohere rerank), and the anthropic-messages job via a reusable start_openai_record_replay_proxy command. Dropped the time.time()/uuid prompt cache-busters in the build_and_test chat tests, whose config has the response cache off, so identical requests are recordable. The image spend test now asserts a repeat call still bills spend, failing loudly if the proxy response cache is ever turned on. Responses, the anthropic passthrough, bedrock, and fake-endpoint tests are left live: their lifecycles, api_base assertions, providers, or fake targets make a stateless body-keyed cache either break them or add nothing. * docs(ci): note the recorder command's OpenAI default upstream and prefix override Addresses a review note: the shared start_openai_record_replay_proxy command defaults the upstream to OpenAI, so a non-OpenAI model must carry the /__recorder_upstream/<host>/ prefix on its api_base. Document that in the command description so a future caller does not assume the default follows the provider.	2026-06-06 14:33:42 -07:00
Yassin Kortam	38b28b96ff	fix(terraform/gcp): abandon SQL user on destroy (#29855 ) google_sql_user.app issues DROP ROLE on destroy, which Postgres refuses because the role owns every table the migrations job created (75 objects). The previous deletion_policy=ABANDON on google_sql_database keeps the DB intact through destroy, so the role still owns its objects. Set the same policy on the user; the instance deletion takes both the database and the role with it anyway.	2026-06-06 13:42:35 -07:00
Yassin Kortam	43c10370ee	fix(terraform/gcp): prompt for image_registry in DeployStack one-click (#29852 ) * fix(terraform/gcp): prompt for image_registry in DeployStack one-click The four litellm-* images live on GHCR and Cloud Run rejects ghcr.io URIs at apply time, so every deploy has to point image_registry at an Artifact Registry remote repo. The DeployStack installer didn't surface image_registry as a prompt, so a click-through user landed on the ghcr.io/berriai default and the apply failed ~20 min in, after Cloud SQL had already provisioned. Add image_registry to custom_settings with a PROJECT_ID-placeholder default and a description that flags the ghcr.io rejection so the failure happens at the prompt, not after billing the slow path. TUTORIAL.md is reworded to tell the user what to enter at the new prompt instead of "edit terraform.tfvars before applying". * fix(terraform/gcp): generalize image_registry default to any region Per Greptile feedback on #29852, the prior default hardcoded us-central1 and would silently produce a Cloud Run-incompatible image path for any deployment in another region. The user would substitute PROJECT_ID, miss the region segment, and reproduce the original late-apply failure. Use REGION as a second placeholder and tighten the prompt copy so both substitutions are mandatory. * fix(terraform/gcp): make destroy work without manual intervention Three Cloud Run v2 services and the migrations Cloud Run v2 job all default to deletion_protection=true at the provider level, which has no data-safety value on stateless resources and blocks terraform destroy with an error that can only be unstuck with a tfvars edit + apply roundtrip. Wire deletion_protection=false directly on all four; the operator-facing tripwire that matters is cloudsql_deletion_protection, which guards the only resource that actually holds data. The litellm Cloud SQL database also drops cleanly only if every connection is closed first. Cloud Run services and the migrations job hold connections open until they're torn down, so destroy races and fails with "database is being accessed by other users". Setting deletion_policy=ABANDON on the database resource lets terraform skip the explicit drop; the Cloud SQL instance deletion takes the database with it anyway. Together these turn destroy into a single command, matching the AWS stack's behavior.	2026-06-06 20:21:06 +00:00
yuneng-jiang	1975b9691a	chore: update Next.js build artifacts (2026-06-06 20:08 UTC, node v20.20.2) (#29853 )	2026-06-06 13:17:59 -07:00
Yassin Kortam	1cff02f50e	refactor: convert AWS and GCP Terraform stacks into reusable modules … (#28103 ) * refactor: convert AWS and GCP Terraform stacks into reusable modules with examples/default entry point - Remove `provider` blocks from both AWS and GCP stack roots so the modules can be consumed with `count`, `for_each`, `depends_on`, assumed-role or aliased providers — patterns that are forbidden when a module owns its own provider configuration - Add `examples/default/` thin-root wrappers for both stacks that wire the provider (AWS) / providers (google + google-beta) and call the module with a curated variable surface, preserving the one-command deploy experience - Move `terraform.tfvars.example` files into `examples/default/` alongside the new roots; update example comments to reflect the curated variable surface - Thread `local.tags` (containing `litellm:stack`, `managed-by`, and `var.tags`) explicitly onto every taggable AWS resource since the module no longer controls the provider's `default_tags`; GCP resource labels already flow through the module's `labels` input - Add `examples/default/variables.tf` and `outputs.tf` for both stacks, exposing the most-used knobs and re-exporting all module outputs - Commit provider lock files for both examples so `terraform init` is reproducible without a network fetch - Update top-level and per-stack READMEs to document the module-first design, the `for_each` multi-tenant pattern, and the `examples/default/` quick-start path * docs(terraform): address review — state-migration guide, tag dedupe, for_each note - Add 'Migrating an existing deployment' section to AWS & GCP READMEs documenting the required terraform state mv step (resource addresses now gain a module.litellm. prefix under the examples/default root) - Remove redundant managed-by tag from the AWS example providers.tf; reserve default_tags there for org-wide tags only - Document the for_each single-provider limitation for GCP (no configuration_aliases) in the README and example main.tf Resolves LIT-3504 * docs(terraform/gcp): note expected SSL cert replacement in state-migration guide The managed SSL cert is named with a hash of lb_domains, so TLS-enabled stacks that migrated from the old un-hashed name will see one create_before_destroy cert replacement after terraform state mv — not a clean 'No changes'. Document that this single replacement is expected and safe. * docs(terraform): drop state-migration guides The AWS/GCP stacks have never been published, so there are no existing deployments to migrate from the old root-module layout. Remove the 'Migrating an existing deployment' sections from both READMEs. * docs(terraform): call out image-registry override required for GCP 1-click The GCP stack's default image_registry points at ghcr.io, which Cloud Run won't authenticate against, so any real deploy (HCP Terraform no-code or otherwise) must override it. Document that as a hard requirement on the GCP README rather than a side note, and add a top-level HCP Terraform 1-click section enumerating the required inputs per stack and the migration-task caveat for HCP-hosted runners. * feat(terraform/aws): mount proxy_config from S3 and wire OpenTelemetry v2 proxy_config Drop the inline LITELLM_PROXY_CONFIG_B64 env var. Upload the YAML to S3 at config/litellm-config.yaml; gateway and backend container entrypoints download it to /tmp/litellm-config.yaml via boto3 before exec'ing uvicorn. The S3 object etag is wired into the task definition so a config edit produces a new task-def revision and a rolling redeploy. The existing s3_access policy already grants the task role s3:GetObject on this bucket, so no IAM changes were needed for the mount itself. OpenTelemetry v2 New variables otel_endpoint, otel_exporter, otel_service_name, and otel_headers_secret_arn. Setting otel_endpoint to a non-empty value adds LITELLM_OTEL_V2=true plus OTEL_EXPORTER / OTEL_ENDPOINT / OTEL_SERVICE_NAME / OTEL_ENVIRONMENT_NAME to the shared env block; an optional Secrets Manager ARN backs OTEL_HEADERS for collectors that need an auth header. Execution role auto-gains GetSecretValue on that ARN. Empty endpoint = nothing added, so existing deployments are unchanged. * feat(terraform/gcp): add DeployStack one-click installer Wires up a Cloud Shell "Open in Cloud Shell" badge backed by the GoogleCloudPlatform DeployStack flow so examples/default can be installed from a click in the README without a local terraform setup. - examples/default/deploystack.json drives project/region collection plus prompts for tenant, env, image_tag, and allow_plaintext_lb. Complex inputs (proxy_config, _extra_secrets, lb_domains) and sensitive vars (litellm_master_key, litellm_license, ui_password) stay tfvars / env only so they never land in a committed file. - examples/default/TUTORIAL.md is a Cloud Shell walkthrough that enables required APIs, creates the GHCR-passthrough Artifact Registry repo, optionally exports the TF_VAR_ secrets, runs `deploystack install`, and shows how to fetch the master key plus migrate from plaintext LB to TLS. - Renames var.project to var.project_id across the module and the examples/default wrapper to match the variable DeployStack injects from `collect_project: true`. Breaking rename for anyone with a `project = ...` line in terraform.tfvars; the fix is one line. * feat(terraform/gcp): mount proxy_config from GCS and wire OpenTelemetry v2 proxy_config Drop the inline LITELLM_PROXY_CONFIG_B64 env var and the python-decode startup fragment. Upload the YAML to a dedicated GCS bucket as config.yaml, then mount it read-only into the gateway and backend at /etc/litellm via Cloud Run v2's gcsfuse volume. CONFIG_FILE_PATH points at the mount; an md5 of the YAML rides along as PROXY_CONFIG_HASH so a config-only edit forces a new Cloud Run revision (gcsfuse only surfaces new objects on container restart, so without the hash an updated proxy_config would sit in the bucket unread). The config bucket is separate from the data-plane bucket so the runtime SA can hold objectViewer here (read-only at runtime) while keeping objectAdmin on the data-plane bucket. Both bucket and IAM binding are gated on proxy_config != {}; an empty config skips bucket creation and mounts nothing. OpenTelemetry v2 LITELLM_OTEL_V2=true is now wired into shared_env_kv unconditionally so both the gateway and backend boot with the integration enabled. It's dormant until otel_endpoint is non-empty; setting it injects OTEL_EXPORTER / OTEL_ENDPOINT / OTEL_ENVIRONMENT_NAME plus a per-component OTEL_SERVICE_NAME (\${tenant}-litellm-\${env}-{gateway,backend}) so spans land tagged with the right hop. otel_headers_secret takes a Secret Manager resource ID for OTEL_HEADERS (collector auth); the runtime SA auto-gains roles/secretmanager.secretAccessor on it. otel_capture_message_content defaults to no_content matching the litellm default. Any OTEL_* key set in _extra_env wins over the defaults so Cloud Run doesn't reject the apply on the duplicate-env-name check. refactor(terraform): make AWS and GCP stacks behave identically Bring both modules to the same surface and the same runtime behavior so swapping clouds (or reading either README) is symmetric. Labels and tags. GCP previously stamped var.labels onto only the two GCS buckets, leaving Cloud Run, Cloud SQL, Memorystore, Secret Manager, and the LB resources unlabeled; the variable description claimed full coverage. Now the module computes local.labels (litellm-stack + managed-by + var.labels, mirroring AWS's local.tags) and threads it onto every label-supporting resource: Cloud Run services and the migrations job, Cloud SQL writer and reader (via user_labels), Memorystore, Secret Manager entries (master_key, license, ui_password, db_password), both GCS buckets, the global LB address, and the http/https forwarding rules. GCP keys use 'litellm-stack' instead of AWS's 'litellm:stack' because GCP label keys forbid colons; var.labels now defaults to {}. OpenTelemetry v2 is opt-in on both stacks. AWS already gated everything on otel_endpoint; GCP previously stamped LITELLM_OTEL_V2=true into shared_env unconditionally and only ungated the OTEL_* block. Both stacks now do the same thing: leave otel_endpoint empty and nothing OTel-related lands in the container env; set it and gateway and backend get LITELLM_OTEL_V2=true plus OTEL_EXPORTER, OTEL_ENDPOINT, OTEL_ENVIRONMENT_NAME, OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT, and a per-component OTEL_SERVICE_NAME (${tenant}-litellm-${env}-gateway or -backend) so spans land tagged with the right hop. AWS picks up the richer GCP surface: otel_environment_name (defaults to var.env), otel_capture_message_content (defaults to no_content), and _extra_env override filtering so a caller-set OTEL_ key wins over the default for that service (ECS allows duplicates, but the filter gives the same predictable last-wins shape Cloud Run enforces). var.otel_service_name on AWS is gone, replaced by the per-component naming. uvicorn workers. GCP gains gateway_num_workers, matching AWS; threads into the gateway args as --workers ${var.gateway_num_workers}. Docs reflect the parity: each README's OTel section, the GCP 'Using as a module' Labels paragraph, and a new feature-parity table in the top-level README that lays out the AWS/GCP input mapping side by side. * fix(terraform/aws): expose skip_final_snapshot through the default example The example wrapper already exposed `s3_force_destroy` so ephemeral / CI stacks could destroy the S3 bucket without manual cleanup, but the matching Aurora knob (`skip_final_snapshot`) was hidden behind the module surface. That meant a `terraform destroy` on a trial stack still produced a `<cluster>-final-<short-sha>` snapshot, with no opt-out short of editing the module call. Adds `var.skip_final_snapshot` to the example (default `false`, preserving the data-loss tripwire) and threads it through to the module input, mirroring the existing `s3_force_destroy` pattern. Documented alongside it in the tfvars example. Verified by deploying the example end-to-end against a clean AWS account (VPC + Aurora w/ IAM auth + Redis + ALB + 3 ECS services), confirming all services reach steady state and the data plane serves traffic, then running `terraform destroy` with `skip_final_snapshot = true` to a clean teardown (93 destroyed, no Aurora snapshot left behind, no leftover billable resources). --------- Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu> Co-authored-by: yassin-berriai <yassin.kortam@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>	2026-06-06 12:57:44 -07:00
Shivam Rawat	fdade8a84e	Title: fix(proxy): resolve vector store file list credentials from team deployments (#29739 ) * fix(proxy): resolve vector store file list credentials from team deployments GET /v1/vector_stores/{id}/files now uses the same router credential routing as POST, including JWT team model hints and wildcard model selectors, so list requests no longer call OpenAI with Bearer None. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(proxy): authorize model hints and fix credential routing for vector store file list Resolves three review findings on the vector store file list path. Authorize user-controlled model hints (?model= query param and the x-litellm-model header) against the key's and team's allowed models via can_key_call_model / _can_object_call_model before any deployment credentials are resolved, closing a model access bypass where a normal key could file-list using a restricted deployment's provider credentials. Run the managed vector store registry resolution before the model routing hint so the managed store sets the routing model first; the hint resolver then selects credentials matching that model instead of a team fallback deployment, avoiding a credential/model mismatch across deployments. Skip team-fallback deployments whose provider cannot be determined instead of treating them as OpenAI, so a deployment without an explicit custom_llm_provider or "openai/" prefix no longer has its credentials injected. * fix(proxy): enforce vector store file model auth Ensure vector store file listing routes authorize explicit and inferred model routing before resolving deployment credentials. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(proxy): type guard vector store model hints Keep vector store model hint authorization typed to string-only values so static checks pass. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-06 12:36:05 -07:00
Shivam Rawat	1fbb78d2a4	Title: Fix managed batch cancel credential resolution (#29734 ) * Fix managed batch cancel credential resolution Decode unified batch IDs before cancel routing and resolve litellm_credential_name to api_key in Router._acancel_batch so JWT team-scoped deployments cancel with the same credentials used at create time Co-authored-by: Cursor <cursoragent@cursor.com> * fix batch cancellation credential cleanup Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-06 12:35:18 -07:00
Mateo Wang	51769a8ede	feat(fal_ai): add Nano Banana / Gemini 2.5 Flash Image generation support (#29798 ) * feat(fal_ai): add Nano Banana / Gemini 2.5 Flash Image generation support Adds a FalAINanoBananaConfig for fal.ai's Nano Banana models, exposed under both fal-ai/nano-banana and fal-ai/gemini-25-flash-image (identical schema). This is the migration path for fal-ai/imagen4, which fal deprecates on 2026-06-30. The config derives the request endpoint from the model name so both aliases route correctly, maps OpenAI image params to the fal schema (n -> num_images, size -> nearest supported aspect_ratio, response_format ignored since the model returns URLs), and reuses the base fal response parser. Pricing is registered at 0.039 per image in the cost map and backup. * fix(fal_ai): tighten nano-banana routing and guard mapped params Match the specific gemini-25-flash-image / gemini-2.5-flash-image aliases instead of any model containing gemini so future fal.ai Gemini-branded models aren't silently misrouted to the nano-banana config. Guard the param mapping on the fal-side keys (num_images, aspect_ratio) so a pre-set mapped value is respected and an OpenAI key is never forwarded unmapped. * fix(fal_ai): drop non-existent gemini-2.5-flash-image routing alias fal.ai only serves the dotted-free fal-ai/gemini-25-flash-image and fal-ai/nano-banana endpoints. Routing the dotted gemini-2.5-flash-image alias built a https://fal.run/fal-ai/gemini-2.5-flash-image URL that fal.ai 404s and had no pricing entry, so spend tracking silently fell to zero. Match only the two real endpoint slugs.	2026-06-06 11:16:44 -07:00
tin-berri	21d2c3aa83	fix(ui): stop MCP playground tool calls from sending twice (#29821 )	2026-06-06 18:14:37 +00:00
Mateo Wang	b3297fc2ea	feat(proxy): hot-reload .env in dev when running with --reload (#29783 ) * feat(proxy): hot-reload .env in dev when running with --reload The --reload watcher already restarts the worker on .py and --config YAML edits, but .env was unwatched, so changing a key there did nothing until a manual restart. Add .env to the uvicorn reload_includes (and to the StatReload monkeypatch, which ignores reload_includes) so an edit triggers a worker restart. A reloaded worker is a fresh process that inherits the reloader's environment, so load_dotenv(override=False) would keep serving the stale inherited value for any key already in the environment. The CLI now exports LITELLM_DEV_ENV_HOT_RELOAD when --reload is set, and litellm/__init__.py reads it to load .env with override=True only on that dev path, leaving normal startup precedence untouched. feat(proxy): warn that --reload makes .env override shell env vars When --reload is active, worker processes re-read .env with override=True, so .env values win over shell-exported environment variables. Surface this dotenv precedence change with a startup warning so a developer who relies on a shell-exported override is not silently surprised. * fix(proxy): type reload helper paths as Optional[str] to satisfy mypy * fix(proxy): watch the cwd .env in both reload backends for parity WatchFiles only watches cwd (and the --config dir) for .env, while the StatReload fallback used find_dotenv(usecwd=True), which walks up to a parent-dir .env that WatchFiles never sees. Point StatReload at the same cwd .env so the two reload backends react to the same file.	2026-06-06 09:39:21 -07:00
Mateo Wang	aa7845dc5e	test(ci): make the image-gen record/replay proxy report cache mode and per-request HIT/MISS (#29802 ) The recorder could come up pointed at a missing or unreachable cassette redis and silently forward every request live; the health check still passed and the process logged nothing, so a CI run looked identical whether it replayed from the cassette or paid OpenAI for a fresh call every commit. There was no way to tell from the logs whether the 24h caching was actually happening. It now announces its mode at startup (REPLAY when the cassette redis is reachable, PASSTHROUGH when CASSETTE_REDIS_URL is unset, DEGRADED when it is set but the redis is unreachable) and logs a HIT/MISS line per request. _cache_set returns whether the write landed so a mid-run redis failure surfaces as a warning instead of masquerading as a successful record. Adds unit tests covering the three startup modes and the HIT/MISS/not-recorded request paths; both new behaviors were mutation-checked.	2026-06-06 09:36:06 -07:00
ryan-crabbe-berri	001bda37d9	refactor(ui): route query-building networking calls through apiClient (#29815 )	2026-06-06 09:18:44 -07:00
milan-berri	1f171ee018	fix(ui): require new expiration when regenerating an expired key (#29838 )	2026-06-06 09:18:19 -07:00
tin-berri	22186f457a	fix(ui): persist Tools-tab MCP OAuth token to DB (#29809 )	2026-06-05 22:29:56 -07:00
ryan-crabbe-berri	6955e6f2c2	refactor(ui): route behavior-preserving networking calls through apiClient (#29806 ) * refactor(ui): route callbacks/nudges calls through apiClient * refactor(ui): route alerting + key/user/team delete calls through apiClient * fix(ui): late-bind fetch in apiClient so global.fetch swaps take effect createApiClient captured fetch at construction time, so reassigning global.fetch (as tests do) had no effect and a real network call leaked. Resolve fetch per request instead; harmless in production where fetch is never swapped, and required for apiClient-based calls to be testable. * refactor(ui): route behavior-preserving networking calls through apiClient Collapse ~61 hand-rolled fetch() calls whose semantics already match the shared apiClient (auth header, JSON body, json-error + deriveErrorMessage + handleError) into apiClient.get/post/etc. Query-string builders and the divergent error-handling functions (no-check, custom messages, text-error) are intentionally left for a follow-up normalization pass, since converting them changes wire encoding or error behavior. Prunes the now-stale no-restricted-syntax suppressions for the removed fetch calls. * refactor(ui): convert remaining admin/guardrail GETs, guard late-bind fetch Routes adminspendByProvider, adminGlobalActivity, and the three guardrail submission calls (list/approve/reject) through apiClient so they match their already-converted siblings instead of staying on raw fetch. Adds a client.test.ts case that swaps globalThis.fetch after createApiClient() and asserts the swap takes effect, which fails on the pre-fix captured-fetch line and locks in the per-call resolution	2026-06-05 20:40:41 -07:00
Mateo Wang	4ec4ab99d0	feat(mcp): per-server env vars with global + per-user scopes (#28917 )	2026-06-05 20:15:11 -07:00
yuneng-jiang	53cf3d8416	fix(proxy): drop deleted team BYOK model name from team.models (#29820 ) Deleting a team-scoped BYOK model left its public name in team.models, so /models with a team key kept listing the now-deleted "ghost" model. delete_model stripped team.models using only litellm_modeltable alias lookups, but models added via /model/new with a team_id never create an alias row; their public name lives only in team.models and model_info.team_public_model_name, so it was never removed. The team cache was also left stale because the delete path skipped _refresh_cached_team. The cleanup now keys off team_public_model_name (falling back to alias keys), runs after the deployment row is deleted, and strips a public name only when no remaining team deployment still backs it, so a load-balanced replica is not revoked and concurrent deletes cannot leave a ghost. The updated team row is refreshed in cache so /models reflects the change immediately	2026-06-05 18:35:50 -07:00
ryan-crabbe-berri	e53bd7cbd1	feat(ui): generate dashboard API types from the proxy OpenAPI spec (#29816 ) * feat(ui): generate dashboard API types from the proxy OpenAPI spec Introduces the shared type foundation for the dashboard without touching any runtime code. The proxy's FastAPI app is the source of truth; app.openapi() emits the spec and openapi-typescript turns it into src/lib/http/schema.d.ts. Adds an npm run gen:api script (a Python spec dump piped into openapi-typescript) and a Check UI API Types Sync CI job that regenerates the file from the live spec and fails if it drifts, so the committed types can never silently fall out of step with the backend. The generated file is pinned to openapi-typescript 7.13.0 and excluded from prettier, eslint, and knip, and marked linguist-generated so it collapses in diffs. No openapi-fetch and no call-site changes yet; this only makes the types exist. * chore(ui): tidy gen-api-types script per review Write the spec dump inside a with-block and clean up the temp dir in a finally, so repeated local runs don't leave stray ~MB JSON files behind.	2026-06-05 17:20:01 -07:00
milan-berri	b7f47a3b52	fix(jwt): use resolved DB user_id for spend on legacy email match (#29217 ) * fix(jwt): attribute spend to resolved DB user_id on email/sso fuzzy match When user_id_upsert is enabled with JWT auth and a pre-migration user row exists whose user_email matches the JWT email but whose user_id is a UUID, get_user_object resolves the legacy row via fuzzy lookup, but the JWT-claim user_id (the email) still flowed into team-membership lookup, JWTAuthBuilderResult.user_id, UserAPIKeyAuth and the spend tables. Spend was orphaned under a phantom email id; /user/info and the Usage page showed $0 for the legacy user (GH #26789). Treat the resolved user_object as the source of truth: add _canonical_user_id_from_db, rebind inside get_objects, and return effective_user_id so auth_builder unpacks it without adding statements. Fixes #26789 Co-authored-by: Cursor <cursoragent@cursor.com> * fix(jwt): log user_id rebind at DEBUG to avoid email PII in INFO streams Greptile review on #29217: rebinding often logs JWT email claims at INFO. Co-authored-by: Cursor <cursoragent@cursor.com> * test(jwt): update passthrough allowlist mock for 5-tuple get_objects Staging #29256 added a test that still mocked get_objects with a 4-tuple; our PR expanded the return to 5 values (effective_user_id). Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-05 15:59:41 -07:00
Sameer Kankute	95e3d136e1	test(google): add google-genai SDK proxy integration tests (#29781 ) * test(google): add google-genai SDK proxy integration tests for Gemini and Vertex Pin google-genai in the CI dependency group and exercise streaming/non-streaming generate_content through the LiteLLM proxy in the existing unified_google_tests suite. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(test): address Greptile review for google-genai proxy SDK tests Restore GOOGLE_APPLICATION_CREDENTIALS after the module proxy fixture tears down, initialize temp-file tracking on the proxy SDK base class, and skip litellm reload for proxy_genai_sdk tests so the module-scoped proxy server stays consistent. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(test): only load Vertex credentials when keys exist for proxy SDK tests Avoid writing empty GOOGLE_APPLICATION_CREDENTIALS temp files so Vertex tests skip cleanly without credentials, use a session-scoped proxy fixture, and clean up per-test credential temp files. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(test): scope google-genai pin to unified_google_tests only Remove google-genai from the ci dependency group and pin it in tests/unified_google_tests/requirements.txt for local test installs. Co-authored-by: Cursor <cursoragent@cursor.com> * test(google): tie litellm reload skip to proxy fixture dependency Replace the name-based reload guard with a check on whether the test requests the google_genai_proxy_url fixture, so the skip stays correct if the proxy SDK tests are renamed. * fix(test): stop DatabaseURLSettings tests leaking DATABASE_URL into os.environ The autouse env scrubber relied on monkeypatch.delenv, but apply_to_env writes DATABASE_URL straight into os.environ, which monkeypatch never tracks and therefore never undoes. The synthesized writer.example.com URL leaked past the last test in this module and into proxy-infra tests that read DATABASE_URL to decide whether to hit a real database, e.g. test_deprecated_key_grace_period_cache_hit_path, turning an intended skip into a ConnectError. Snapshot and restore the managed vars directly so the original environment is reinstated regardless of how it was mutated. * test(google): drop redundant per-test vertex credential setup The session-scoped google_genai_proxy_url fixture already configures GOOGLE_APPLICATION_CREDENTIALS before the proxy starts, and _require_proxy_sdk skips when credentials are missing, so the per-test _setup_vertex_credentials_if_needed helper and its temp-file tracking never did any work. Remove it to keep the ABC self-contained. * test(google): declare model_config contract on proxy SDK ABC _skip_reason_if_credentials_missing reads self.model_config to pick the provider, but that property was only declared on the sibling BaseGoogleGenAITest. Make the dependency explicit by adding model_config as an abstract property on BaseGoogleGenAIProxySDKTest so the ABC is self-contained and a standalone subclass fails fast instead of hitting an AttributeError. * test(google): narrow streaming error catch to Exception Catching BaseException in the streaming assertion swallowed KeyboardInterrupt and SystemExit, turning a Ctrl-C into a test failure message instead of letting pytest interrupt cleanly. Only genuine runtime errors should be recorded as stream failures, so catch Exception. * test(google): initialize proxy on the same loop that serves it The proxy was initialized via asyncio.run() on the main thread, which creates and tears down a throwaway event loop, while requests were served on a separate loop in the worker thread. Any asyncio primitive bound to the init loop would be unusable once serving started. Run initialize() on the worker thread's loop right before server.serve() so setup and request handling share a single event loop. * test(google): drop redundant google-genai requirements pin google-genai>=1.37.0,<2.0 is already declared in the proxy-runtime extra, which the google_generate_content_endpoint_testing CI job installs via uv sync --all-extras. The standalone tests/unified_google_tests/requirements.txt duplicated that pin with a narrower ==1.37.0 specifier and was never installed by CI, so it added a second source of truth without changing what gets installed. Drop it and rely on the proxy-runtime extra. * chore: revert incidental uv.lock exclude-newer bump The google-genai ci pin was added and then dropped (it is already provided by the proxy-runtime group), but each uv lock recomputed the relative exclude-newer span, leaving only a timestamp bump in uv.lock. Restore it to the base value so this test-only PR carries no lockfile change. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com>	2026-06-05 21:05:32 +00:00
Sameer Kankute	d671a09c20	Litellm oss staging 050626 (#29774 ) * Mark xAI models retiring on 2026-05-15 (#28788) Per https://docs.x.ai/developers/migration/may-15-retirement, xAI is retiring the following slugs on 2026-05-15 (auto-redirect to grok-4.3 with various reasoning efforts; callers continuing to use the old slugs will be billed at grok-4.3 pricing): grok-4-1-fast-reasoning{,-latest} -> grok-4.3 (low effort) grok-4-1-fast-non-reasoning{,-latest} -> grok-4.3 (none) grok-4-fast-reasoning -> grok-4.3 (low effort) grok-4-fast-non-reasoning -> grok-4.3 (none) grok-4-0709 -> grok-4.3 (low effort) grok-code-fast-1{,-0825} -> grok-build-0.1 grok-3 -> grok-4.3 (none) Only the direct xai/ slugs are tagged; third-party hosts (azure_ai, oci, vercel_ai_gateway, perplexity/xai) run their own schedules. The grok-3 retirement list explicitly names only the base grok-3 slug — the -mini / -fast / -beta / -latest variants are not listed, so they remain untouched. * feat(moonshot): advertise json_schema response support on live models (#29683) litellm.responses() already routes Moonshot through the responses->chat-completions bridge, and Moonshot honors response_format json_schema on chat completions. The cost-map entries left supports_response_schema unset, so discovery layers that gate on that flag dropped Moonshot from structured-output / responses listings even though the capability works end to end. Set supports_response_schema on the nine models currently live on api.moonshot.ai: kimi-k2.5, kimi-k2.6, the moonshot-v1 8k/32k/128k text and vision-preview variants, and moonshot-v1-auto. Verified against the live API that each honors json_schema and that litellm.responses() returns schema-valid structured output through the bridge. * chore(moonshot): mark models retired from api.moonshot.ai as deprecated (#29685) Thirteen Moonshot/Kimi models in the cost map no longer resolve on api.moonshot.ai (all return 404). Stamp each with its deprecation_date from platform.kimi.ai/docs/models rather than deleting the entries, so historical cost calculation keeps resolving the names while tooling can surface the retirement. Dates: kimi-thinking-preview 2025-11-11; kimi-latest and its 8k/32k/128k context variants 2026-01-28; the kimi-k2 preview/turbo/thinking series 2026-05-25; the moonshot-v1 -0430 snapshots use their own 2024-04-30 snapshot date (Moonshot publishes no discontinuation date for them). * fix(moonshot): drop temperature for reasoning models (kimi-k2.5/k2.6) (#29687) Kimi reasoning models reject every temperature except 1; a request with temperature=0.2 returns "invalid temperature: only 1 is allowed for this model". litellm only clamped temperature into [0.3, 1], so any value below 1 still 400'd. Drop the temperature param entirely for reasoning models (gated on supports_reasoning, the same signal transform_request already uses) so the model default is used; the non-reasoning moonshot-v1 models keep the existing clamp. Co-authored-by: Sameer Kankute <sameer@berri.ai> * feat(mcp): add per-server timeout configuration (#29672) * feat(mcp): add per-server timeout configuration * fix(mcp): address timeout field review comments - use is not None guard instead of or for 0.0 edge case - copy timeout in both LiteLLM_MCPServerTable constructions (health check path + _build_mcp_server_table) - add timeout Float? column to all three schema.prisma files - extend round-trip test to cover _build_mcp_server_table direction - add test for zero timeout not treated as falsy * fix(mcp): forward timeout in _build_temporary_mcp_server_record * fix(mcp): return 504 instead of 500 when per-server timeout fires * test(mcp): add 504 timeout regression test; fix black formatting * Add jp. Bedrock cross-region inference profile for claude-opus-4-7 (#28567) * fix(thinking): handle None thinking param in is_thinking_enabled (#28598) Squash-merged by litellm-agent from Terrajlz's PR. * feat(helm): support tpl rendering in podAnnotations (#28609) Squash-merged by litellm-agent from devauxbr's PR. * Forward custom_llm_provider through the Responses API bridge (Fixes #28505) (#28575) * Forward custom_llm_provider through the Responses API bridge (Fixes #28505) When a Chat Completions request to a GPT-5.4+ model contains both `tools` and `reasoning_effort`, `completion()` auto-routes through `responses_api_bridge`. The bridge handler called `litellm.responses()` / `litellm.aresponses()` without forwarding the already-resolved `custom_llm_provider`, so the downstream call re-invoked `get_llm_provider()` with `custom_llm_provider=None` and stripped a second provider prefix from a `provider/provider/model` deployment string. For a deployment configured as `openai/openai/openai/gpt-5.5`, the bridge flow sent `openai/gpt-5.5` to the upstream API instead of the correct `openai/openai/gpt-5.5`. Upstream APIs that enforce model-name allow-lists rejected this as `key_model_access_denied`. Fix: pass the locally-resolved `custom_llm_provider` into both the sync `responses()` and async `aresponses()` calls so the downstream `_resolve_model_provider_for_responses` sees an explicit provider and skips the second prefix-strip. New regression test `tests/test_litellm/completion_extras/test_responses_bridge_provider_propagation.py` pins both call sites: each must forward `custom_llm_provider`. * fix(28505): set custom_llm_provider on request_data instead of as duplicate kwarg Greptile flagged that the previous patch passed custom_llm_provider as an explicit kwarg to responses()/aresponses() while request_data already carried it via the spread of sanitized_litellm_params, which would raise TypeError: got multiple values for keyword argument on every real bridge call. Switches to assigning request_data['custom_llm_provider'] before the call so the resolved provider wins over whatever sanitized_litellm_params spread in, without duplicating the kwarg. Updates the regression test to seed request_data with a sentinel custom_llm_provider so it actually exercises the overwrite path (the previous test mocked transform_request with a minimal dict and never hit the conflict). * chore: trigger shin-agent re-eval on retargeted staging base * chore: trigger shin-agent re-eval against updated Greptile state * Add jp. Bedrock cross-region inference profile for claude-opus-4-7 AWS Bedrock documents jp.anthropic.claude-opus-4-7 alongside the existing us./eu./au./global. profiles for Claude Opus 4.7 (ap-northeast-1 Tokyo / ap-northeast-3 Osaka), but the entry is missing from model_prices_and_context_window.json. Tokyo-region users currently get an "unknown model" error when routing through the JP geo profile. Adds the entry to both the canonical file and the bundled backup, mirroring the recent pattern for sonnet-4-6 (#27831). Pricing matches the other regional profiles (10% premium over base/global). Regression test pins all six documented profiles (base, global, us, eu, au, jp) and asserts pricing parity between jp. and au. variants. Source: https://docs.aws.amazon.com/bedrock/latest/userguide/model-card-anthropic-claude-opus-4-7.html --------- Co-authored-by: Terrajlz <info@jouleselectrictech.com> Co-authored-by: Bruno Devaux <devaux.br@gmail.com> Co-authored-by: Sameer Kankute <sameer@berri.ai> * feat(soniox): add soniox audio transcription integration (#29508) * feat(openmeter): add OPENMETER_TRUST_REQUEST_USER to prevent forged attribution (#29650) The OpenMeter callback resolves the CloudEvent subject from kwargs["user"] first, then falls back to the key-bound user_api_key_user_id. For multi-tenant proxy deployments, a client can set `"user": "..."` in the request body and cause their usage to be attributed to that arbitrary string — a billing-attribution forgery risk. Adds OPENMETER_TRUST_REQUEST_USER env var (default "true" for backward compatibility). When set to "false", the request-supplied `user` field is ignored and the subject is resolved solely from user_api_key_user_id. Matches the existing env-var-driven config pattern in this file (OPENMETER_API_KEY, OPENMETER_API_ENDPOINT, OPENMETER_EVENT_TYPE). * feat(search): add you_com as a search provider (#28370) * feat(search): add you_com as a search provider Registers You.com Search API as a first-class `search_provider` in the `search_tools` registry, alongside Tavily, Exa, Perplexity, etc. - New adapter: litellm/llms/you_com/search/transformation.py - POSTs to https://ydc-index.io/v1/search - Auth: X-API-Key from YOUCOM_API_KEY (or explicit api_key) - Maps Perplexity unified spec: max_results -> count, search_domain_filter -> include_domains, country -> country - Flattens results.web + results.news into a single SearchResult list; snippet prefers snippets[0], falls back to description; page_age -> date - Registry: SearchProviders.YOU_COM in litellm/types/utils.py and wired into ProviderConfigManager.get_provider_search_config() - Pricing entry: model_prices_and_context_window.json (placeholder $0.0; happy to adjust to maintainers' preferred public number) - Docs: example router config snippet and example proxy yaml updated - Tests: tests/search_tests/test_you_com_search.py - 5 mocked tests (payload shape, domain filter mapping, snippet fallback, news flattening, missing-api-key error) Refs upstream expansion signal: #15942 * review fixups: normalize api_base, lowercase country, scope env-var to test Addresses Greptile inline review comments on #28370: - get_complete_url: strip trailing slashes from api_base before the endswith("/v1/search") check, so a custom base like ".../v1/search/" doesn't become ".../v1/search/v1/search". - transform_search_request: .lower() country before sending, matching Tavily's convention so callers using the unified spec form ("US") get consistent behavior across providers. - Tests: replace direct os.environ writes with an autouse monkeypatch fixture so YOUCOM_API_KEY is set per-test and removed afterwards. The missing-key test now uses monkeypatch.delenv. New test asserts the trailing-slash normalization above. Reverts the ARCHITECTURE.md / example yaml edits per the reviewer note that documentation changes belong in the litellm-docs repo. * support keyless free tier (api.you.com/v1/agents/search) as default You.com offers an IP-throttled keyless endpoint that returns the same response shape as the keyed one (~100 queries/day, no signup). This is a significant onboarding lever - mirrors the keyless DuckDuckGo/SearXNG providers already in the search_tools registry. Behavior: - YOUCOM_API_KEY set -> keyed: POST https://ydc-index.io/v1/search (X-API-Key header) - no key -> free: POST https://api.you.com/v1/agents/search (no auth) - YOUCOM_API_BASE override -> honored as-is Tests: - New: test_you_com_search_keyless_free_tier - asserts URL + absence of X-API-Key when no key is configured. - New: test_you_com_search_validate_environment_keyless - asserts the config no longer raises when the key is absent. - Removed: test_you_com_search_raises_without_api_key (the precondition no longer holds). - Existing payload/domain-filter/etc tests still cover keyed mode via the autouse YOUCOM_API_KEY fixture. Verified both endpoints accept POST + return identical JSON shape: results.web[] / results.news[] with title, url, snippets, description, page_age. * register you_com in provider_endpoints_support.json Adding `litellm/llms/you_com/` requires a corresponding entry in provider_endpoints_support.json or the code-quality/check_provider_folders_documented CI check fails. Follows the compact tavily/serper pattern - endpoints: { search: true }. Local run of the check now reports "All 114 provider folders are documented". * move tests under tests/test_litellm/llms/ so CI exercises them The litellm CI workflows scope unit tests to `tests/test_litellm/...` (see test-unit-llm-providers.yml: `tests/test_litellm/llms` path), so tests living under `tests/search_tests/` are never run in CI - which is why codecov reports 0% patch coverage for the new adapter even though the unit tests exist and pass locally. Move test_you_com_search.py into `tests/test_litellm/llms/you_com/` so the test-unit-llm-providers job picks it up. 7/7 tests still pass at the new location. (Sibling search-only providers - tavily, exa_ai, brave, etc. - still live only in `tests/search_tests/` and would benefit from the same move, but that is out of scope for this PR.) * fix(you_com): pin Accept-Encoding: identity to dodge keyless gzip bug The keyless free-tier endpoint (api.you.com/v1/agents/search) advertises Content-Encoding: gzip but returns a body that httpx's decoder rejects with `zlib.error: Error -3 while decompressing data: incorrect header check`, surfacing as litellm.APIConnectionError in user code. curl works because it doesn't request compression by default. Pin Accept-Encoding: identity in validate_environment so the upstream server skips compression entirely. Harmless on the keyed endpoint (ydc-index.io/v1/search) which negotiates content-encoding correctly. The header uses setdefault so a caller-supplied Accept-Encoding still takes precedence. (Server-side bug has been flagged to the You.com team separately - once fixed there, this workaround can be removed.) New unit test: test_you_com_search_pins_identity_accept_encoding. --------- Co-authored-by: Sameer Kankute <sameer@berri.ai> * docs: fix README typo (#29419) Correct clear spelling mistakes in documentation without changing behavior. Confidence: high Scope-risk: narrow Tested: git diff --check; uvx codespell on changed files Not-tested: Full docs build not run; text-only changes * Fix(langfuse): pass httpx_client to Langfuse in langfuse_prompt_management to respect SSL_VERIFY (#29480) * fix(langfuse): pass ssl_verify to Langfuse httpx client * fix_langfuse_ * add unit tests * addressed comments --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * feat(models): add minimax/MiniMax-M3 to model cost map (#29412) Add MiniMax's new flagship MiniMax-M3 to the native minimax provider: 512K context, 128K max output, native multimodal (supports_vision), reasoning, prompt caching. Pricing (USD/M tokens): input 0.6 / output 2.4 / cache read 0.12. M3 has no active prompt-cache-write tier, so cache_creation_input_token_cost is omitted. Updated both the root model_prices_and_context_window.json (remote source) and the bundled litellm/model_prices_and_context_window_backup.json (local fallback), keeping them in sync. * fix(logging): handle ResponseCompletedEvent in anthropic_messages streaming spend log (#29394) * fix(logging): handle ResponseCompletedEvent in anthropic_messages streaming spend log * fix(logging): extend terminal event handling to ResponseIncompleteEvent and ResponseFailedEvent; fix return type annotation * feat(provider): Add Neosantara provider as OpenAI Compatible (#29646) * Add Neosantara provider * Register Neosantara provider enum * Address Neosantara provider review feedback * Add Neosantara packaged endpoint support --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * fix: address greptile and veria review feedback - langfuse: guard httpx_client injection behind version check (>= 2.7.3) - soniox: propagate audio_transcription_duration in _hidden_params for spend tracking - soniox: give SONIOX_API_BASE env var priority over caller-supplied api_base - mcp: replace CancelledError catch with asyncio.wait_for + TimeoutError * chore(mcp): add migration for per-server timeout column * fix(test): add tool_use_system_prompt_tokens to model prices schema validator * fix: mcp timeout test uses real asyncio.wait_for timeout; you_com get_complete_url respects resolved api_key * fix: forward resolved api_key into you_com endpoint selection and apply timeout to soniox polling GETs The search flow resolves api_key in validate_environment but never passed it into get_complete_url, so a programmatic api_key (with no YOUCOM_API_KEY in the env) set the X-API-Key header yet still selected the keyless free-tier endpoint. Forward api_key through both the search entrypoint and the http handler so the keyed endpoint is chosen. HTTPHandler.get/AsyncHTTPHandler.get had no timeout parameter, so the Soniox poll and transcript-fetch GETs silently used the client global default instead of the caller timeout. Add a per-request timeout to get() and forward the configured timeout from the Soniox handler. * fix(soniox): price stt-async-v4 per second so transcriptions are billed The handler stores audio_transcription_duration in _hidden_params, but the model carried only token cost fields and the response has no token usage, so the transcription cost path fell through to cost_per_second and returned $0. An authenticated caller could transcribe Soniox audio without decrementing their budget. Switch the entry to output_cost_per_second at Soniox's published $0.10/hour async rate so the stored duration produces a real charge. * fix(langfuse): use a dedicated httpx client for the SDK injection The httpx_client handed to the Langfuse SDK came from _get_httpx_client(), which returns LiteLLM's globally cached HTTPHandler. If Langfuse closed that client on teardown it would invalidate the shared client used by every other LiteLLM HTTP call. Build a dedicated httpx.Client instead, still resolving SSL verification and client certificate from LiteLLM's configuration. * fix(soniox): prefer caller-supplied api_base over SONIOX_API_BASE env var * fix(cohere): support max_completion_tokens on cohere v2 chat (default route) (#29779) * fix(cohere): support max_completion_tokens on cohere v2 chat The default cohere_chat route resolves to CohereV2ChatConfig, which did not list or map max_completion_tokens, so get_optional_params raised UnsupportedParamsError for the standard OpenAI parameter (the modern replacement for the deprecated max_tokens). The v1 config already maps it to cohere's max_tokens; mirror that in v2 and add v2 regression tests. * fix(cohere): make max_completion_tokens take precedence over max_tokens on v2 When both max_tokens and max_completion_tokens are supplied, prefer max_completion_tokens explicitly rather than relying on dict iteration order, and cover both orderings with a regression test. --------- Co-authored-by: Daniel Yudelevich <4537920+yudelevi@users.noreply.github.com> Co-authored-by: hectorc98 <hector.chamorroalvarez@adyen.com> Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com> Co-authored-by: Terrajlz <info@jouleselectrictech.com> Co-authored-by: Bruno Devaux <devaux.br@gmail.com> Co-authored-by: Dan Lemon <dan@danlemon.com> Co-authored-by: Saswat <saswatds@users.noreply.github.com> Co-authored-by: Brian Sparker <brainsparker@users.noreply.github.com> Co-authored-by: Zhao73 <156770117+Zhao73@users.noreply.github.com> Co-authored-by: Urain Ahmad Shah <60431964+urainshah@users.noreply.github.com> Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: kape <168134658+kapelame@users.noreply.github.com> Co-authored-by: danisalvaa <159898202+danisalvaa@users.noreply.github.com> Co-authored-by: Just R <remixingmagelang@gmail.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> Co-authored-by: abhay23-AI <abhaytrivedi22@gmail.com>	2026-06-05 13:51:51 -07:00
ryan-crabbe-berri	4a5644d51e	refactor(ui): centralize proxy base URL resolution into tested resolver (#29793 ) * refactor(ui): centralize proxy base URL resolution into tested resolver The API base URL join logic was hand-rolled inside networking.tsx and re-derived inline at hundreds of call sites, with no test coverage and a latent double-slash bug when the base carried a trailing slash. This pulls the join into a single pure resolveApiBase() with full unit coverage and routes the existing resolution through it, also de-duplicating the env precedence ladder that was copied in two places. * test(ui): assert root-path redirect joins prefix exactly once The existing toContain check accepts a doubled separator; tighten it to a strict prefix match plus a no-double-slash assertion so a regression in the resolveApiBase origin+SERVER_ROOT_PATH join is caught end-to-end.	2026-06-05 11:53:26 -07:00
tin-berri	a4f57032e0	fix(ui): route MCP playground auth by oauth2 mode instead of token_url (#29714 ) Interactive PKCE and OBO servers were mislabeled as M2M, so passthrough never showed the Authorize gate; classify by oauth2_flow + delegate_auth_to_upstream instead.	2026-06-05 10:51:46 -07:00
Mateo Wang	84247d954d	test(ci): record/replay OpenAI image gen so the spend E2E isn't outage-bound (#29787 ) * test(ci): record/replay OpenAI image gen so the spend E2E isn't outage-bound The dockerized spend test test_key_info_spend_values_image_generation curls the proxy for a gpt-image-1 image, which wildcard-routes to real api.openai.com on every commit; an OpenAI outage then reddens unrelated PRs and each run pays for an image. Add an in-repo record/replay reverse proxy (tests/_openai_record_replay_proxy.py) that sits between the proxy and OpenAI. The first run, and the first after the recording lapses, records live; subsequent runs replay from the shared Redis cassette store. The proxy keeps its real separate-process HTTP topology; only the image model's api_base is pointed at the recorder in CI via IMAGE_GEN_RECORDER_BASE_URL, which is unset elsewhere so it falls back to api.openai.com. Recordings lapse 24h after write and are never refreshed on read, matching the VCR persister contract, so provider drift is still caught. Replayed responses drop upstream framing/server headers (content-length, transfer-encoding, content-encoding, date, server) so the re-serving layer recomputes them, honoring the Bedrock content-length lesson. * test(ci): close recorder http client on app shutdown Add a Starlette lifespan that closes the self-created httpx.AsyncClient on teardown, and leave caller-injected clients untouched so reuse across create_app calls is not broken. Covers the unclosed-client ResourceWarning raised in review.	2026-06-05 10:27:23 -07:00
Mateo Wang	939cff0455	test(vcr): stop refreshing cassette TTL on read so cassettes lapse after 24h (#29784 ) The Redis cassette persister slid the 24h TTL forward on every successful read, so any cassette replayed at least once per day never expired. With CI running more than once a day that means a recorded response is replayed forever and the suite never re-hits the provider, so a changed request or response contract goes undetected indefinitely. Drop the refresh-on-read. The TTL now counts down from the last write, so a cassette lapses 24h after it was recorded and the next run past that point re-records live and catches provider drift. Per-commit runs in between still replay from cache; only the one boundary-crossing run goes live.	2026-06-05 10:22:41 -07:00
Sameer Kankute	074455c138	fix(auth): expand all-team-models sentinel in can_key_call_model for batch validation (#29746 ) * fix(auth): expand all-team-models sentinel in can_key_call_model Keys with models=["all-team-models"] were denied during batch JSONL model validation because can_key_call_model matched the literal string against the model name. Add _resolve_key_models_for_auth_check to expand the sentinel to team_models before the check, consistent with get_key_models in model_checks.py and the completion-route bypass. Co-authored-by: Cursor <cursoragent@cursor.com> * docs(auth): document empty team_models unrestricted access behavior; add regression test Adds a docstring note to _resolve_key_models_for_auth_check explaining that when team_models is empty, all-team-models resolves to [] which is treated as unrestricted access (consistent with get_key_models behavior on other auth paths). Adds a test to lock in this behavior. * fix(auth): deny all-team-models access when key has no team_id A key configured with models=["all-team-models"] but no team_id could previously resolve to an empty allowlist, which _check_model_access_helper treats as unrestricted access. Now the sentinel is only expanded when team_id is set; otherwise the unresolved sentinel stays in the model list and causes a deny (no real model name matches it). Same fix applied to get_key_models in model_checks.py for consistency across batch and non-batch auth paths. * style: black format model_checks.py * Fix batch all-team-models auth * style: black format batch_rate_limiter.py * fix(test): add tool_use_system_prompt_tokens to model prices schema validator * fix(batch): catch get_team_object errors to avoid 404 escaping batch auth * fix(batch): apply per-member model scope check after team auth in batch validation * Fail closed on batch team auth fetch errors * test(batch): cover team_object grant and member-scope denial in batch auth --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>	2026-06-05 09:04:45 -07:00
Sameer Kankute	89f177b7b6	fix(galileo): use ingest traces API and standard logging payload (#29651 ) * fix(galileo): use ingest traces API and standard logging payload Switch hosted Galileo logging to /ingest/traces with nested trace/span payloads, read metrics from standard_logging_object, and include cost and total tokens on trace metrics. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): route username/password auth to v2 traces ingest Hosted Galileo no longer serves /observe/ingest; JWT login should post the same trace payload to /v2/projects/{project_id}/traces. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): address Greptile review on logging and timestamps Use debug-level logs for per-request Galileo callback messages and fall back to start_time/end_time when standard_logging_object omits startTime/endTime. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(galileo): add Galileo to proxy UI callback configuration Expose Galileo in the admin callback selector and config APIs so credentials can be configured through the dashboard instead of YAML only. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): align response type logging with Langfuse Mirror Langfuse input/output handling for rerank, speech, transcription, realtime, pass-through, and other response types so Galileo ingest no longer skips supported call types. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): redact trace payload in debug logs and format with black Avoid logging prompts and model responses in flush debug output while keeping structural metadata for troubleshooting. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): stop logging full trace payload in debug output Log only flush URL and trace count so prompts and model responses are not written to application logs when debug logging is enabled. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix Galileo token totals and prompt messages --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-05 09:03:17 -07:00
Mateo Wang	ffd0e9fa7f	[internal copy of #27491 ] fix(realtime): Fix Realtime Audio Token Cost Tracking (#29722 ) * Normalize Realtime usage dict keys before ResponseAPIUsage transform * Test usage transform for Realtime versus tokens_details keys * Avoid usage_input dict in-place * Fix audio cost calculation * fix(responses): forward output audio_tokens into completion usage details Pass audio_tokens from output_tokens_details into CompletionTokensDetailsWrapper so cost can use output_cost_per_audio_token. Support dict output details like prompt path. Extend tests for Realtime and mixed completion audio. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix audio token usage formatting * style: Black-format Realtime usage and completion usage merge Resolve combine_usage_objects and responses/utils wrapping for CI black --check. Restore model_fields comments above completion_tokens_details merge loop. Co-authored-by: Cursor <cursoragent@cursor.com> * Add test to cover combined usage objects * Fix merge conflict with test cases Removed unnecessary import statement and cleaned up assertions in test. * fix(cost_calculator): remove dead None guard in completion_tokens_details combiner --------- Co-authored-by: Liam McDonald <lmcdonald@godaddy.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-05 18:53:17 +05:30
michelligabriele	3f79222350	fix(proxy): persist oauth2_flow on MCP server registration (#29690 )	2026-06-05 18:52:52 +05:30
Mateo Wang	1c741b91c0	fix(anthropic): route Claude Opus 4.8 through adaptive thinking (#29702 ) * fix(anthropic): route Claude Opus 4.8 through adaptive thinking Opus 4.8 uses the same adaptive thinking contract as 4.6/4.7 (thinking.type=adaptive plus output_config.effort), but _is_adaptive_thinking_model only recognized 4.6/4.7 by name and otherwise leaned on the supports_adaptive_thinking cost-map flag. The Bedrock, Vertex, and Azure 4.8 entries don't carry that flag, so a bedrock/us.anthropic.claude-opus-4-8 request fell back to the legacy thinking.type=enabled shape and Bedrock rejected it with "thinking.type.enabled is not supported for this model". Add _is_claude_4_8_model and wire it in next to the existing 4.6/4.7 matchers in the adaptive-thinking detection, the effort=max gate, and the supported-params check, so every provider path treats 4.8 as adaptive regardless of whether its cost-map entry advertises the flag. * refactor(anthropic): drive Opus 4.8 adaptive thinking from the cost map Replace the _is_claude_4_8_model name matcher with cost-map data. Add supports_adaptive_thinking to every Opus 4.8 provider variant (Bedrock regional/global, Vertex, Azure) in both the root and bundled cost maps, and move the prefix-resolving capability lookup (_supports_model_capability) down to AnthropicModelInfo so _is_adaptive_thinking_model reads the flag through the bedrock/invoke/, bedrock/, and vertex_ai/ prefixes. The 4.6/4.7 name checks stay as a fallback since their provider entries don't carry the flag yet. A pure data fix is not enough on its own: _supports_factory doesn't strip the us.anthropic./invoke/ prefixes, so bedrock/invoke/us.anthropic.claude-opus-4-8 would still miss the flag without the resolver change. Add a cost-map guardrail test asserting every claude-opus-4-8 variant carries the flag, so a future variant added without it fails CI instead of silently sending the legacy thinking.type=enabled shape that the provider rejects.	2026-06-05 16:19:01 +05:30
Mateo Wang	8259d6cd85	fix: small CLAUDE.md nit (#29749 )	2026-06-05 06:30:05 +00:00
Mateo Wang	778a7f752d	Support OAuth M2M for Databricks Apps A2A agents (#29586 ) * Add OAuth M2M support for A2A agents targeting Databricks Apps Databricks App endpoints reject static bearer tokens and require a short-lived OAuth token minted via the workspace OIDC token endpoint. A2A agents could previously only authenticate outbound with static_headers or client header passthrough, so Databricks App agents could not be registered. Agents configured with a databricks_oauth block in litellm_params now mint and cache a client_credentials token and attach it as the outbound Authorization header on both message/send and message/stream calls, overriding any statically configured Authorization. * Add tests covering Databricks App OAuth token error paths Cover the HTTP status error, transport error, non-object JSON body, and invalid expires_in fallback branches in the token cache so the failure handling is locked in by regression tests. * Harden Databricks App OAuth token cache Cap the cache TTL at the token's own lifetime so a token whose validity is shorter than the refresh buffer is never cached and served stale; include a digest of client_secret in the cache key so a rotated secret mints a fresh token instead of reusing the old one; and prune the per-key lock when its cached token is evicted so the lock map stays bounded by the live key set. * Clear per-key locks on Databricks OAuth cache flush * fix(a2a/databricks): mint OAuth token via Basic auth header, not unsupported auth= kwarg litellm's AsyncHTTPHandler.post (what get_async_httpx_client returns) has no auth parameter, so minting a Databricks App OAuth token raised "AsyncHTTPHandler.post() got an unexpected keyword argument 'auth'" before any network call ever left the proxy, breaking the feature end to end. The handler also calls raise_for_status() internally and re-raises a MaskedHTTPStatusError (a subclass of httpx.HTTPStatusError), so the explicit raise_for_status() after post() was dead code. Build the HTTP Basic Authorization header by hand and pass it via headers, which is what the Databricks workspace OIDC token endpoint documents for client authentication. The token-cache tests now model the real handler contract with create_autospec so the rejected auth= signature is enforced; the previous mocks accepted any kwargs and silently hid the bug. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * Prune Databricks OAuth lock on the short-lived-token path When expires_in is below the refresh buffer the token is intentionally not cached, so _remove_key never runs for that key and the per-key lock created by _get_lock leaked permanently. Drop the lock in that branch so _locks stays bounded by the live key set, and assert the cleanup in the short-lived-token test * Gate A2A Databricks OAuth on the databricks_oauth block at the call site Make the gating explicit where the header is applied so it is clear that only agents configured with a databricks_oauth block enter the OAuth path; every other agent is left untouched. Add a regression test asserting a non-Databricks agent never invokes the token resolver. --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>	2026-06-04 23:03:37 -07:00
Sameer Kankute	2b7c97bff6	fix(vertex/anthropic): handle namespace tools and strip client_metadata for codex compatibility (#29489 ) * fix(vertex/anthropic): handle namespace tools and strip client_metadata for codex compatibility * fix(anthropic): cast nested namespace tools to fix mypy error, skip nameless flat tools	2026-06-04 22:57:16 -07:00
Mateo Wang	df704d9016	fix(proxy/hooks): populate llm_provider on internal rate-limit errors (#27707 ) * feat(proxy/hooks): add ProxyHTTPRateLimitError + provider resolver Introduces a small helper layer used by every proxy-side rate-limit hook so that the 429 they raise carries a populated llm_provider / model — instead of an empty exception.llm_provider that downstream loggers (Prometheus failure metric, observability callbacks) read as 'no provider attribution'. ProxyHTTPRateLimitError inherits from both fastapi.HTTPException (so the proxy server still renders it as a 429) and litellm.exceptions.RateLimitError (so isinstance checks and PrometheusLogger._get_exception_class_name pick up llm_provider). We deliberately don't call RateLimitError.__init__ — it constructs an httpx.Response we don't need and would just add failure surface; attribute parity is what downstream consumers care about. resolve_llm_provider_for_rate_limit() wraps litellm.get_llm_provider defensively. Internal limiter hooks fire from async_pre_call_hook — well before get_llm_provider runs anywhere else in the request lifecycle — so we have to call it ourselves at raise time. If the model is missing or unparseable (alias, router-only model) we fall back to llm_provider='litellm_proxy' rather than letting a second exception leak out and break the request path. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(proxy/hooks): populate llm_provider on parallel-request 429s Both v1 and v3 parallel-request limiters fired bare HTTPException(429) from inside async_pre_call_hook. The downstream Prometheus failure metric reads exception.llm_provider via _get_exception_class_name — the empty value showed up as exception_class='HTTPException' and left model_id='None' on the time series. Threads requested_model through every raise site in: * parallel_request_limiter.py: - check_key_in_limits (the per-key/per-model/per-user/per-team/ per-customer over-limit path) - raise_rate_limit_error (zero-limit + global_max_parallel_requests paths) — now takes an optional requested_model kwarg * parallel_request_limiter_v3.py: - _handle_rate_limit_error (the OVER_LIMIT translator), called from both the should_rate_limit pre-check and the TPM reservation path Resolved via resolve_llm_provider_for_rate_limit so unknown / missing models silently fall back to llm_provider='litellm_proxy' instead of breaking the request path with a second exception. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(proxy/hooks): populate llm_provider on dynamic-rate-limit 429s Same plumbing change as the parallel limiters, applied to both dynamic_rate_limiter (v1) and dynamic_rate_limiter_v3: * v1: TPM-zero and RPM-zero paths in async_pre_call_hook now resolve data['model'] -> (model, llm_provider) once and pass it into both raises. * v3: All three raise sites in _check_rate_limits — the model_saturation_check enforced raise, the priority_model enforced raise, and the fail-closed unknown-descriptor branch — now attribute the 429 to the actual provider. Falls back to llm_provider='litellm_proxy' when the model can't be resolved. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(proxy/hooks): populate llm_provider on batch-rate-limit 429s batch_rate_limiter._raise_rate_limit_error now takes a requested_model kwarg threaded from data['model'] in _check_and_increment_batch_counters. The batch-creation 429 is what gets raised when the input file's tokens/requests count would push the per-key TPM/RPM window over its limit. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * fix(proxy/hooks): populate llm_provider on budget/iterations 429s Final batch of internal raise sites — the user/session-budget and max-iterations hooks. Same pattern: resolve data['model'] once at raise time, attach to ProxyHTTPRateLimitError so Prometheus and observability callbacks can attribute the 429. Hooks updated: * max_budget_limiter (per-user max_budget exceeded) * max_iterations_limiter (per-session agent iteration cap) * max_budget_per_session_limiter (per-session dollar cap) All three fall back to llm_provider='litellm_proxy' when data['model'] is missing or unparseable. Drops the now-unused HTTPException import from each module. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> * test(proxy/hooks): pin provider field on internal rate-limit 429s Regression coverage for the 'provider field missing' bug across every proxy-side rate-limit hook + the helper layer: * ProxyHTTPRateLimitError class shape (HTTPException + RateLimitError, dict-detail stringification, None-provider normalization). * resolve_llm_provider_for_rate_limit happy paths (gpt-4o-mini, anthropic/..., bedrock/...) plus all three fallback branches (None, '', unknown name) plus a 'get_llm_provider raises' case that asserts we swallow the secondary exception. * For each limiter (parallel v1/v3, dynamic v1/v3, batch, max_budget, max_iterations, max_budget_per_session): assert the raised exception is a RateLimitError carrying the resolved model + llm_provider, and a sibling test that asserts the fallback path returns 'litellm_proxy' without leaking a second exception. * Two PrometheusLogger._get_exception_class_name pins so the Prometheus failure metric label flips from 'HTTPException' to 'Openai.ProxyHTTPRateLimitError' (or 'Litellm_proxy.' on fallback) — that's what dashboards consume. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> perf(proxy/hooks): defer provider resolution to over-limit branches * fix: use error_message in raise_rate_limit_error to avoid literal 'None' in detail * Consolidate rate_limiter_utils imports in dynamic_rate_limiter * fix(proxy): set num_retries/max_retries on ProxyHTTPRateLimitError ProxyHTTPRateLimitError inherits from RateLimitError but did not call RateLimitError.__init__, so num_retries/max_retries were never set. When Starlette's HTTPException lacks __str__, MRO falls through to RateLimitError.__str__, which unconditionally reads these attributes and raises AttributeError during logging/traceback formatting. Initialize them to None defensively. * fix(mypy): silence base-class status_code conflict on ProxyHTTPRateLimitError HTTPException declares 'status_code: int' while openai.RateLimitError (via APIStatusError) declares 'status_code: Literal[429] = 429'. Mypy flags the multi-base override as [misc] in CI lint. The runtime semantics are fine (we set self.status_code in __init__), so silence the class-level annotation conflict with a targeted ignore. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com> --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>	2026-06-04 22:46:08 -07:00
Mateo Wang	812a2217ca	[internal copy of #29511 ] feat(guardrails): add sensitive data routing to on-premise models (#29531 ) * feat(guardrails): add sensitive data routing to on-premise models When a guardrail detects sensitive data, route to an on-premise model instead of blocking or redacting. All subsequent requests in that session continue routing to the same model (sticky routing). New config options for guardrails: - on_sensitive_data: 'block' (default) or 'route' - sensitive_data_route_to_model: target model for rerouting - sticky_session_routing: persist routing for session (default: true) New exception SensitiveDataRouteException triggers rerouting when raised by guardrails. The proxy catches it, stores the routing decision in cache, and modifies the request's model field. New hook _PROXY_SensitiveDataRoutingHandler checks incoming requests against cached routing decisions and applies sticky routing. https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK * fix: black formatting for custom_guardrail.py https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK * test: improve test coverage for sensitive data routing feature Add additional tests for: - Cache key format and TTL constants - Session ID extraction from multiple locations - Custom guardrail initialization with routing config - Exception string representation and custom messages - Redis cache paths including fallback behavior - Edge cases in pre-call hook https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK * fix: use correct GuardrailRaisedException parameters Replace invalid 'source' parameter with 'guardrail_name' to match the exception's actual signature. https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK * test: move sensitive data routing tests to hooks directory Move test file to align with source code structure. https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK * fix(guardrails): honor sticky_session_routing flag and scope session routing per API key Propagate sticky_session_routing through SensitiveDataRouteException so a guardrail configured with sticky_session_routing=False reroutes only the triggering request without persisting a session override. Scope the routing cache key to the requesting API key so sessions from different tenants cannot collide, and warn when sticky routing is requested but the hook is not registered. * refactor(guardrails): dedupe session-id extraction and drop redundant import Extract the shared session-id lookup into get_session_id_from_request_data so the sensitive-data routing hook and CustomGuardrail no longer keep two identical copies of the logic. Remove the redundant local import of GuardrailRaisedException in handle_sensitive_data_detection, and document that detection_info is surfaced in request metadata and logs so it must not carry raw sensitive values. * fix(guardrails): guard None user_api_key_dict in sensitive data route handler * fix(responses): send application/json Content-Type on responses DELETE OpenAI's responses DELETE endpoint now rejects requests that arrive without a Content-Type header, defaulting them to application/octet-stream and returning 'Unsupported content type: application/octet-stream'. The delete handler sent no body and therefore no Content-Type, so the request failed. Declare application/json on the delete request, matching the OpenAI SDK. * fix(guardrails): backfill in-memory cache after redis hit in sensitive data routing When _get_routed_model resolves a routing override from Redis it now also populates the local in-memory cache. Without the write-back, a non-writing instance that only ever reads from Redis would lose the sticky routing decision the moment Redis became unavailable, silently reverting sensitive sessions to the default model. * fix(guardrails): scope sticky sensitive-data routing to JWT principal Keyless auth (JWT and similar) has no api_key, so every such caller shared the "default" cache namespace. One authenticated user could reuse another user's session_id, trip the guardrail, and silently force the other user's subsequent requests onto the cached on-prem model for the TTL. Resolve the routing tenant from the api_key when present, otherwise from a stable principal built from the user/team/org identity, before reading or writing the session route. * fix(guardrails): require route target model when on_sensitive_data='route' * fix(guardrails): mark user_api_key_dict Optional in sensitive-data route handler * fix(guardrails): use remaining redis ttl for local backfill and str env default * fix(guardrails): graceful block when routing configured but no session_id handle_sensitive_data_detection promised to raise only SensitiveDataRouteException or GuardrailRaisedException, but when routing was configured and the request had no session_id it let a ValueError from raise_sensitive_data_route_exception propagate, surfacing as an HTTP 500 instead of a block. Fall back to a graceful block in that case so the documented contract holds. * fix(guardrails): run remaining guardrails after sensitive-data reroute Defer the SensitiveDataRouteException until every guardrail in the pre-call loop has run, so downstream security guardrails are no longer skipped when an earlier guardrail triggers routing. The first reroute wins and a later guardrail that blocks still propagates. Also normalize on_sensitive_data to lowercase like sibling on_* config fields so case-insensitive values are accepted. * fix(guardrails): classify sensitive-data reroute as guardrail intervention * fix(guardrails): record sensitive-data reroute as prometheus intervention not error * fix(guardrails): record service span for routing guardrail and move case-normalizer to base params Drop the early continue so a guardrail that signals sensitive-data routing still emits its PROXY_PRE_CALL service span like every other callback. Move the lowercase normalizer onto BaseLitellmParams so on_sensitive_data is normalized consistently when BaseLitellmParams is constructed directly, matching the cross-field route->model validator that already lives on the base.	2026-06-04 22:22:28 -07:00
yuneng-jiang	56aa55b991	fix(proxy): stop team BYOK model name corruption on model edit (#29731 ) * fix(proxy): stop team model name corruption on edit (#28382) (#29001) Team-scoped ("Team-BYOK") models store an internal routing key model_name_{team_id}_{uuid} in the model_name column and the user-facing name in model_info.team_public_model_name. The internal name leaked into /v1, /v2, and /model/info responses; the dashboard bound its edit form to it, so any non-rename save (e.g. a TPM tweak) PATCHed the internal name back. The update path then treated it as a rename, overwriting team_public_model_name and rewriting the team's models[] ACL with the mangled string -- breaking team key calls with team_model_access_denied. Two-layer fix: - Read path (root cause): add _translate_model_name_for_response and apply it in model_info_v2 and _get_proxy_model_info so /v1, /v2, and /model/info surface the public name for team-scoped rows. The DB column and router index keep the internal name as the routing key; this is a presentation-layer swap on a shallow copy (never mutates input). - Write path (defense in depth): harden _get_public_model_name so a value matching the internal shape, or a no-op against the current DB column, is never treated as a rename -- for both the top-level model_name and an explicit model_info.team_public_model_name. Tests: regression for the reported scenario, full branch coverage of _get_public_model_name, two internal-shape guard cases, an end-to-end PATCH through _update_team_model_in_db (asserts the team ACL is untouched), and four response-translation cases. 60 passed (model management), 181 passed (proxy server). * fix(ui): key Agent Builder agent selection on model_info.id (#29729) * fix(ui): key Agent Builder agent selection on model_info.id Once team-scoped BYOK models can share a public name (the backend now returns the public name on /model/info instead of the internal routing key), selecting agents by model_name collides. Key selection, create, update and delete on the stable model_info.id instead, falling back to model_name only for config-defined agents that have no id. * fix(ui): add name-match fallback to post-create agent selection If the just-created agent's id is not yet present in the re-fetched list, try matching by name before falling back to the first agent. Addresses greptile review on #29729. --------- Co-authored-by: tushar8408 <32977767+tushar8408@users.noreply.github.com>	2026-06-04 20:40:40 -07:00
ryan-crabbe-berri	f3811ce63b	refactor(ui): shared HTTP client + location-pinned fetch() lint rule (#29723 ) * refactor(ui): add shared HTTP client and pin raw fetch() to one file Introduce src/lib/http/client.ts, a single typed wrapper that owns the only fetch() in the dashboard. It centralizes the base URL, the auth header, error parsing (deriveErrorMessage), non-2xx -> thrown ApiError, and JSON parsing, and is framework-agnostic (no React) so it can run from client and, later, server components. The base URL, auth header name and the logout side effect are injected through createApiClient. networking.tsx builds one configured apiClient and the 29 functions whose boilerplate maps exactly to the client's default behavior (canonical deriveErrorMessage + handleError + res.json() template) now call it instead of hand-rolling fetch. Names, signatures, return types and error behavior are unchanged; this is a pure refactor that drops ~440 lines. The no-restricted-syntax fetch rule now points at the client and a files: ["src/lib/http/*"] override makes that the only place fetch() is allowed. Re-baselined eslint-suppressions.json: networking.tsx fetch suppressions drop 270 -> 241; no other rule's counts change. The remaining networking.tsx fetches and the ~61 scattered component/hook fetches diverge from the default client behavior (text() error bodies, no res.ok check, no handleError side effect) and stay grandfathered for a follow-up burndown. fix(ui): make the HTTP client tolerate non-JSON error bodies The non-2xx branch parsed the error body with response.json(), so a gateway returning HTML (502/503 from a reverse proxy) threw a SyntaxError before onError fired or ApiError was built, dropping the user-facing notification. This matched the old per-function behavior, but the client is now the single error path so it is the right place to harden. Read the body as text once, try JSON.parse for the existing deriveErrorMessage path, and fall back to the raw text (or the HTTP status) otherwise. The success path stays strict json() so return types are unchanged. * fix(ui): await the returned apiClient promise in 6 migrated functions The codemod rendered the `return response.json()` tail as `return apiClient.x()` without `await`. Inside the surrounding try/catch that returns an unawaited promise, so the catch never runs and its console.error log is dropped on failure; 4 of the 6 were `return await response.json()` originally, so this restores their exact behavior. Use `return await apiClient.x()` in all six. * refactor(ui): widen onError type and handle empty success bodies Address review notes on the shared client. Type onError as (message: string) => void \| Promise<void> so the fire-and-forget async contract (networking passes the async handleError) is explicit rather than silently discarded by void. On the success path, read the body as text and return undefined for an empty body (e.g. a 204 No Content) instead of throwing a SyntaxError, while still parsing non-empty bodies strictly so a malformed JSON response surfaces rather than being masked. Add tests for the 204 case.	2026-06-04 20:27:58 -07:00
Shivam Rawat	3bd89f209e	Litellm jwt mapping virtualkeys (#28510 ) * restore an explicit no-match policy * fix(jwt): fix AUTO_REGISTER sentinel bypass, race condition, and inline import comment - AUTO_REGISTER now evicts stale __NO_MAPPING__ sentinel instead of silently returning None when cached under a prior fallback_team_mapping config - Race condition in _auto_register_jwt_mapping: catch P2002 unique-constraint violation on concurrent creates, fetch the winning mapping, proceed cleanly - Added comment on inline generate_key_helper_fn import explaining the circular dependency (key_management_endpoints imports user_api_key_auth at line 51) - 3 new tests: stale sentinel eviction, race condition winner fallback, and the existing auto_register happy path Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(jwt): cache __NO_MAPPING__ sentinel before raising 403 in REJECT mode REJECT mode was raising HTTPException immediately on a DB miss without writing the __NO_MAPPING__ sentinel, causing every subsequent rejected request to re-query the DB. Write the sentinel first so repeated rejections are served from cache within virtual_key_mapping_cache_ttl. Adds test asserting DB is not hit on the second reject after a cache-warm miss. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(jwt): enforce no-match policy when prisma_client is None The early `if prisma_client is None: return None` guard ran before the no-match policy check, silently bypassing REJECT and AUTO_REGISTER — every JWT client fell through to team auth regardless of configuration. Fix: treat prisma_client=None as a definitive DB miss and fall through to the same policy block as a real miss. REJECT now raises 403, AUTO_REGISTER raises 500 with a clear message (can't create keys without a DB), FALLBACK_TEAM_MAPPING returns None unchanged. Adds three tests: REJECT/403 with no DB, FALLBACK returns None with no DB, AUTO_REGISTER/500 with no DB. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(jwt): consistent AUTO_REGISTER on cached sentinel; clean up race orphans Addresses Greptile review on PR #25570 cherry-pick. 1. Inconsistent AUTO_REGISTER when __NO_MAPPING__ sentinel is cached: The cached-sentinel branch silently returned None when prisma_client was None, while the fresh path raised HTTP 500 under the same config. Same request, different access-control outcome depending on cache state. Both paths now raise the same 500. 2. Orphaned virtual keys from race-condition losers: On unique-constraint conflict, generate_key_helper_fn had already persisted an unrestricted virtual key in LiteLLM_VerificationToken with the cleartext in request memory. Under sustained concurrency these accumulated indefinitely. The loser now deletes its orphan before falling back to the winner's mapping; failure to delete is logged but does not fail the request. Also corrects a latent FK bug surfaced while fixing #2: the mapping row was storing the plaintext key in LiteLLM_JWTKeyMapping.token, but that column FKs to the hashed LiteLLM_VerificationToken.token — now hashed at the call site. Tests: - updated test_auto_register_creates_key_and_mapping to assert the hashed token is stored, not the plaintext - updated test_auto_register_race_condition_unique_conflict to assert the orphan is deleted with the correct hashed token - added test_auto_register_raises_500_when_sentinel_cached_and_no_db - added test_auto_register_race_conflict_tolerates_delete_failure Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jwt): close REJECT bypass when JWT omits the configured claim field A JWT presented without the configured `virtual_key_claim_field` previously returned None at the `claim_value is None` guard before the `unregistered_jwt_client_behavior` check ran. A caller who knows the configured claim-field name could bypass REJECT by simply omitting that field and falling through to team-based JWT auth. Apply the no-match policy on a missing claim: - REJECT → 403 - AUTO_REGISTER → 403 (no stable identity to map; refuse rather than create a sentinel-keyed record) - FALLBACK_TEAM_MAPPING → return None (unchanged, backward-compatible) Adds three tests covering each branch of the missing-claim path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jwt): AUTO_REGISTER inherits team_id so keys are bounded by team limits Auto-registered virtual keys were created with no team, model, route, rate, or budget constraints — broader access than the standard team-based JWT auth path the same client would have taken. Under AUTO_REGISTER, resolve the team_id from the JWT (via the operator-configured team_id_jwt_field / team_id_default) and stamp it on the new key. Downstream auth then applies the team's budget/models/tpm/rpm/allowed_routes via the existing virtual-key flow. Policy when team_id_jwt_field is configured: - JWT carries team claim → stamp resolved team_id - JWT lacks claim + team_id_default set → stamp default - JWT lacks claim + no default → 403 (refuse to create an unbounded key) When neither team_id_jwt_field nor team_id_default is configured, the operator has explicitly opted out of team-based limits — the auto-created key has no team_id (matches what team-auth would do in the same config). Adds 4 tests covering each branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jwt): make AUTO_REGISTER functional in prod; raise on missing winner Two correctness fixes flagged by Greptile on the AUTO_REGISTER path: 1. generate_key_helper_fn was called without table_name="key". Without that, the helper falls into the user-upsert branch (table_name in (None, "user")) and tries to insert into LiteLLM_UserTable with user_id=None, which hits the NOT NULL @id constraint. AUTO_REGISTER would never have succeeded in production. Now passes table_name="key" explicitly, matching the /key/generate caller. 2. When the race loser refetches the winner's mapping and gets None (winner row concurrently deleted), the previous code returned None — and the caller in _resolve_jwt_to_virtual_key then fell through to less- restrictive team-based JWT auth, silently bypassing the configured AUTO_REGISTER policy. Now raises HTTP 503 so the caller retries against a stable state rather than getting unintended fallback access. Adds one test for the 503 winner-vanishes path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jwt): defer AUTO_REGISTER until JWT policy is enforced by auth_builder Closes the JWT policy bypass on the AUTO_REGISTER path flagged by veria-ai. Before: when unregistered_jwt_client_behavior=auto_register and the JWT's claim was unmapped, _resolve_jwt_to_virtual_key validated the JWT signature and then immediately created a virtual key + mapping. JWTAuthManager.auth_builder never ran for the first request (the new key short-circuited the team-auth path), and every subsequent request hit the cached mapping — so custom_validate, RBAC, scope_mappings, and user_allowed_email_domain were never enforced for auto-registered clients. After: _resolve_jwt_to_virtual_key returns a _PendingAutoRegister signal instead of creating the key. The caller in _user_api_key_auth_builder runs JWTAuthManager.auth_builder, then — only on a validated, policy-passing result — calls _auto_register_jwt_mapping with the team_id / user_id from that result. The created key inherits team + user limits from the validated identity, and future cache hits load that already-policy-checked key. Also drops the interim _resolve_inherited_team_id helper that pulled team_id from raw JWT claims — same bypass risk; team_id now comes exclusively from auth_builder. Tests: - Rewrote two existing tests to assert _resolve_jwt_to_virtual_key returns _PendingAutoRegister (no key created yet) for both the fresh-DB-miss and stale-sentinel branches - Added a contract test that _auto_register_jwt_mapping stamps the validated team_id/user_id onto generate_key_helper_fn - Removed four stale team-binding tests that exercised the prior raw-claim helper Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Update user_api_key_auth.py * fix(jwt): cache proxy-admin AUTO_REGISTER path to avoid repeated DB lookups Cache-miss regression introduced by the deferred-auto-register refactor: when a JWT under AUTO_REGISTER resolved to a proxy admin, the is_proxy_admin early-return in _user_api_key_auth_builder ran before the pending auto-register cache-write block. Result: no cache entry, so every subsequent proxy-admin request re-queried get_jwt_key_mapping_object indefinitely. Fix: write a __JWT_PROXY_ADMIN__ sentinel to user_api_key_cache before the early return when a pending auto-register existed. _resolve_jwt_to_virtual_key treats that sentinel as "skip mapping, fall through to auth_builder", so future requests from the same JWT identity hit the cache instead of the DB. auth_builder still runs full JWT policy on every request — only the mapping DB lookup is short-circuited. Adds one test asserting the sentinel cache-hit returns None without hitting prisma_client.db.litellm_jwtkeymapping.find_first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(proxy): stamp org context on JWT auto-registered keys AUTO_REGISTER keys were created with team_id and user_id only, so org budget checks were skipped after switching to the key-scoped path. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 19:00:36 -07:00
ryan-crabbe-berri	41e90a6ada	chore(ui): remove the bare-fetch lint rule (#29712 ) * fix(ui): only flag bare fetch() outside React Query queryFn/mutationFn The frontend lint rule banned every fetch() call by static AST name match, so a fetch wrapped in a React Query queryFn/mutationFn tripped it just like a loose fetch in a component. esquery (no-restricted-syntax) can't express "has ancestor", so this replaces that selector with a small custom rule (local/no-bare-fetch) that exempts a fetch lexically inside a queryFn or mutationFn and reports everything else. Re-baselined eslint-suppressions.json under the new rule id (same 44 files / 331 violations) so existing code keeps its grandfathered suppressions. Adds a RuleTester suite covering wrapped (valid) vs unwrapped, the standalone Api.ts function pattern, queryKey, and computed-key cases. chore(ui): remove the bare-fetch lint rule Drop the fetch lint gate (and its 331 grandfathered suppressions) ahead of the networking refactor. The plan is to centralize all fetching in a single shared http client and enforce that with a location-based rule, so keeping a fetch rule in place now would only block CI while functions are routed through the new client. Removing it unblocks that work; the location-based rule lands with the client in a follow-up.	2026-06-04 18:58:38 -07:00

1 2 3 4 5 ...

39598 Commits