Commit Graph

9956 Commits

Author SHA1 Message Date
yuneng-jiang
1bbaf1c39d
fix(guardrails): read CrowdStrike AIDR identity from both metadata bags (#29991)
Capture user_id and extra_info from metadata or litellm_metadata. The single-bag read dropped identity whenever a request carried a present litellm_metadata field (null or a user-supplied dict), since /chat/completions routes the authenticated identity into metadata while the guardrail read litellm_metadata first
2026-06-08 17:46:28 -07:00
milan-berri
411bd3da5b
feat(vantage): include organization metadata in FOCUS Tags export (#28184)
* feat(vantage): include organization metadata in FOCUS Tags export

Join LiteLLM_OrganizationTable when building Vantage/FOCUS export rows so
organization_id and organization_alias appear in Tags for org-level filtering.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(focus): include api_requests in organization Tags tests

FocusTransformer now requires api_requests after staging merge; add the
column to test fixtures so integrations CI can run the Tags assertions.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-09 02:59:21 +03:00
yuneng-jiang
c24a3603d9
fix(team-management): delete a team's BYOK models when the team is deleted (#29977)
A team's BYOK models (rows in LiteLLM_ProxyModelTable with model_info.team_id set)
were left orphaned when the team was deleted; they lingered in the database and kept
showing on the Models + Endpoints page. delete_team now removes them via a new
delete_team_models helper that deletes the rows in one transaction and syncs the
in-memory router only after that transaction commits, run before the team rows are
deleted so a mid-flight failure never leaves the team gone with its models orphaned
2026-06-08 16:55:35 -07:00
Sameer Kankute
dfd6cbc514
fix(vertex): propagate Vertex AI metadata in streaming success callbacks (#29899)
* fix(vertex): propagate Vertex AI metadata in streaming success callbacks

Streaming calls assembled via stream_chunk_builder were missing
vertex_ai_grounding_metadata and vertex_ai_url_context_metadata in
standard_logging_object.response. Merge metadata from chunks into the
assembled response and mirror non-streaming hidden_params on Gemini chunks.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(vertex): move streaming metadata merge into provider config hook

Address review feedback by delegating assembled-stream metadata propagation
to VertexGeminiConfig via BaseConfig.apply_assembled_streaming_response_metadata,
and only write chunk hidden_params when metadata is non-empty.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(redaction): scrub Vertex provider metadata when message logging is off

Clear vertex_ai_grounding_metadata and related fields from standard
logging responses and assembled streaming ModelResponse objects so
turn_off_message_logging cannot leak prompt-derived web search queries.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Use assembled model for streaming metadata hook

* Fix Vertex metadata redaction bypass in logging callbacks.

Scrub Vertex provider fields from litellm_params.metadata.hidden_params during perform_redaction so streaming success_handler merges do not leak prompt-derived metadata when message logging is disabled.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix Vertex streaming metadata from hidden params

* fix(vertex): mirror vertex_ai_safety_results on assembled streaming responses

The non-streaming transform_response stores safety data under
vertex_ai_safety_results, but the streaming path only wrote
vertex_ai_safety_ratings. Assembled streaming responses therefore never
carried vertex_ai_safety_results, so any consumer reading that field saw
a silent difference between streaming and non-streaming calls.

Set vertex_ai_safety_results alongside vertex_ai_safety_ratings in the
shared stream metadata setter and add it to the assembled metadata field
list so it propagates through stream_chunk_builder.

* fix(streaming): log provider streaming metadata hook failures instead of swallowing them

* refactor(vertex): share single Vertex metadata field tuple across redaction and streaming

* refactor(vertex): move Vertex metadata redaction helpers into llms/vertex_ai

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>
2026-06-08 16:14:30 -07:00
milan-berri
1c881eee5d
fix(fireworks): enable tool calling for glm-5p1 in model cost map (#29697)
glm-5p1 supports native tools on Fireworks; explicit false flags caused
drop_params to strip tools and tool_choice before the provider request.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-08 15:54:19 -07:00
milan-berri
9ccda11919
fix(team_endpoints): don't block /team/update on unchanged team budget (#29525)
On /team/update for a standalone (no-org) team, _check_user_team_limits()
compared the request max_budget against the caller's personal max_budget
whenever max_budget was present in the payload. A team admin whose personal
budget is lower than the team's budget could not edit any field (tpm_limit,
team name, etc.) because the UI re-sends the unchanged max_budget on every
update, tripping the personal-budget check.

Pass the team's current max_budget into _check_user_team_limits() and skip the
personal-budget comparison when the incoming value is unchanged or lower than
the team's current budget. Only genuine increases above the team's current
budget are still validated against the caller's personal limit, so no
over-relaxation. Proxy admins and the org-scoped path are unaffected.

Adds two regression tests for the standalone update path (unchanged budget +
tpm_limit change, and lowering the budget), both for a caller whose personal
budget is below the team budget.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-09 01:14:24 +03:00
milan-berri
a7ecf6b5b1
feat(jwt-auth): opt-in fallback to DB team on unresolved JWT claim (#28913)
* fix(jwt-auth): defer to single-team DB fallback on claim mismatch

Extends the single-team DB fallback introduced in #26418 to two more
cases where it previously could not run:

* `find_and_validate_specific_team_id`: when `team_id_jwt_field` is
  configured and a claim value is present in the token but the team
  does not exist in the LiteLLM DB (HTTPException 404 from
  `get_team_object`), return `(None, None)` instead of raising — the
  auth_builder fallback then attributes the request to the user's
  single DB team. Only HTTPException is caught; other errors (e.g.
  "No DB Connected") still propagate.

* `find_team_with_model_access`: when none of the `team_ids_jwt_field`
  groups resolve to a real LiteLLM team, return `(None, None)` instead
  of raising 403 so the same fallback path runs. If at least one group
  DID resolve to a team but none granted the requested model, the
  original 403 is preserved (legitimate access denial — not a claim
  mismatch). Tracked via the new `any_claim_team_resolved` flag.

The strict `is_required_team_id` raise and `enforce_team_based_model_access`
raise remain unchanged. Unit tests cover both new soft-fail paths and
guard each preserved path (strict required, enforce_team_based, the
preserved 403, and the non-HTTPException propagation).

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(jwt-auth): narrow HTTPException catch to 404 (greptile review)

Address Greptile review comments on #28913:

* `find_and_validate_specific_team_id`: re-raise HTTPException when
  `status_code != 404`, pinning the catch to the "team doesn't exist
  in db" path documented for `get_team_object`. A future change that
  introduces a different status code (e.g. 403 for a blocked team)
  will now propagate instead of silently falling through to the
  single-team DB fallback.

* Add `test_find_and_validate_specific_team_id_non_404_http_exception_propagates`
  parametrised over 400 / 403 / 500 to lock in the contract.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(jwt-auth): gate claim-mismatch fallback behind opt-in flag

The unresolved-team-claim fallback added in the previous commit
weakened the strict claim-based authorization contract by default —
an authenticated user whose JWT carries a stale or invalid team
claim could still consume their single DB team's models/quota via
the fallback.

Gate both soft-fail paths in `find_and_validate_specific_team_id`
and `find_team_with_model_access` behind a new opt-in flag
`team_claim_fallback` on `LiteLLM_JWTAuth` (default False).

Default-off preserves the pre-existing strict behavior. Operators
who intentionally treat IdP team claims as advisory (e.g. machine
tokens whose group claims live in a separate namespace from
LiteLLM team_ids) opt in via config.

Adds two regression tests guarding the default-off behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-09 01:09:03 +03:00
yuneng-jiang
69a7bdb247
fix(model-management): allow deleting a BYOK model after its team is deleted (#29875)
* fix(model-management): allow deleting a BYOK model after its team is deleted

A team BYOK model (model_info.team_id set) became undeletable once its team
was deleted: POST /model/delete ran can_user_make_model_call, which looked the
team up and raised 400 "Team id=... does not exist in db" before the delete
could run, so the model lingered on the Models + Endpoints page with no way to
remove it.

Drop the team-existence prerequisite from the delete path. When the model's
team still exists the normal auth check runs unchanged; when it is gone a proxy
admin may delete the orphan and any other caller gets a 403. The check is
fail-closed, so a missing or errored team lookup can only block the delete or
require an admin, never grant a non-admin access. Add/update/health keep their
team-existence validation.

* refactor(model-management): drop redundant team lookup on model delete

Move the orphaned-team handling into can_user_make_model_call behind an
allow_missing_team flag instead of pre-checking team existence in delete_model.
The endpoint no longer issues its own litellm_teamtable lookup, so deleting a
model whose team still exists hits the team table once instead of twice. The
auth behavior is unchanged: a proxy admin can delete a model whose team was
deleted, any other caller gets a 403, and add/update/health keep the strict
"team must exist" validation.
2026-06-08 14:28:39 -07:00
Sameer Kankute
dfb68a23de
feat(galileo): add health check support for UI callback test (#29908)
* feat(galileo): add health check support for UI callback test

Register galileo in /health/services so the proxy UI callback connection test works.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(galileo): verify API key via /current_user health check

Call Galileo's current_user endpoint so the UI callback test validates credentials against the provider.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(ui): regenerate schema.d.ts for galileo health service

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(galileo): return IntegrationHealthCheckStatus from async_health_check

Fixes mypy assignment error in health_services_endpoint where response was
narrowed to IntegrationHealthCheckStatus from earlier branches.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix Galileo logging to match Langfuse across all endpoint types.

Stop skipping ingest when output is empty and log embeddings with a placeholder so embedding, speech, and other non-text responses are recorded like Langfuse.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(galileo): remove unreachable health-check guard and None output sentinel

The use_v2_api flag is derived from bool(api_key), so the inner
GALILEO_API_KEY check inside the v2 branch could never run; collapse the
credential validation into the username/password path with a combined
message. _serialize_galileo_output now returns an empty string for None,
so _get_galileo_input_output_content always yields a str and the
post-call None coalescing guard is no longer needed.

* test(galileo): cover async_health_check failure paths and empty model response

Add regression tests for the Galileo health check unhealthy branches
(missing project id, missing base url, missing credentials, auth
failure, and request exception) and for logging a model response with
no choices, which now queues an empty output instead of being skipped.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>
2026-06-08 13:57:03 -07:00
Sameer Kankute
32c88ca74f
Litellm oss staging 080626 (#29932)
* feat(bedrock_mantle): add SigV4/IAM auth to Responses API route (fixes #29665) (#29788)

* feat(responses): add default no-op sign_request to BaseResponsesAPIConfig

* feat(responses): call sign_request after body is final, send signed bytes when signed

* feat(bedrock_mantle): add SigV4 sign_request via composed BaseAWSLLM (bearer path)

* test(bedrock_mantle): cover SigV4 access-key, AssumeRole, body bytes, region/auth consistency

* feat(bedrock_mantle): defer auth to sign_request; validate_environment no longer requires bearer

* docs(bedrock_mantle): document SigV4 + Bearer auth on Responses route

* test(responses): cover fake-stream signing order and mantle bearer arg/env precedence

* fix(bedrock_mantle): wrap all botocore credential errors with both-paths guidance

* fix(bedrock_mantle): catch specific credential errors, not all BotoCoreError, so STS transport failures are not masked

* fix(bedrock_mantle): sign the compact Responses route too, not just create

* fix(github-copilot): route per-model on /v1/responses based on model info (#29747)

* feat(focus): add GCS destination for FOCUS export (#29751)

* test: add failing tests for FocusGCSDestination

* feat: add FocusGCSDestination reusing GCSBucketBase auth

* feat: register FocusGCSDestination in factory; export from __init__

* fix(focus): preserve GCS_PATH_SERVICE_ACCOUNT when service_account_json not in config

* style: apply Black formatting to gcs_destination and tests

* style: apply Black formatting to factory.py

* fix(bedrock): omit empty additionalModelRequestFields and system from Converse API payload (#29565)

Amazon Nova Pro (and other strict Bedrock models) return 400 Malformed input
request when additionalModelRequestFields: {} or system: [] are present in the
payload. Both fields are optional in CommonRequestObject (total=False) and must
be omitted rather than sent as empty structures.

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(proxy): recognize *.cognitiveservices.azure.com as OpenAI-compatible in pass-through cost tracking (#29730)

* fix(proxy): recognize *.cognitiveservices.azure.com as OpenAI-compatible

Azure OpenAI resources created via the newer "Azure AI Foundry" /
Cognitive Services pathway live on `*.cognitiveservices.azure.com`
subdomains, not the older `openai.azure.com`. Both are valid Azure
OpenAI surfaces in production today.

The OpenAI pass-through cost-tracking handler hard-codes only the older
hostname in five places (four `is_openai_*_route` methods on
OpenAIPassthroughLoggingHandler, plus is_openai_route on
PassThroughEndpointLogging). As a result, calls from newer Azure
deployments are silently classified as "not an OpenAI route", the
dispatch into the cost-tracking handler is skipped, and tokens/cost
never get extracted into LiteLLM_SpendLogs — the row gets written with
prompt_tokens=0, completion_tokens=0, spend=0, model='unknown'.

Reproduced 2026-06-04 against a real Azure OpenAI deployment on
`*.cognitiveservices.azure.com` proxied through LiteLLM v1.88.0.

Fix: factor the hostname check into a single helper
`_is_openai_compatible_host` listing all three recognized surfaces
(api.openai.com, openai.azure.com, cognitiveservices.azure.com), and
have all five call sites delegate to it. Purely additive — never
weakens recognition for the originally-supported hostnames.

Adds a test
`test_is_openai_route_recognizes_cognitiveservices_azure_com` that
exercises all four `is_openai_*_route` static methods against
`*.cognitiveservices.azure.com` URLs (positive cases per route + a
small cross-route negative to confirm route-specific path matching
still works on the new hostname).

Out of scope for this PR (separate followup):
  - `openai_passthrough_handler` calls chat/completions
    `transform_response` on Responses API payloads (`output:` not
    `choices:`), which throws inside the dispatch and drops the
    SpendLogs row entirely. Recognized + tracked separately.

* ci: trigger fresh run

Empty commit to re-run checks. The previous auth-and-jwt failure was
a transient HuggingFace Hub 429 rate-limit hitting tokenizer downloads
in tests/proxy_unit_tests/test_custom_tokenizer_bug.py — unrelated to
this PR's scope (hostname recognition in pass-through cost tracking).
No code change.

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>

* fix(responses): preserve forced-function tool_choice name in Responses to Chat transform (#29812)

The Responses API forces a specific function with a top-level name
({"type": "function", "name": "X"}), but _transform_tool_choice only handled the
nested Chat Completions shape and fell through to returning "required" for the flat
form, silently dropping the function name and degrading a forced function call to
force-any-tool. Map the flat Responses shape to the nested Chat shape, keeping the
"required" fallback when no name is present.

* Preserve x-anthropic-billing-header system blocks for first-party Anthropic (#29584)

* Preserve x-anthropic-billing-header system blocks for first-party Anthropic

PR #20951 strips system blocks beginning with "x-anthropic-billing-header:" for
every Anthropic target. That block is how the first-party Anthropic API recognizes
Claude Code subscription (OAuth) traffic, so dropping it makes requests that carry
only that block, such as the auto-mode tool-safety classifier, fail with a
misleading 429 rate_limit_error; normal turns still work because they also carry
the "You are Claude Code" identity block.

Gate the strip behind should_strip_billing_metadata(), defaulting to False on the
first-party AnthropicConfig and AnthropicMessagesConfig so the block is kept, and
overridden to True on the providers that reach these transforms and reject the
block (Bedrock platform, Vertex, Azure for the chat path; Minimax, Azure, DeepSeek
for the messages path). Behavior for those providers is unchanged.

* Strip billing header on Bedrock invoke and Vertex messages pass-through

Two more subclasses reach the gated strip but inherited keep-by-default.
AmazonAnthropicClaudeConfig (Bedrock invoke) calls AnthropicConfig.transform_request,
which calls translate_system_message, and VertexAIPartnerModelsAnthropicMessagesConfig
(Vertex messages pass-through) calls super().transform_anthropic_messages_request.
Override should_strip_billing_metadata() to True on both.

Add a parametrized test asserting the flag for every first-party base (False) and
provider subclass (True), covering all overrides, plus a translate_system_message
regression test for the Bedrock invoke path.

* fix(cache): log hashed cache keys (#29890)

* fix(ui): save routing groups as list (#29889)

* Revert "fix(ui): save routing groups as list (#29889)" (#29928)

This reverts commit 9b1f78ffa7a309cabe5e9a7ab5f94d1224d192c9.

* feat(parasail): add Parasail as a JSON-configured OpenAI-compatible provider (#29842)

* feat(parasail): add Parasail as a JSON-configured OpenAI-compatible provider

Registers parasail in the openai_like JSON provider loader with both
/v1/chat/completions and /v1/responses support. Parasail's Responses API
rejects store:true and any request that omits store, so the loader gains a
force_store_false special_handling flag; the parasail entry sets it and
the generated Responses config overrides store=false on every call. This
keeps callers from hitting "State storage not supported" and matches what
Parasail's docs require.

Adds the PARASAIL enum value, listing under openai_compatible_providers,
provider documentation at docs/my-website/docs/providers/parasail.md, and
a focused unit test file under tests/test_litellm/llms/parasail/ that
covers JSON registration, chat URL construction, Responses URL
construction with PARASAIL_API_BASE override, and the force_store_false
regression in both the caller-sent-store=true and caller-omitted cases.

* fix(parasail): register in provider_endpoints_support, drop in-repo docs

Greptile review feedback. The provider doc belongs in the litellm-docs
repo, not this one's docs/my-website tree; removing it here. Adds the
parasail entry to provider_endpoints_support.json so the
check_provider_folders_documented.py CI check passes (chat_completions
and responses true; others false).

* fix: normalize Anthropic passthrough server tool usage (#29827)

* test(anthropic): cover server_tool_use dict cost tracking

* fix: normalize Anthropic server tool usage

(cherry picked from commit 982f726bed7d3ec05e463c5dd3d090bebae91d19)

* fix: keep server tool usage subscriptable

(cherry picked from commit 70280b9b272455b2f974d08bc697f67f929755bf)

---------

Co-authored-by: Genmin <joey@joeyroth.com>

* fix(proxy): fix typo generic_role_mappoings -> generic_role_mappings in ui_sso.py (#29753)

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>

* feat(proxy): add disable_budget_reservation general setting (#27639) (#29493)

* feat(proxy): add disable_budget_reservation general setting (#27639)

* feat(proxy): register disable_budget_reservation in ConfigGeneralSettings (#27639)

* docs(proxy): document disable_budget_reservation concurrency tradeoff (#27639)

* ci: re-trigger flaky docker build (prisma generate ECONNRESET)

* fix(proxy): warn and document budget enforcement tradeoff when disable_budget_reservation is set (#27639)

* feat(gemini_tts): adding support to Gemini TTS languageCode parameters (#29623)

* Adding support to Gemini TTS Language Code parameters

* Mapping Gemini TTS languageCode param in Docstring

* Use snake_case for language_code input keyMapping Gemini TTS languageCode param in Docstring

* Restoring files modified under enterprise/litellm_enterprise due to lint/formatting checks

---------

Co-authored-by: João Garrido <joaogarrido@google.com>

* feat(guardrails): capture user and model metadata in CrowdStrike AIDR (#29517)

* fix(proxy): require OpenAI path segment for shared Azure Cognitive Services domains

Address Greptile review: the `*.cognitiveservices.azure.com` /
`*.openai.azure.com` domains are shared by every Azure Cognitive Service
(Speech, Vision, Language, ...), so a hostname-only substring match
misclassified non-OpenAI Azure traffic as OpenAI routes.

- Replace the substring host test with suffix matching (rejects look-alike
  domains like cognitiveservices.azure.com.attacker.example).
- Add `_is_openai_compatible_url` that requires an OpenAI-style path marker
  (`/openai/` or `/v1/`) on the shared Azure domains, and use it in
  PassThroughEndpointLogging.is_openai_route (previously hostname-only).
- Add negative tests for Azure Speech/Vision paths and look-alike domains.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix: support Responses input in Redis semantic cache (#29581)

* fix: support responses input in redis semantic cache

* test: cover redis semantic prompt extraction

* test: handle blank redis semantic text fallbacks

* chore: remove async cache dead statement

* test: cover redis semantic cache miss paths

* fix: filter sensitive cache lookup kwargs

* chore: rerun ci after huggingface rate limit

* chore(ui): regenerate dashboard API types (npm run gen:api)

Sync src/lib/http/schema.d.ts with the proxy OpenAPI spec: adds the
disable_budget_reservation general-settings field and picks up the
RateLimitError docstring reindent. Fixes the gen:api CI drift check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(bedrock): assert empty additionalModelRequestFields is omitted

The Converse transformer now drops an empty additionalModelRequestFields
block instead of sending it as `{}`. Update test_bedrock_top_k_param so
models without top_k support (llama3) assert the key is absent rather than
equal to an empty dict.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Kent <72616338+kingdoooo@users.noreply.github.com>
Co-authored-by: codgician <15964984+codgician@users.noreply.github.com>
Co-authored-by: Praveen Ghuge <95286176+pghuge-cloudwiz@users.noreply.github.com>
Co-authored-by: Roi <roytev@gmail.com>
Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Liam Scott <liam@uilliam.com>
Co-authored-by: abhay23-AI <abhaytrivedi22@gmail.com>
Co-authored-by: Ceder Dens <cederdens@gmail.com>
Co-authored-by: 冯基魁 <56265583+fengjikui@users.noreply.github.com>
Co-authored-by: Kai Huang <kaihuang724@gmail.com>
Co-authored-by: rinto <54238243+ririnto@users.noreply.github.com>
Co-authored-by: Genmin <joey@joeyroth.com>
Co-authored-by: Arnav Bhilwariya <arnavbhilwariya0408@gmail.com>
Co-authored-by: Armaan Sandhu <74664101+Ar-maan05@users.noreply.github.com>
Co-authored-by: João Garrido <48538534+johngarrido@users.noreply.github.com>
Co-authored-by: João Garrido <joaogarrido@google.com>
Co-authored-by: Kenan Yildirim <kenan@kenany.me>
Co-authored-by: Dávid Balatoni <balcsida@gmail.com>
2026-06-08 13:49:52 -07:00
Sameer Kankute
f5b11b72a6
feat(proxy): publish /v2/model/info in Swagger OpenAPI spec (#29900)
* feat(proxy): publish /v2/model/info in Swagger OpenAPI spec

Expose the v2 model info endpoint in /docs by removing include_in_schema=False
and documenting query parameters used by the admin UI and proxy CLI consumers.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(ui): regenerate schema.d.ts for /v2/model/info OpenAPI docs

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-08 09:33:35 -07:00
Yassin Kortam
5e2db7eee4
feat(litellm): add models and repository layers (#29686) 2026-06-06 20:59:33 -07:00
Mateo Wang
13924fa1d6
feat: standardize rate limit errors with category, rate_limit_type, model, and llm_provider fields (#27687)
* feat(exceptions): add RateLimitErrorCategory + headers/detail fields on RateLimitError

LiteLLM previously surfaced rate-limit conditions through several unrelated
error classes (RateLimitError, FastAPI HTTPException(429), BaseLLMException).
This commit adds the data model needed to consolidate them under a single
class:

* RateLimitErrorCategory enum exposing four categorical values
  (vendor_rate_limit, vendor_batch_rate_limit, litellm_rate_limit,
  litellm_batch_rate_limit) so callers can switch on the rate-limit source.
* New optional fields on RateLimitError:
  - category (defaults to vendor_rate_limit, preserving today's behavior for
    every existing call site in exception_mapping_utils);
  - headers (preserves retry-after / rate_limit_type / reset_at across the
    proxy boundary instead of dropping them on the floor);
  - detail (mirrors FastAPI HTTPException.detail so the same instance can be
    serialized through both paths).

litellm.RateLimitErrorCategory is re-exported at the package root to match
the existing exception-export pattern.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(proxy): add ProxyRateLimitError unifying RateLimitError + HTTPException

Adds a single proxy-side error class that subclasses BOTH
litellm.exceptions.RateLimitError AND fastapi.HTTPException via cooperative
multiple inheritance.

Why both bases:
* Subclassing RateLimitError lets user code catch every rate-limit source
  with one 'except RateLimitError' and switch on the new .category field.
* Subclassing HTTPException keeps every existing FastAPI plumbing path (the
  isinstance(e, HTTPException) branches in proxy_server.py route handlers,
  FastAPI's own dispatcher, and tests asserting pytest.raises(HTTPException))
  working without modification, and preserves retry-after / rate_limit_type /
  reset_at headers on the wire.

The class declaration order is (HTTPException, RateLimitError) so the MRO
puts HTTPException's no-super-call __init__ ahead of openai's cooperative
__init__ chain — preventing openai.APIError.super().__init__(message) from
landing in HTTPException.__init__(status_code=message).

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* refactor(proxy/hooks): raise ProxyRateLimitError from budget + iteration limiters

Replaces three bare HTTPException(status_code=429, ...) call sites with
ProxyRateLimitError, which is both a RateLimitError (catchable by category)
and an HTTPException (preserves existing FastAPI serialization). Drops the
now-unused HTTPException import in the iteration / per-session limiters.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* refactor(proxy/hooks): raise ProxyRateLimitError from parallel-request limiters

Replaces HTTPException(status_code=429, ...) call sites in the v1 and v3
parallel-request limiters (key/team/user/model/customer rate limits) with
ProxyRateLimitError. Updates the raise_rate_limit_error helper's return type
annotation accordingly.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* refactor(proxy/hooks): raise ProxyRateLimitError from dynamic rate limiters

Replaces HTTPException(status_code=429, ...) call sites in the v1 and v3
dynamic rate limiters (project-level TPM/RPM allocation, model-saturation
checks, priority-based limits, fail-closed guards) with ProxyRateLimitError.
The v3 limiter still imports HTTPException for an unrelated bare 'except
HTTPException:' branch.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* refactor(proxy/hooks): raise ProxyRateLimitError from batch rate limiter

Replaces HTTPException(status_code=429, ...) in batch_rate_limiter._raise_rate_limit_error
with ProxyRateLimitError tagged as RateLimitErrorCategory.LITELLM_BATCH_RATE_LIMIT
so users can distinguish batch-level throttling (which counts requests/tokens
across an uploaded batch input file before submission) from the generic
key/team/user RPM/TPM limiter.

The HTTPException import is retained because the same module raises
HTTPException for unrelated 403/IO error paths.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(rate-limit): pin down unified rate-limit error contract

Adds a dedicated test module covering the new RateLimitErrorCategory enum,
RateLimitError.category default + override behavior, ProxyRateLimitError's
dual nature (RateLimitError + HTTPException), and a parametrized regression
guard that asserts every proxy hook module imports the unified class.

The regression guard catches the failure mode the refactor is designed to
prevent: someone re-introducing a bare HTTPException(status_code=429, ...)
in one of the hook modules instead of going through ProxyRateLimitError.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(logging): expose rate-limit category via StandardLoggingPayload

Adds an optional 'error_rate_limit_category' field to
StandardLoggingPayloadErrorInformation, populated from the unified
RateLimitError.category attribute (introduced in the previous commits on
this branch).

Why: the .category attribute is reachable off the raw exception today via
getattr(e, 'category', None), but the structured contract that downstream
custom callbacks / loggers / spend log writers consume is the
StandardLoggingPayload. Without this field, a user building custom
rate-limit metrics on top of callback data has to special-case the raw
exception object — which defeats the purpose of the StandardLoggingPayload
abstraction.

The field is None for non-rate-limit exceptions (so consumers can read it
unconditionally without isinstance checks) and is one of the
RateLimitErrorCategory string values otherwise.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(rate-limit): assert StandardLoggingPayload carries the category

Five tests covering: vendor default, explicit litellm_rate_limit and
litellm_batch_rate_limit values, None for non-rate-limit exceptions, and
None when no exception is provided. Pins down the contract that custom
callbacks can read 'error_information.error_rate_limit_category' off the
StandardLoggingPayload to drive custom rate-limit metrics without ever
reaching for the raw exception.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(types): silence mypy [misc] on intentional dual-base attr overlap

mypy emits two [misc] errors on the ProxyRateLimitError class line because
its two bases declare overlapping attributes with related-but-not-identical
annotations:

* status_code: int on starlette HTTPException vs. Literal[429] on openai's
  RateLimitError (every openai status-error subclass narrows it the same
  way and silences pyright with the same convention).
* headers: Mapping[str, str] | None on HTTPException vs. our Optional[
  Dict[str, str]] (the proxy hooks always carry a stringified dict).

Both narrowings are intentional and enforced at construction time. Add a
type: ignore[misc] with an inline explanation rather than relax the
annotations on the parent or change the wire-format guarantees.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(rate-limit): add direct hook-invocation tests to lift patch coverage

Adds six end-to-end tests that drive each refactored hook past its
limit and assert the unified ProxyRateLimitError is raised with the
correct category and dual-base shape. Complements the
import-shape-only parametrized guard above by actually executing the
new 'raise ProxyRateLimitError(...)' lines so codecov's patch coverage
sees them as hit.

Hooks covered (one test each):
* parallel_request_limiter v1 — direct call to raise_rate_limit_error()
* parallel_request_limiter v3 — direct call to _handle_rate_limit_error
  with a fabricated OVER_LIMIT response
* max_iterations_limiter — full async_pre_call_hook with mocked agent
  registry, second call exceeds budget=1
* max_budget_limiter — async_pre_call_hook with mocked get_current_spend
* dynamic_rate_limiter v1 — async_pre_call_hook with mocked
  check_available_usage forcing available_tpm == 0
* batch_rate_limiter — direct _raise_rate_limit_error call, asserts
  category is the batch-specific LITELLM_BATCH_RATE_LIMIT (not the
  generic LITELLM_RATE_LIMIT)

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix: guard rate_limit_category extraction with isinstance check

* test(rate-limit): cover remaining hook raise sites for codecov

Adds five more direct hook-invocation tests so every PR-touched line
in the proxy hooks is exercised by tests in tests/test_litellm/, which
codecov measures:

* parallel_request_limiter v1 — check_key_in_limits inline raise
  (the second raise site, separate from the raise_rate_limit_error
  helper covered earlier)
* dynamic_rate_limiter v1 — RPM raise branch (TPM branch was already
  covered)
* dynamic_rate_limiter v3 — parametrized over all three raise sites:
  model_saturation_check, priority_model, and the fail-closed
  fallback for an unrecognized descriptor_key
* max_budget_per_session_limiter — full async_pre_call_hook with a
  mocked agent registry and over-budget cached spend

All 42 tests in test_rate_limit_error_unification.py now pass and
together exercise every changed import + raise line across the eight
refactored proxy hooks.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix: use computed error_message in ProxyRateLimitError detail

* fix(parallel-request-limiter): drop None from detail; annotate raise_rate_limit_error as NoReturn

The v1 ' raise_rate_limit_error' helper built an unused 'error_message'
variable and then assembled the actual ' detail' via an f-string that
interpolated 'additional_details' verbatim — producing
'Max parallel request limit reached None' when invoked without
arguments (flagged by code review).

Fix the helper to:
- use the constructed 'error_message' as the detail
- annotate the helper as NoReturn since it always raises
- drop the redundant 'raise'/'return' at the two call sites

Add two regression tests covering both the with- and without-
additional_details paths.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy/hooks): drop literal 'None' from raise_rate_limit_error detail

The v1 parallel_request_limiter's raise_rate_limit_error helper has a
long-standing bug: it computes a None-guarded 'error_message' string but
then ignores it and emits an f-string that interpolates the raw
'additional_details' arg. Callers that pass no argument get
'Max parallel request limit reached None' as the user-facing detail.

This commit:
* wires error_message into the detail kwarg so the None-guard actually
  applies and operators see a clean message;
* changes the return-type annotation from ProxyRateLimitError to NoReturn
  (the function always raises) so type-checkers know callers after this
  invocation are unreachable.

Greptile P1 + P2 review feedback on PR #27687.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(types): demote TypedDict floating string to a # comment

A string literal placed after a field declaration in a TypedDict body is
not a per-field docstring — it's an orphaned string expression Python
discards. Tools like mypy / pyright that inspect TypedDict fields won't
surface that text either.

Move the documentation for error_rate_limit_category to a real comment
so the intent is visible to readers and type-checker tooling without
the misleading docstring framing.

Greptile P2 review feedback on PR #27687.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* security(exceptions): do not auto-copy vendor response headers to e.headers

A vendor 429 response can set arbitrary headers (Set-Cookie, CORS
overrides, …). Previously, when RateLimitError was constructed with only
a 'response=' (no explicit 'headers=' kwarg), self.headers fell back to
a copy of response.headers. If a downstream proxy serializer ever
forwarded e.headers to the client, a malicious upstream could inject
browser-interpreted headers for the proxy origin.

Drop the fallback. Only headers passed explicitly via the headers= kwarg
make it onto self.headers (proxy hooks pass retry-after etc. — they
control what's surfaced). Vendor response headers stay reachable on
e.response.headers for callers that explicitly want them.

Today's proxy_server.py route handlers don't actually forward e.headers
on the wire (they construct ProxyException without passing headers), so
no current behavior changes — this is a defensive narrowing so the
fallback can never be turned into a vector when someone wires
e.headers through later.

Veria-AI security review feedback on PR #27687.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(rate-limit): regression guards for review-pass fixes

Pins down the three review-pass fixes:

* test_parallel_request_limiter_v1_helper_no_additional_details — calls
  raise_rate_limit_error() with no args and asserts the detail does NOT
  contain the literal string 'None'. Pre-fix, callers got 'Max parallel
  request limit reached None'.
* test_rate_limit_error_does_not_auto_copy_response_headers — passes a
  vendor httpx.Response with a Set-Cookie header to RateLimitError
  WITHOUT an explicit headers= kwarg, asserts self.headers stays None
  (no leak), then re-checks that an explicit headers= kwarg DOES
  populate self.headers. Vendor headers remain reachable on
  e.response.headers for callers that explicitly want them.
* The existing v1-helper test now also asserts the additional_details
  string makes it through to the detail.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(rate-limit): add orthogonal RateLimitType (requests/tokens/concurrent_requests/budget/max_iterations)

trho's last ask in the LIT-2968 thread: distinguish rate-limit failures by
the dimension that was exceeded, not just by who rate-limited (vendor vs.
litellm). Adds:

- RateLimitType str-enum exposed at `litellm.RateLimitType` with values
  requests / tokens / concurrent_requests / budget / max_iterations.
- `rate_limit_type` kwarg on litellm.RateLimitError + ProxyRateLimitError;
  None default so existing callers (vendor-429 path in exception_mapping_utils)
  remain a no-op.
- StandardLoggingPayloadErrorInformation.error_rate_limit_type so custom
  callbacks can split rate-limit failures by cause without parsing free-text
  error messages. Mirror to error_rate_limit_category extraction in
  get_error_information(); single isinstance(RateLimitError) check covers both.
- map_v3_rate_limit_type() helper to collapse the v3 limiter's internal labels
  ("requests", "tokens", "max_parallel_requests") onto the public enum so
  the v3 limiter and dynamic_rate_limiter_v3 share one mapping. Defensive
  None on unknown values rather than silently picking a wrong dimension.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(proxy/hooks): wire rate_limit_type onto every limiter raise site

Each refactored proxy hook now populates rate_limit_type with the dimension
that actually tripped the limit, so downstream consumers (custom callbacks,
prometheus exporters via the StandardLoggingPayload) can split key/team/user
rate-limit failures by cause:

- parallel_request_limiter (v1): detect dimension from current vs. limit in
  the post-cache branch (concurrent_requests > tokens > requests, matches the
  boolean condition order). Base case (current is None, one limit set to 0)
  picks the most-specific zero. raise_rate_limit_error() helper accepts an
  explicit rate_limit_type kwarg with CONCURRENT_REQUESTS default (matches
  every existing internal call site, including the global-limit branch).
- parallel_request_limiter (v3): forward status["rate_limit_type"] through
  map_v3_rate_limit_type() so "max_parallel_requests" → CONCURRENT_REQUESTS
  for the public field while the raw v3 jargon stays on the HTTP header for
  wire-format backward compat.
- dynamic_rate_limiter (v1): TPM-zero → TOKENS, RPM-zero → REQUESTS. Pass
  data["model"] through so callbacks see the model that hit the limit
  (addresses the secondary "provider missing" complaint in the original
  Slack thread, partially — the model is what dashboards typically split on).
- dynamic_rate_limiter (v3): forward status["rate_limit_type"] via
  map_v3_rate_limit_type() at every raise site (model_saturation_check,
  priority_model, fail-closed unknown-descriptor guard). Also pass model.
- batch_rate_limiter: limit_type is hard-typed "requests"|"tokens" — map
  directly without going through the helper's None branch.
- max_budget_limiter, max_budget_per_session_limiter: BUDGET.
- max_iterations_limiter: MAX_ITERATIONS.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(rate-limit): cover RateLimitType enum, hook wiring, and StandardLoggingPayload propagation

27 new tests across five new test classes:

- TestRateLimitType: enum exposed at litellm.RateLimitType, all five values
  defined, RateLimitError default is None (vendor 429 path makes no claim
  about which dimension), accepts both string and enum forms with
  str-coercion guarantee for downstream JSON serializers.
- TestProxyRateLimitErrorType: ProxyRateLimitError default is None, accepts
  string or enum, doesn't break existing callers that pass nothing.
- TestMapV3RateLimitType: pins each v3-internal → public-enum mapping
  (tokens, requests, max_parallel_requests → concurrent_requests, unknown
  → None) so a future v3 refactor can't silently swap dimensions.
- TestStandardLoggingPayloadCarriesType: the new error_rate_limit_type
  field reaches the structured payload for both ProxyRateLimitError and
  plain RateLimitError, is None when unspecified, and is None for
  non-rate-limit exceptions (symmetric with error_rate_limit_category).
- TestProxyHooksWireTypeCorrectly: drives the actual raise sites in the
  v1 parallel_request_limiter helper, the v3 _handle_rate_limit_error
  (both "tokens" and "max_parallel_requests" paths), and the batch
  limiter (both tokens and requests paths) — coverage tools see the new
  rate_limit_type= kwargs as exercised, not just the import shape.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(rate-limit): cover _coerce_message branches and v1 dimension detection

Drives the patch coverage on the new orthogonal RateLimitType wiring up
to (or close to) 100% on the touched files.

ProxyRateLimitError._coerce_message — was 22% covered, now 100%:
* nested {error: {message}} dict
* nested {message: {message}} dict (alt key)
* dict without 'error'/'message' keys → JSON dump fallback
* non-JSON-serializable dict value → str() fallback
* non-string non-mapping detail (int) → str() coercion

v1 parallel_request_limiter dimension detection — was 0% covered, now
exercised across 6 parametrized cases:
* check_key_in_limits else-branch: current at concurrent / TPM / RPM cap
  → asserts rate_limit_type is concurrent_requests / tokens / requests.
* check_key_in_limits base case (current is None): max_parallel_requests
  / tpm_limit / rpm_limit set to 0 → asserts the most-specific zero
  attribution wins per the helper's order.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(proxy/hooks): add ProxyHTTPRateLimitError + provider resolver

Introduces a small helper layer used by every proxy-side rate-limit
hook so that the 429 they raise carries a populated llm_provider /
model — instead of an empty exception.llm_provider that downstream
loggers (Prometheus failure metric, observability callbacks) read as
'no provider attribution'.

ProxyHTTPRateLimitError inherits from both fastapi.HTTPException
(so the proxy server still renders it as a 429) and
litellm.exceptions.RateLimitError (so isinstance checks and
PrometheusLogger._get_exception_class_name pick up llm_provider).
We deliberately don't call RateLimitError.__init__ — it constructs
an httpx.Response we don't need and would just add failure surface;
attribute parity is what downstream consumers care about.

resolve_llm_provider_for_rate_limit() wraps litellm.get_llm_provider
defensively. Internal limiter hooks fire from async_pre_call_hook —
well before get_llm_provider runs anywhere else in the request
lifecycle — so we have to call it ourselves at raise time. If the
model is missing or unparseable (alias, router-only model) we fall
back to llm_provider='litellm_proxy' rather than letting a second
exception leak out and break the request path.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy/hooks): populate llm_provider on parallel-request 429s

Both v1 and v3 parallel-request limiters fired bare HTTPException(429)
from inside async_pre_call_hook. The downstream Prometheus failure
metric reads exception.llm_provider via _get_exception_class_name —
the empty value showed up as exception_class='HTTPException' and
left model_id='None' on the time series.

Threads requested_model through every raise site in:

* parallel_request_limiter.py:
  - check_key_in_limits (the per-key/per-model/per-user/per-team/
    per-customer over-limit path)
  - raise_rate_limit_error (zero-limit + global_max_parallel_requests
    paths) — now takes an optional requested_model kwarg
* parallel_request_limiter_v3.py:
  - _handle_rate_limit_error (the OVER_LIMIT translator), called
    from both the should_rate_limit pre-check and the TPM
    reservation path

Resolved via resolve_llm_provider_for_rate_limit so unknown / missing
models silently fall back to llm_provider='litellm_proxy' instead of
breaking the request path with a second exception.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy/hooks): populate llm_provider on dynamic-rate-limit 429s

Same plumbing change as the parallel limiters, applied to both
dynamic_rate_limiter (v1) and dynamic_rate_limiter_v3:

* v1: TPM-zero and RPM-zero paths in async_pre_call_hook now resolve
  data['model'] -> (model, llm_provider) once and pass it into both
  raises.
* v3: All three raise sites in _check_rate_limits — the
  model_saturation_check enforced raise, the priority_model
  enforced raise, and the fail-closed unknown-descriptor branch —
  now attribute the 429 to the actual provider.

Falls back to llm_provider='litellm_proxy' when the model can't be
resolved.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy/hooks): populate llm_provider on batch-rate-limit 429s

batch_rate_limiter._raise_rate_limit_error now takes a
requested_model kwarg threaded from data['model'] in
_check_and_increment_batch_counters. The batch-creation 429 is what
gets raised when the input file's tokens/requests count would push
the per-key TPM/RPM window over its limit.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy/hooks): populate llm_provider on budget/iterations 429s

Final batch of internal raise sites — the user/session-budget and
max-iterations hooks. Same pattern: resolve data['model'] once at
raise time, attach to ProxyHTTPRateLimitError so Prometheus and
observability callbacks can attribute the 429.

Hooks updated:
* max_budget_limiter (per-user max_budget exceeded)
* max_iterations_limiter (per-session agent iteration cap)
* max_budget_per_session_limiter (per-session dollar cap)

All three fall back to llm_provider='litellm_proxy' when data['model']
is missing or unparseable. Drops the now-unused HTTPException import
from each module.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(proxy/hooks): pin provider field on internal rate-limit 429s

Regression coverage for the 'provider field missing' bug across every
proxy-side rate-limit hook + the helper layer:

* ProxyHTTPRateLimitError class shape (HTTPException + RateLimitError,
  dict-detail stringification, None-provider normalization).
* resolve_llm_provider_for_rate_limit happy paths
  (gpt-4o-mini, anthropic/..., bedrock/...) plus all three fallback
  branches (None, '', unknown name) plus a 'get_llm_provider raises'
  case that asserts we swallow the secondary exception.
* For each limiter (parallel v1/v3, dynamic v1/v3, batch,
  max_budget, max_iterations, max_budget_per_session): assert the
  raised exception is a RateLimitError carrying the resolved
  model + llm_provider, and a sibling test that asserts the
  fallback path returns 'litellm_proxy' without leaking a second
  exception.
* Two PrometheusLogger._get_exception_class_name pins so the
  Prometheus failure metric label flips from 'HTTPException' to
  'Openai.ProxyHTTPRateLimitError' (or 'Litellm_proxy.*' on
  fallback) — that's what dashboards consume.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* perf(proxy/hooks): defer provider resolution to over-limit branches

* fix: use error_message in raise_rate_limit_error to avoid literal 'None' in detail

* Consolidate rate_limiter_utils imports in dynamic_rate_limiter

* fix(proxy): set num_retries/max_retries on ProxyHTTPRateLimitError

ProxyHTTPRateLimitError inherits from RateLimitError but did not call
RateLimitError.__init__, so num_retries/max_retries were never set.
When Starlette's HTTPException lacks __str__, MRO falls through to
RateLimitError.__str__, which unconditionally reads these attributes
and raises AttributeError during logging/traceback formatting.
Initialize them to None defensively.

* fix(mypy): silence base-class status_code conflict on ProxyHTTPRateLimitError

HTTPException declares 'status_code: int' while openai.RateLimitError
(via APIStatusError) declares 'status_code: Literal[429] = 429'. Mypy
flags the multi-base override as [misc] in CI lint. The runtime semantics
are fine (we set self.status_code in __init__), so silence the
class-level annotation conflict with a targeted ignore.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix: annotate batch limiter _raise_rate_limit_error as NoReturn

* feat(prometheus): rate-limit category/type labels + exception_class back-compat (follow-up to #27687) (#27706)

* feat(prometheus): add rate_limit_category and rate_limit_type labels

Adds two new labels to litellm_proxy_failed_requests_metric so dashboards
can split 429s by rate-limit source (vendor vs. litellm-internal) and by
the dimension that was exceeded (requests/tokens/concurrent_requests/
budget/max_iterations) without parsing free-text error messages.

Closes the Prometheus side of LIT-2718. The unified RateLimitError.category
and .rate_limit_type fields landed in PR #27687 but were only surfaced on
StandardLoggingPayload (custom-callback channel); this exposes them on
the metric label set as well.

Both labels are populated only when the underlying exception is a
litellm.RateLimitError; non-rate-limit failures keep them empty.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(prometheus): populate rate-limit labels + preserve exception_class back-compat

Two coupled changes in the Prometheus integration:

1. async_post_call_failure_hook now extracts the new RateLimitError
   .category / .rate_limit_type fields (added in PR #27687) via a
   _extract_rate_limit_labels helper and forwards them through
   UserAPIKeyLabelValues onto litellm_proxy_failed_requests_metric.
   Empty for non-rate-limit failures.

2. _get_exception_class_name special-cases ProxyRateLimitError and
   keeps emitting 'HTTPException' for the exception_class label.
   Without this shim, ProxyRateLimitError (which multi-inherits from
   HTTPException + RateLimitError) would silently flip the label
   from 'HTTPException' (the historical value for proxy-side 429s)
   to 'ProxyRateLimitError', breaking existing dashboards / alerts
   that key off exception_class='HTTPException'. Distinguishing
   vendor vs. litellm 429s is now the job of the new
   rate_limit_category label.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(prometheus): cover rate-limit labels and exception_class back-compat

Adds 19 tests across:
- enum / label-list registration
- _extract_rate_limit_labels for vendor RateLimitError, ProxyRateLimitError,
  non-rate-limit and None inputs (incl. parametrized over every
  RateLimitErrorCategory x RateLimitType combo)
- _get_exception_class_name back-compat: ProxyRateLimitError keeps the
  legacy 'HTTPException' string while vendor RateLimitError keeps the
  historical 'Provider.ClassName' format
- end-to-end through async_post_call_failure_hook with both
  ProxyRateLimitError and vendor RateLimitError, asserting both new
  labels populate and exception_class stays back-compat

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(prometheus): tolerate missing fastapi in lazy ProxyRateLimitError import

Address greptile feedback:
- async_post_call_failure_hook docstring: drop the stale labelnames listing
  and reference PrometheusMetricLabels.litellm_proxy_failed_requests_metric
  as the source of truth so the doc cannot drift from the actual labelset.
- _get_exception_class_name: guard the lazy ProxyRateLimitError import with
  ImportError so router-side fallback callsites don't blow up in non-proxy
  installs that don't have fastapi (a transitive dep of
  proxy.common_utils.proxy_rate_limit_error). Behavior is unchanged when
  fastapi is available.

Also fix the existing enterprise callback test that asserted the old
labelset on litellm_proxy_failed_requests_metric — it now expects the new
rate_limit_category / rate_limit_type labels populated for vendor 429s.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(bugbot): simplify rate-limit label coercion + guard None detail

- prometheus.py _extract_rate_limit_labels: RateLimitError.__init__ already
  normalizes category/rate_limit_type to plain str, so the getattr(.value)
  + isinstance dance was dead code. Reduce to str(value) if not None.
- proxy_rate_limit_error.py _coerce_message: short-circuit None to ''
  instead of falling through to str(None) = 'None', which produced the
  literal message 'litellm.RateLimitError: None'.

* fix(rate-limit): surface unified category/type fields on BudgetExceededError

The most common budget cap (virtual-key max_budget enforcement in
auth_checks.py) raises litellm.BudgetExceededError, a bare Exception
subclass that bypassed the unified rate-limit error class introduced
by PR #27687. Custom callbacks reading
StandardLoggingPayload.error_information saw category=None and
rate_limit_type=None for these 429s, missing the most common budget
case (team / org / end-user budgets all hit the same code path).

Surface the fields off BudgetExceededError as plain attributes:
- category = RateLimitErrorCategory.LITELLM_RATE_LIMIT
- rate_limit_type = RateLimitType.BUDGET
- llm_provider = "" (or caller-supplied)

Switch get_error_information and _extract_rate_limit_labels from
isinstance(RateLimitError) gating to duck-typed attribute reads,
guarded by membership in the rate-limit enums so unrelated third-party
exceptions exposing a .category attribute can't leak garbage values
into the payload.

This is strictly additive: BudgetExceededError keeps its bare-Exception
base class, so `except BudgetExceededError:` handlers keep firing and
`except RateLimitError:` does not start catching budget errors.

* fix(rate-limit): validate enum membership at duck-typed read sites + enrich BudgetExceededError llm_provider

Two follow-ups uncovered during the second QA pass on PR #27687:

1. Guard third-party `.category` / `.rate_limit_type` attribute leakage.
   The duck-typed read in `get_error_information` and
   `_extract_rate_limit_labels` would forward any string attribute named
   `category` / `rate_limit_type` on an unrelated third-party exception
   into the StandardLoggingPayload and Prometheus labels — silently
   mislabeling custom-callback payloads and blowing out Prometheus label
   cardinality. Add `validate_rate_limit_category` /
   `validate_rate_limit_type` helpers that gate on the documented enum
   value sets; non-matching values are dropped to None.

2. Enrich BudgetExceededError.llm_provider from request_data.
   Budget checks live in tenant-scoped helpers (key / team / org / tag /
   end-user / project) that don't see the request model, so the
   BudgetExceededError they raise carried llm_provider="" — leaving
   custom-metrics consumers without provider attribution for the most
   common 429 case. Resolve it once at the central
   UserAPIKeyAuthExceptionHandler seam, before post_call_failure_hook
   fires, so the StandardLoggingPayload the callback sees has the same
   provider attribution as RPM/TPM 429s.

Regression tests pin both: 4 leakage tests + 4 enrichment tests. The
leakage tests would fail under the pre-validation version of either read
site; the enrichment tests would fail if the handler skipped the
resolver call.

* fix(rate-limit): resolve router model_name aliases to real provider (#27914)

* fix(rate-limit): resolve router model_name aliases to real provider

For nearly every real LiteLLM proxy deployment the request model is a
router model_name alias (e.g. 'tpm-locked' -> litellm_params.model:
openai/gpt-4o-mini), and 'litellm.get_llm_provider' doesn't know about
router aliases — it raises 'LLMProviderNotProvidedError'. The resolver
then fell through to the defensive 'litellm_proxy' fallback, so the
'llm_provider' field this PR adds was effectively always
'litellm_proxy' in the field, defeating its purpose for the most common
proxy configuration.

Add a router-alias fallback step: when 'get_llm_provider' raises, scan
the active 'llm_router.model_list' for a deployment whose 'model_name'
matches the request model and resolve from its 'litellm_params.model'
instead. If multiple deployments share the same alias (load-balancing
case) the first one wins — every deployment under one alias should
agree on provider in any sensible config, and 'first' is deterministic
so the Prometheus label stays stable.

Defensive throughout: an uninitialized router, a malformed deployment,
a 'litellm_params.model' that itself fails 'get_llm_provider' — every
branch falls through to the existing 'litellm_proxy' fallback rather
than letting a secondary exception escape and mask the rate-limit
error we're trying to surface.

Tests:
  - test_router_alias_resolves_to_underlying_provider: alias
    'tpm-locked' -> 'openai/gpt-4o-mini' produces provider='openai',
    model='gpt-4o-mini'.
  - test_router_alias_with_multiple_deployments_uses_first.
  - test_router_alias_unknown_falls_back.
  - test_router_alias_with_malformed_deployment_falls_back.
  - Existing fallback test updated to also stub
    'litellm.proxy.proxy_server.llm_router' so it exercises the
    full 'no resolution anywhere' path.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(rate-limit): harden router alias resolver + test isolation

- Wrap _resolve_provider_from_router_alias loop in top-level try/except so
  a non-iterable model_list / unexpected deployment shape can't escape and
  mask the 429 with a 500.
- Type-check litellm_params before .get() to handle non-dict truthy values.
- Patch llm_router=None in the parametrized fallback test so a router left
  by another test in the session can't redirect the unknown-model path.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(bugbot): preserve "BudgetExceededError" Prometheus label

Adding llm_provider to BudgetExceededError (so callbacks get provider
attribution from StandardLoggingPayload) made the provider-prefix step in
_get_exception_class_name silently flip the label from "BudgetExceededError"
to e.g. "Openai.BudgetExceededError", breaking dashboards keyed on the
historical value.

Short-circuit BudgetExceededError in _get_exception_class_name the same way
ProxyRateLimitError already is. Provider/category attribution still lands on
the new rate_limit_category / rate_limit_type labels.

* test: fix invalid 'rpm' rate_limit_type in v3 limiter test mocks

The v3 rate limiter only emits 'requests', 'tokens', or
'max_parallel_requests'. Using 'rpm' caused map_v3_rate_limit_type to
return None, leaving the expected RateLimitType.REQUESTS untested.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(bugbot): hoist provider resolver + opt-in prom rate-limit labels

- dynamic_rate_limiter.py: hoist resolve_llm_provider_for_rate_limit
  above the TPM/RPM if/elif so the lookup runs once per request, matching
  the pattern in dynamic_rate_limiter_v3.py.
- prometheus.py: gate the new rate_limit_category / rate_limit_type
  labels on litellm_proxy_failed_requests_metric behind
  litellm.prometheus_emit_rate_limit_labels (default False). Mirrors the
  existing prometheus_emit_stream_label opt-in. Preserves the metric's
  pre-unification label set so existing dashboards / recording rules
  keep matching after upgrade; operators can enable the new labels once
  downstream consumers include them.
- Tests updated: default-off back-compat case, opt-in path enables the
  flag before asserting label presence.

* fix: stabilize prometheus label sets and drop redundant model normalization

- Cache PrometheusLogger.get_labels_for_metric per metric_name so that
  the label set used to construct counters at __init__ time stays in
  sync with the label set used at increment time, even if module-level
  toggles like prometheus_emit_rate_limit_labels or
  prometheus_emit_stream_label are flipped at runtime. Without this,
  toggling these flags after the logger was created would cause
  ValueError from prometheus_client because the runtime labels would
  not match the counter's declared labelnames.
- Drop redundant 'model or ""' guard in ProxyRateLimitError.__init__
  where model is already normalized one step earlier.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* perf(dynamic_rate_limiter): only resolve provider when rate limit hit

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(prometheus): clear cached metric labels after toggling rate-limit flag

The PrometheusLogger caches each metric's label set at construction
time so that labels used at counter.labels(...) time stay consistent
with the labels the metric was registered with. The enterprise
async_post_call_failure_hook test toggles
litellm.prometheus_emit_rate_limit_labels = True AFTER the fixture
has already built the logger, so without invalidating the cache the
rate_limit_category / rate_limit_type labels never reach the mocked
counter and the assert_called_once_with check fails.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test: fix CI failures from prom label cache + flaky time-window assertion

PrometheusLogger.get_labels_for_metric now caches the per-metric label
set at first read so the labels passed to counter.labels(...) stay in
lock step with the labels the counter was registered with. This broke
two existing test patterns:

- test_prometheus_labels.py: tests bind the real method onto a
  MagicMock, but MagicMock auto-creates a Mock for _cached_metric_labels
  whose .get(...) returns a truthy Mock — treated as a populated cache
  and returned as the label set, producing empty filtered labels and
  KeyError on labels["requested_model"] / ["route"]. Seed real {}
  containers for _cached_metric_labels and label_filters before binding.

- test_prometheus_logging_callbacks.py::test_set_team_budget_metrics_with_custom_labels:
  the fixture builds the logger before the test monkeypatches
  litellm.custom_prometheus_metadata_labels, so the cached label set
  never picks up the new metadata labels. Clear the cache after the
  monkeypatch (same pattern already used for the rate-limit toggle in
  test_async_post_call_failure_hook).

UI: view_logs/index.test.tsx "Last Minute" window assertion is off by
one at the minute boundary. start_date is floored to the minute, so the
dropped sub-minute fraction can push the truncated-seconds diff up to
(minMinutes+1)*60 exactly when the click lands near a minute rollover.
Switch the upper bound to toBeLessThanOrEqual.

* feat(otel-v2): surface rate_limit_category + rate_limit_type on failed LLM-call spans

PR #28909 introduced the typed v2 OTel engine that builds spans from
StandardLoggingPayload, with SpanError carrying error_type + message and
the genai mapper stamping error.type onto every failed LLM-call span.
This PR's earlier commits added error_rate_limit_category and
error_rate_limit_type to the same StandardLoggingPayload.error_information
the v2 engine reads — but neither field reached a span attribute, so v2
OTel traces stayed opaque about *why* a 429 fired (vendor vs litellm,
RPM vs TPM vs concurrent vs budget vs max_iterations) even after the
custom-callback and prometheus surfaces gained that decomposition.

Three coupled changes:

1. semconv.py: add LiteLLM.ERROR_RATE_LIMIT_CATEGORY /
   LiteLLM.ERROR_RATE_LIMIT_TYPE under the litellm.* vendor namespace
   (no GenAI semconv equivalent exists for who-rate-limited /
   which-dimension).

2. payloads.py: extend SpanError with rate_limit_category +
   rate_limit_type, populated by _parse_error() from the same
   error_information.error_rate_limit_* fields the custom-callback
   channel and prometheus rate_limit_category / rate_limit_type labels
   read. Single source of truth across all three observability surfaces.

3. mappers/genai.py: stamp the two attributes on the LLM-call span when
   present. drop_none guarantees they stay absent (not 'None') for
   non-rate-limit failures so trace consumers can read them
   unconditionally.

Three regression tests in test_otel_v2_emitter.py pin: a vendor /
litellm-internal RateLimitError lands category=litellm_rate_limit +
rate_limit_type=requests on the span; a BudgetExceededError lands
rate_limit_type=budget; a non-rate-limit failure (BadRequestError)
keeps the rate_limit_* attributes absent. Mutation-tested against
reverting either the SpanError extension or the _parse_error read site
— both new tests fail under either mutation.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test: align prometheus user-budget + logs quick-select tests with merged code

The merge into this branch left two test patterns out of step with the code
they exercise.

test_set_user_budget_metrics_includes_user_email_and_alias_labels_when_opted_in
flipped litellm.prometheus_user_budget_label_include_email_alias after the
fixture had already built the PrometheusLogger. get_labels_for_metric now
snapshots each metric's label set at construction time, so the runtime flip
no longer reached the cached labels. Enable the flag before constructing the
logger, matching how the proxy applies config at startup.

view_logs/index.test.tsx referenced uiSpendLogsCall and moment without
importing them, and the merged index.tsx now fetches through
useLogFilterLogic (the hook the file stubs out) rather than calling
uiSpendLogsCall directly. Add the imports and restore the real hook for the
Quick Select window assertions so the call is actually observed.

* refactor(otel/v2): drop rate-limit decomposition from the LLM-call span

Proxy-side rate limits (litellm_rate_limit, budget, max_iterations) are
rejected at the gate before any upstream call, so async_post_call_failure_hook
tags the synthetic failure log with LITELLM_LOGGING_NO_UPSTREAM_LLM_CALL and the
v2 OTel logger never opens an LLM-call span for them; the
litellm.error.rate_limit_category / litellm.error.rate_limit_type attributes
were dead for exactly the cases they were meant to surface. The only failure
that does open an LLM-call span carrying a RateLimitError is a vendor 429, where
rate_limit_type is always None and the category just restates
error.type=RateLimitError.

The decomposition still reaches downstream consumers through
StandardLoggingPayload.error_information.error_rate_limit_* and the prometheus
rate_limit_category / rate_limit_type labels, both unchanged.

Removes the SpanError fields, the _parse_error reads, the genai mapper
attributes, the semconv keys, and the three span tests that asserted a scenario
that never reaches the mapper in production.

* fix(batch_rate_limiter): map max_parallel_requests to concurrent_requests

* refactor(prometheus): drop transitive fastapi import from _get_exception_class_name

Read the legacy exception_class label from a prometheus_exception_class_name
marker on ProxyRateLimitError instead of importing the proxy module, keeping
the integrations layer free of a transitive fastapi dependency.

* chore(ui): sync schema.d.ts with unified rate-limit error spec

The ProxyRateLimitError docstring flows into the proxy OpenAPI spec's 429
response description, so the generated dashboard types were out of sync.
Regenerated via npm run gen:api (Check UI API Types Sync).

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
2026-06-06 17:50:29 -07:00
ryan-crabbe-berri
f31d059aa3
feat(ui): add budget duration to edit team member form (#29717)
* feat(ui): add budget duration to edit team member form

Editing a team member created a member budget with no duration, so the
budget never reset. This threads a budget reset period through the edit
flow end to end and reuses the shared duration dropdown so the options
stay in sync with the rest of the UI.

Resolves LIT-2651

* fix(proxy): validate member budget_duration and persist clears

Reject budget_duration values that can't be parsed, are non-positive, or overflow date math before any write, so a bad value can't be persisted and later crash the budget reset job.

Clearing the budget duration in the edit-member form now sends null and clears the column end to end, so the dropdown's clear control reflects a real change instead of being a no-op

* chore(ui): regenerate schema.d.ts for member budget_duration

Adds budget_duration to TeamMemberUpdateRequest/Response in the generated dashboard types so the Check UI API Types Sync gate passes
2026-06-06 17:24:55 -07:00
Mateo Wang
aeb55e7a11
fix(mcp): highlight MCP cards red when the logged-in user is missing per-user env vars (#29856)
* fix(mcp): flag missing per-user env vars on the card for every accessible server

The dashboard MCP card grid lists servers via the registry-backed manager
(get_all_mcp_servers_unfiltered for admins in view_all mode, the allowed-context
aggregation otherwise), but the per-user env-var status endpoint that drives the
red "user fields missing" highlight resolved servers through the much narrower
get_all_mcp_servers_for_user, which only returns servers explicitly granted on
the calling key. An admin's dashboard session key carries no per-server MCP
grant, so the status feed came back empty and the card never turned red even
when the logged-in user had not filled in their required variables.

Both surfaces now share a single _resolve_accessible_mcp_servers helper, so the
status feed is computed over exactly the cards the user sees. The helper returns
servers unredacted; the status endpoint needs the raw env_vars and still only
ever reports is_set booleans, never the stored secret values.

* test(mcp): drop dead get_all_mcp_servers_for_user patch from view_all regression test

The bulk status endpoint resolves servers through _resolve_accessible_mcp_servers
now, so the old get_all_mcp_servers_for_user patch in the admin view_all
regression test is never hit. Removing it keeps the test honest about which code
path it exercises.
2026-06-06 16:51:25 -07:00
Mateo Wang
d61f7747c0
feat(bedrock): forward strict and additionalProperties to Converse toolSpec (#29814)
* feat(bedrock): forward strict and additionalProperties to Converse toolSpec

Bedrock Converse supports strict in toolSpec since 2026-02, but
_bedrock_tools_pt only whitelisted type/properties/required/name/description,
so strict: true was silently dropped and Claude-on-Bedrock ignored enum
constraints that GPT and direct-Anthropic honored. Forward strict from the
OpenAI function and additionalProperties from the schema (Bedrock requires
the latter alongside strict), passing each only when present.

https://claude.ai/code/session_01WQjWd8NfUB3vxERwudbHkv

* fix(bedrock): only forward strict tool schemas to Claude on Converse

Nova, Llama and GPT-OSS on Bedrock reject the strict field
(BedrockException 'This model doesn't support the strict field'), and the
GPT-OSS request-body test asserts strict/additionalProperties are stripped.
Forwarding them to every model broke the llm_translation suite, so gate the
forwarding on the anthropic base model since only Claude honours strict
tool schemas on Bedrock.
2026-06-06 16:28:18 -07:00
milan-berri
273855b4e2
fix(responses-bridge): map system-only chat request to system input item (#29817)
System-only chat requests mapped the system message to instructions and left
input=[], which OpenAI's Responses API rejects (it also rejects input=""). When
no other messages are present, carry the system message as a role:"system" input
item (single copy, correct role) instead of leaving input empty. Mirrors the
existing handling of non-string system content. Fixes Open WebUI new-conversation
failures on mode:responses Codex models.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-06 16:11:54 -07:00
Yassin Kortam
68d67212cd
fix: 400 on Anthropic context overflow; seed identity on failed auth (#29848) 2026-06-06 14:57:41 -07:00
Mateo Wang
33c363d4d4
Extend the record/replay proxy to chat, embeddings, moderations, rerank, and Anthropic (#29847)
* test(ci): extend record/replay proxy to chat, embeddings, moderations, rerank, anthropic

The record/replay proxy that took the gpt-image-1 spend E2E off the live OpenAI
path now fronts every provider, so the other real-provider E2Es stop paying for
and depending on live calls each commit. It keys per upstream and selects a
non-OpenAI provider by a /__recorder_upstream/<host>/ path prefix carried on the
model's api_base, since some litellm handlers (cohere rerank) drop custom
request headers. Wired into build_and_test (chat, embeddings, moderations,
image), the otel job (cohere rerank), and the anthropic-messages job via a
reusable start_openai_record_replay_proxy command.

Dropped the time.time()/uuid prompt cache-busters in the build_and_test chat
tests, whose config has the response cache off, so identical requests are
recordable. The image spend test now asserts a repeat call still bills spend,
failing loudly if the proxy response cache is ever turned on.

Responses, the anthropic passthrough, bedrock, and fake-endpoint tests are left
live: their lifecycles, api_base assertions, providers, or fake targets make a
stateless body-keyed cache either break them or add nothing.

* docs(ci): note the recorder command's OpenAI default upstream and prefix override

Addresses a review note: the shared start_openai_record_replay_proxy command
defaults the upstream to OpenAI, so a non-OpenAI model must carry the
/__recorder_upstream/<host>/ prefix on its api_base. Document that in the
command description so a future caller does not assume the default follows the
provider.
2026-06-06 14:33:42 -07:00
Shivam Rawat
fdade8a84e
Title: fix(proxy): resolve vector store file list credentials from team deployments (#29739)
* fix(proxy): resolve vector store file list credentials from team deployments

GET /v1/vector_stores/{id}/files now uses the same router credential routing as POST, including JWT team model hints and wildcard model selectors, so list requests no longer call OpenAI with Bearer None.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(proxy): authorize model hints and fix credential routing for vector store file list

Resolves three review findings on the vector store file list path.

Authorize user-controlled model hints (?model= query param and the
x-litellm-model header) against the key's and team's allowed models via
can_key_call_model / _can_object_call_model before any deployment
credentials are resolved, closing a model access bypass where a normal
key could file-list using a restricted deployment's provider credentials.

Run the managed vector store registry resolution before the model routing
hint so the managed store sets the routing model first; the hint resolver
then selects credentials matching that model instead of a team fallback
deployment, avoiding a credential/model mismatch across deployments.

Skip team-fallback deployments whose provider cannot be determined instead
of treating them as OpenAI, so a deployment without an explicit
custom_llm_provider or "openai/" prefix no longer has its credentials
injected.

* fix(proxy): enforce vector store file model auth

Ensure vector store file listing routes authorize explicit and inferred model routing before resolving deployment credentials.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(proxy): type guard vector store model hints

Keep vector store model hint authorization typed to string-only values so static checks pass.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-06 12:36:05 -07:00
Shivam Rawat
1fbb78d2a4
Title: Fix managed batch cancel credential resolution (#29734)
* Fix managed batch cancel credential resolution

Decode unified batch IDs before cancel routing and resolve litellm_credential_name to api_key in Router._acancel_batch so JWT team-scoped deployments cancel with the same credentials used at create time

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix batch cancellation credential cleanup

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-06 12:35:18 -07:00
Mateo Wang
51769a8ede
feat(fal_ai): add Nano Banana / Gemini 2.5 Flash Image generation support (#29798)
* feat(fal_ai): add Nano Banana / Gemini 2.5 Flash Image generation support

Adds a FalAINanoBananaConfig for fal.ai's Nano Banana models, exposed under
both fal-ai/nano-banana and fal-ai/gemini-25-flash-image (identical schema).
This is the migration path for fal-ai/imagen4, which fal deprecates on
2026-06-30.

The config derives the request endpoint from the model name so both aliases
route correctly, maps OpenAI image params to the fal schema (n -> num_images,
size -> nearest supported aspect_ratio, response_format ignored since the model
returns URLs), and reuses the base fal response parser. Pricing is registered
at 0.039 per image in the cost map and backup.

* fix(fal_ai): tighten nano-banana routing and guard mapped params

Match the specific gemini-25-flash-image / gemini-2.5-flash-image
aliases instead of any model containing gemini so future fal.ai
Gemini-branded models aren't silently misrouted to the nano-banana
config. Guard the param mapping on the fal-side keys (num_images,
aspect_ratio) so a pre-set mapped value is respected and an OpenAI
key is never forwarded unmapped.

* fix(fal_ai): drop non-existent gemini-2.5-flash-image routing alias

fal.ai only serves the dotted-free fal-ai/gemini-25-flash-image and
fal-ai/nano-banana endpoints. Routing the dotted gemini-2.5-flash-image
alias built a https://fal.run/fal-ai/gemini-2.5-flash-image URL that
fal.ai 404s and had no pricing entry, so spend tracking silently fell to
zero. Match only the two real endpoint slugs.
2026-06-06 11:16:44 -07:00
Mateo Wang
b3297fc2ea
feat(proxy): hot-reload .env in dev when running with --reload (#29783)
* feat(proxy): hot-reload .env in dev when running with --reload

The --reload watcher already restarts the worker on *.py and --config YAML
edits, but .env was unwatched, so changing a key there did nothing until a
manual restart. Add .env to the uvicorn reload_includes (and to the
StatReload monkeypatch, which ignores reload_includes) so an edit triggers a
worker restart.

A reloaded worker is a fresh process that inherits the reloader's
environment, so load_dotenv(override=False) would keep serving the stale
inherited value for any key already in the environment. The CLI now exports
LITELLM_DEV_ENV_HOT_RELOAD when --reload is set, and litellm/__init__.py
reads it to load .env with override=True only on that dev path, leaving
normal startup precedence untouched.

* feat(proxy): warn that --reload makes .env override shell env vars

When --reload is active, worker processes re-read .env with override=True, so
.env values win over shell-exported environment variables. Surface this dotenv
precedence change with a startup warning so a developer who relies on a
shell-exported override is not silently surprised.

* fix(proxy): type reload helper paths as Optional[str] to satisfy mypy

* fix(proxy): watch the cwd .env in both reload backends for parity

WatchFiles only watches cwd (and the --config dir) for .env, while the
StatReload fallback used find_dotenv(usecwd=True), which walks up to a
parent-dir .env that WatchFiles never sees. Point StatReload at the same
cwd .env so the two reload backends react to the same file.
2026-06-06 09:39:21 -07:00
Mateo Wang
aa7845dc5e
test(ci): make the image-gen record/replay proxy report cache mode and per-request HIT/MISS (#29802)
The recorder could come up pointed at a missing or unreachable cassette redis
and silently forward every request live; the health check still passed and the
process logged nothing, so a CI run looked identical whether it replayed from
the cassette or paid OpenAI for a fresh call every commit. There was no way to
tell from the logs whether the 24h caching was actually happening.

It now announces its mode at startup (REPLAY when the cassette redis is
reachable, PASSTHROUGH when CASSETTE_REDIS_URL is unset, DEGRADED when it is set
but the redis is unreachable) and logs a HIT/MISS line per request. _cache_set
returns whether the write landed so a mid-run redis failure surfaces as a
warning instead of masquerading as a successful record.

Adds unit tests covering the three startup modes and the HIT/MISS/not-recorded
request paths; both new behaviors were mutation-checked.
2026-06-06 09:36:06 -07:00
tin-berri
22186f457a
fix(ui): persist Tools-tab MCP OAuth token to DB (#29809) 2026-06-05 22:29:56 -07:00
Mateo Wang
4ec4ab99d0
feat(mcp): per-server env vars with global + per-user scopes (#28917) 2026-06-05 20:15:11 -07:00
yuneng-jiang
53cf3d8416
fix(proxy): drop deleted team BYOK model name from team.models (#29820)
Deleting a team-scoped BYOK model left its public name in team.models, so /models
with a team key kept listing the now-deleted "ghost" model. delete_model stripped
team.models using only litellm_modeltable alias lookups, but models added via
/model/new with a team_id never create an alias row; their public name lives only
in team.models and model_info.team_public_model_name, so it was never removed. The
team cache was also left stale because the delete path skipped _refresh_cached_team.

The cleanup now keys off team_public_model_name (falling back to alias keys), runs
after the deployment row is deleted, and strips a public name only when no remaining
team deployment still backs it, so a load-balanced replica is not revoked and
concurrent deletes cannot leave a ghost. The updated team row is refreshed in cache
so /models reflects the change immediately
2026-06-05 18:35:50 -07:00
milan-berri
b7f47a3b52
fix(jwt): use resolved DB user_id for spend on legacy email match (#29217)
* fix(jwt): attribute spend to resolved DB user_id on email/sso fuzzy match

When user_id_upsert is enabled with JWT auth and a pre-migration user row
exists whose user_email matches the JWT email but whose user_id is a UUID,
get_user_object resolves the legacy row via fuzzy lookup, but the JWT-claim
user_id (the email) still flowed into team-membership lookup,
JWTAuthBuilderResult.user_id, UserAPIKeyAuth and the spend tables. Spend was
orphaned under a phantom email id; /user/info and the Usage page showed $0
for the legacy user (GH #26789).

Treat the resolved user_object as the source of truth: add
_canonical_user_id_from_db, rebind inside get_objects, and return
effective_user_id so auth_builder unpacks it without adding statements.

Fixes #26789

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(jwt): log user_id rebind at DEBUG to avoid email PII in INFO streams

Greptile review on #29217: rebinding often logs JWT email claims at INFO.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(jwt): update passthrough allowlist mock for 5-tuple get_objects

Staging #29256 added a test that still mocked get_objects with a
4-tuple; our PR expanded the return to 5 values (effective_user_id).

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-05 15:59:41 -07:00
Sameer Kankute
95e3d136e1
test(google): add google-genai SDK proxy integration tests (#29781)
* test(google): add google-genai SDK proxy integration tests for Gemini and Vertex

Pin google-genai in the CI dependency group and exercise streaming/non-streaming
generate_content through the LiteLLM proxy in the existing unified_google_tests suite.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(test): address Greptile review for google-genai proxy SDK tests

Restore GOOGLE_APPLICATION_CREDENTIALS after the module proxy fixture tears down,
initialize temp-file tracking on the proxy SDK base class, and skip litellm reload
for proxy_genai_sdk tests so the module-scoped proxy server stays consistent.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(test): only load Vertex credentials when keys exist for proxy SDK tests

Avoid writing empty GOOGLE_APPLICATION_CREDENTIALS temp files so Vertex tests
skip cleanly without credentials, use a session-scoped proxy fixture, and clean up
per-test credential temp files.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(test): scope google-genai pin to unified_google_tests only

Remove google-genai from the ci dependency group and pin it in
tests/unified_google_tests/requirements.txt for local test installs.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(google): tie litellm reload skip to proxy fixture dependency

Replace the name-based reload guard with a check on whether the test
requests the google_genai_proxy_url fixture, so the skip stays correct
if the proxy SDK tests are renamed.

* fix(test): stop DatabaseURLSettings tests leaking DATABASE_URL into os.environ

The autouse env scrubber relied on monkeypatch.delenv, but apply_to_env
writes DATABASE_URL straight into os.environ, which monkeypatch never
tracks and therefore never undoes. The synthesized writer.example.com URL
leaked past the last test in this module and into proxy-infra tests that
read DATABASE_URL to decide whether to hit a real database, e.g.
test_deprecated_key_grace_period_cache_hit_path, turning an intended skip
into a ConnectError. Snapshot and restore the managed vars directly so the
original environment is reinstated regardless of how it was mutated.

* test(google): drop redundant per-test vertex credential setup

The session-scoped google_genai_proxy_url fixture already configures
GOOGLE_APPLICATION_CREDENTIALS before the proxy starts, and
_require_proxy_sdk skips when credentials are missing, so the per-test
_setup_vertex_credentials_if_needed helper and its temp-file tracking
never did any work. Remove it to keep the ABC self-contained.

* test(google): declare model_config contract on proxy SDK ABC

_skip_reason_if_credentials_missing reads self.model_config to pick the
provider, but that property was only declared on the sibling
BaseGoogleGenAITest. Make the dependency explicit by adding model_config
as an abstract property on BaseGoogleGenAIProxySDKTest so the ABC is
self-contained and a standalone subclass fails fast instead of hitting an
AttributeError.

* test(google): narrow streaming error catch to Exception

Catching BaseException in the streaming assertion swallowed
KeyboardInterrupt and SystemExit, turning a Ctrl-C into a test failure
message instead of letting pytest interrupt cleanly. Only genuine runtime
errors should be recorded as stream failures, so catch Exception.

* test(google): initialize proxy on the same loop that serves it

The proxy was initialized via asyncio.run() on the main thread, which
creates and tears down a throwaway event loop, while requests were served
on a separate loop in the worker thread. Any asyncio primitive bound to
the init loop would be unusable once serving started. Run initialize()
on the worker thread's loop right before server.serve() so setup and
request handling share a single event loop.

* test(google): drop redundant google-genai requirements pin

google-genai>=1.37.0,<2.0 is already declared in the proxy-runtime extra,
which the google_generate_content_endpoint_testing CI job installs via
uv sync --all-extras. The standalone tests/unified_google_tests/requirements.txt
duplicated that pin with a narrower ==1.37.0 specifier and was never
installed by CI, so it added a second source of truth without changing
what gets installed. Drop it and rely on the proxy-runtime extra.

* chore: revert incidental uv.lock exclude-newer bump

The google-genai ci pin was added and then dropped (it is already
provided by the proxy-runtime group), but each uv lock recomputed the
relative exclude-newer span, leaving only a timestamp bump in uv.lock.
Restore it to the base value so this test-only PR carries no lockfile
change.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-06-05 21:05:32 +00:00
Sameer Kankute
d671a09c20
Litellm oss staging 050626 (#29774)
* Mark xAI models retiring on 2026-05-15 (#28788)

Per https://docs.x.ai/developers/migration/may-15-retirement, xAI is
retiring the following slugs on 2026-05-15 (auto-redirect to grok-4.3
with various reasoning efforts; callers continuing to use the old slugs
will be billed at grok-4.3 pricing):

  grok-4-1-fast-reasoning{,-latest}      -> grok-4.3 (low effort)
  grok-4-1-fast-non-reasoning{,-latest}  -> grok-4.3 (none)
  grok-4-fast-reasoning                  -> grok-4.3 (low effort)
  grok-4-fast-non-reasoning              -> grok-4.3 (none)
  grok-4-0709                            -> grok-4.3 (low effort)
  grok-code-fast-1{,-0825}               -> grok-build-0.1
  grok-3                                 -> grok-4.3 (none)

Only the direct xai/ slugs are tagged; third-party hosts (azure_ai,
oci, vercel_ai_gateway, perplexity/xai) run their own schedules. The
grok-3 retirement list explicitly names only the base grok-3 slug — the
-mini / -fast / -beta / -latest variants are not listed, so they remain
untouched.

* feat(moonshot): advertise json_schema response support on live models (#29683)

litellm.responses() already routes Moonshot through the responses->chat-completions
bridge, and Moonshot honors response_format json_schema on chat completions. The
cost-map entries left supports_response_schema unset, so discovery layers that gate
on that flag dropped Moonshot from structured-output / responses listings even though
the capability works end to end.

Set supports_response_schema on the nine models currently live on api.moonshot.ai:
kimi-k2.5, kimi-k2.6, the moonshot-v1 8k/32k/128k text and vision-preview variants,
and moonshot-v1-auto. Verified against the live API that each honors json_schema and
that litellm.responses() returns schema-valid structured output through the bridge.

* chore(moonshot): mark models retired from api.moonshot.ai as deprecated (#29685)

Thirteen Moonshot/Kimi models in the cost map no longer resolve on
api.moonshot.ai (all return 404). Stamp each with its deprecation_date from
platform.kimi.ai/docs/models rather than deleting the entries, so historical
cost calculation keeps resolving the names while tooling can surface the
retirement.

Dates: kimi-thinking-preview 2025-11-11; kimi-latest and its 8k/32k/128k context
variants 2026-01-28; the kimi-k2 preview/turbo/thinking series 2026-05-25; the
moonshot-v1 -0430 snapshots use their own 2024-04-30 snapshot date (Moonshot
publishes no discontinuation date for them).

* fix(moonshot): drop temperature for reasoning models (kimi-k2.5/k2.6) (#29687)

Kimi reasoning models reject every temperature except 1; a request with
temperature=0.2 returns "invalid temperature: only 1 is allowed for this model".
litellm only clamped temperature into [0.3, 1], so any value below 1 still 400'd.

Drop the temperature param entirely for reasoning models (gated on
supports_reasoning, the same signal transform_request already uses) so the model
default is used; the non-reasoning moonshot-v1 models keep the existing clamp.

Co-authored-by: Sameer Kankute <sameer@berri.ai>

* feat(mcp): add per-server timeout configuration (#29672)

* feat(mcp): add per-server timeout configuration

* fix(mcp): address timeout field review comments

- use is not None guard instead of or for 0.0 edge case
- copy timeout in both LiteLLM_MCPServerTable constructions (health check path + _build_mcp_server_table)
- add timeout Float? column to all three schema.prisma files
- extend round-trip test to cover _build_mcp_server_table direction
- add test for zero timeout not treated as falsy

* fix(mcp): forward timeout in _build_temporary_mcp_server_record

* fix(mcp): return 504 instead of 500 when per-server timeout fires

* test(mcp): add 504 timeout regression test; fix black formatting

* Add jp. Bedrock cross-region inference profile for claude-opus-4-7 (#28567)

* fix(thinking): handle None thinking param in is_thinking_enabled (#28598)

Squash-merged by litellm-agent from Terrajlz's PR.

* feat(helm): support tpl rendering in podAnnotations (#28609)

Squash-merged by litellm-agent from devauxbr's PR.

* Forward custom_llm_provider through the Responses API bridge (Fixes #28505) (#28575)

* Forward custom_llm_provider through the Responses API bridge (Fixes #28505)

When a Chat Completions request to a GPT-5.4+ model contains both
`tools` and `reasoning_effort`, `completion()` auto-routes through
`responses_api_bridge`. The bridge handler called
`litellm.responses()` / `litellm.aresponses()` without forwarding the
already-resolved `custom_llm_provider`, so the downstream call
re-invoked `get_llm_provider()` with `custom_llm_provider=None` and
stripped a second provider prefix from a `provider/provider/model`
deployment string.

For a deployment configured as `openai/openai/openai/gpt-5.5`,
the bridge flow sent `openai/gpt-5.5` to the upstream API instead of
the correct `openai/openai/gpt-5.5`. Upstream APIs that enforce
model-name allow-lists rejected this as `key_model_access_denied`.

Fix: pass the locally-resolved `custom_llm_provider` into both the
sync `responses()` and async `aresponses()` calls so the downstream
`_resolve_model_provider_for_responses` sees an explicit provider
and skips the second prefix-strip.

New regression test
`tests/test_litellm/completion_extras/test_responses_bridge_provider_propagation.py`
pins both call sites: each must forward `custom_llm_provider`.

* fix(28505): set custom_llm_provider on request_data instead of as duplicate kwarg

Greptile flagged that the previous patch passed custom_llm_provider as an
explicit kwarg to responses()/aresponses() while request_data already
carried it via the spread of sanitized_litellm_params, which would raise
TypeError: got multiple values for keyword argument on every real bridge
call.

Switches to assigning request_data['custom_llm_provider'] before the call
so the resolved provider wins over whatever sanitized_litellm_params spread
in, without duplicating the kwarg.

Updates the regression test to seed request_data with a sentinel
custom_llm_provider so it actually exercises the overwrite path (the
previous test mocked transform_request with a minimal dict and never hit
the conflict).

* chore: trigger shin-agent re-eval on retargeted staging base

* chore: trigger shin-agent re-eval against updated Greptile state

* Add jp. Bedrock cross-region inference profile for claude-opus-4-7

AWS Bedrock documents jp.anthropic.claude-opus-4-7 alongside the
existing us./eu./au./global. profiles for Claude Opus 4.7
(ap-northeast-1 Tokyo / ap-northeast-3 Osaka), but the entry is
missing from model_prices_and_context_window.json. Tokyo-region
users currently get an "unknown model" error when routing through
the JP geo profile.

Adds the entry to both the canonical file and the bundled backup,
mirroring the recent pattern for sonnet-4-6 (#27831). Pricing matches
the other regional profiles (10% premium over base/global).

Regression test pins all six documented profiles (base, global, us, eu,
au, jp) and asserts pricing parity between jp. and au. variants.

Source: https://docs.aws.amazon.com/bedrock/latest/userguide/model-card-anthropic-claude-opus-4-7.html

---------

Co-authored-by: Terrajlz <info@jouleselectrictech.com>
Co-authored-by: Bruno Devaux <devaux.br@gmail.com>
Co-authored-by: Sameer Kankute <sameer@berri.ai>

* feat(soniox): add soniox audio transcription integration (#29508)

* feat(openmeter): add OPENMETER_TRUST_REQUEST_USER to prevent forged attribution (#29650)

The OpenMeter callback resolves the CloudEvent subject from kwargs["user"]
first, then falls back to the key-bound user_api_key_user_id. For
multi-tenant proxy deployments, a client can set `"user": "..."` in the
request body and cause their usage to be attributed to that arbitrary
string — a billing-attribution forgery risk.

Adds OPENMETER_TRUST_REQUEST_USER env var (default "true" for backward
compatibility). When set to "false", the request-supplied `user` field is
ignored and the subject is resolved solely from user_api_key_user_id.

Matches the existing env-var-driven config pattern in this file
(OPENMETER_API_KEY, OPENMETER_API_ENDPOINT, OPENMETER_EVENT_TYPE).

* feat(search): add you_com as a search provider (#28370)

* feat(search): add you_com as a search provider

Registers You.com Search API as a first-class `search_provider` in the
`search_tools` registry, alongside Tavily, Exa, Perplexity, etc.

- New adapter: litellm/llms/you_com/search/transformation.py
  - POSTs to https://ydc-index.io/v1/search
  - Auth: X-API-Key from YOUCOM_API_KEY (or explicit api_key)
  - Maps Perplexity unified spec: max_results -> count,
    search_domain_filter -> include_domains, country -> country
  - Flattens results.web + results.news into a single SearchResult list;
    snippet prefers snippets[0], falls back to description; page_age -> date
- Registry: SearchProviders.YOU_COM in litellm/types/utils.py and wired
  into ProviderConfigManager.get_provider_search_config()
- Pricing entry: model_prices_and_context_window.json (placeholder $0.0;
  happy to adjust to maintainers' preferred public number)
- Docs: example router config snippet and example proxy yaml updated
- Tests: tests/search_tests/test_you_com_search.py - 5 mocked tests
  (payload shape, domain filter mapping, snippet fallback, news flattening,
  missing-api-key error)

Refs upstream expansion signal: #15942

* review fixups: normalize api_base, lowercase country, scope env-var to test

Addresses Greptile inline review comments on #28370:

- get_complete_url: strip trailing slashes from api_base *before* the
  endswith("/v1/search") check, so a custom base like ".../v1/search/"
  doesn't become ".../v1/search/v1/search".
- transform_search_request: .lower() country before sending, matching
  Tavily's convention so callers using the unified spec form ("US") get
  consistent behavior across providers.
- Tests: replace direct os.environ writes with an autouse monkeypatch
  fixture so YOUCOM_API_KEY is set per-test and removed afterwards.
  The missing-key test now uses monkeypatch.delenv. New test asserts the
  trailing-slash normalization above.

Reverts the ARCHITECTURE.md / example yaml edits per the reviewer note
that documentation changes belong in the litellm-docs repo.

* support keyless free tier (api.you.com/v1/agents/search) as default

You.com offers an IP-throttled keyless endpoint that returns the same
response shape as the keyed one (~100 queries/day, no signup). This is a
significant onboarding lever - mirrors the keyless DuckDuckGo/SearXNG
providers already in the search_tools registry.

Behavior:
- YOUCOM_API_KEY set        -> keyed:  POST https://ydc-index.io/v1/search
                                       (X-API-Key header)
- no key                    -> free:   POST https://api.you.com/v1/agents/search
                                       (no auth)
- YOUCOM_API_BASE override  -> honored as-is

Tests:
- New: test_you_com_search_keyless_free_tier - asserts URL + absence of
  X-API-Key when no key is configured.
- New: test_you_com_search_validate_environment_keyless - asserts the
  config no longer raises when the key is absent.
- Removed: test_you_com_search_raises_without_api_key (the precondition
  no longer holds).
- Existing payload/domain-filter/etc tests still cover keyed mode via
  the autouse YOUCOM_API_KEY fixture.

Verified both endpoints accept POST + return identical JSON shape:
  results.web[] / results.news[] with title, url, snippets, description,
  page_age.

* register you_com in provider_endpoints_support.json

Adding `litellm/llms/you_com/` requires a corresponding entry in
provider_endpoints_support.json or the
code-quality/check_provider_folders_documented CI check fails.

Follows the compact tavily/serper pattern - endpoints: { search: true }.
Local run of the check now reports "All 114 provider folders are documented".

* move tests under tests/test_litellm/llms/ so CI exercises them

The litellm CI workflows scope unit tests to `tests/test_litellm/...`
(see test-unit-llm-providers.yml: `tests/test_litellm/llms` path), so
tests living under `tests/search_tests/` are never run in CI - which is
why codecov reports 0% patch coverage for the new adapter even though
the unit tests exist and pass locally.

Move test_you_com_search.py into `tests/test_litellm/llms/you_com/` so
the test-unit-llm-providers job picks it up. 7/7 tests still pass at
the new location.

(Sibling search-only providers - tavily, exa_ai, brave, etc. - still
live only in `tests/search_tests/` and would benefit from the same
move, but that is out of scope for this PR.)

* fix(you_com): pin Accept-Encoding: identity to dodge keyless gzip bug

The keyless free-tier endpoint (api.you.com/v1/agents/search) advertises
Content-Encoding: gzip but returns a body that httpx's decoder rejects
with `zlib.error: Error -3 while decompressing data: incorrect header
check`, surfacing as litellm.APIConnectionError in user code. curl works
because it doesn't request compression by default.

Pin Accept-Encoding: identity in validate_environment so the upstream
server skips compression entirely. Harmless on the keyed endpoint
(ydc-index.io/v1/search) which negotiates content-encoding correctly.

The header uses setdefault so a caller-supplied Accept-Encoding still
takes precedence. (Server-side bug has been flagged to the You.com team
separately - once fixed there, this workaround can be removed.)

New unit test: test_you_com_search_pins_identity_accept_encoding.

---------

Co-authored-by: Sameer Kankute <sameer@berri.ai>

* docs: fix README typo (#29419)

Correct clear spelling mistakes in documentation without changing behavior.

Confidence: high
Scope-risk: narrow
Tested: git diff --check; uvx codespell on changed files
Not-tested: Full docs build not run; text-only changes

* Fix(langfuse): pass httpx_client to Langfuse in langfuse_prompt_management to respect SSL_VERIFY (#29480)

* fix(langfuse): pass ssl_verify to Langfuse httpx client

* fix_langfuse_

* add unit tests

* addressed comments

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>

* feat(models): add minimax/MiniMax-M3 to model cost map (#29412)

Add MiniMax's new flagship MiniMax-M3 to the native minimax provider:
512K context, 128K max output, native multimodal (supports_vision),
reasoning, prompt caching. Pricing (USD/M tokens): input 0.6 / output
2.4 / cache read 0.12. M3 has no active prompt-cache-write tier, so
cache_creation_input_token_cost is omitted.

Updated both the root model_prices_and_context_window.json (remote
source) and the bundled litellm/model_prices_and_context_window_backup.json
(local fallback), keeping them in sync.

* fix(logging): handle ResponseCompletedEvent in anthropic_messages streaming spend log (#29394)

* fix(logging): handle ResponseCompletedEvent in anthropic_messages streaming spend log

* fix(logging): extend terminal event handling to ResponseIncompleteEvent and ResponseFailedEvent; fix return type annotation

* feat(provider): Add Neosantara provider as OpenAI Compatible (#29646)

* Add Neosantara provider

* Register Neosantara provider enum

* Address Neosantara provider review feedback

* Add Neosantara packaged endpoint support

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>

* fix: address greptile and veria review feedback

- langfuse: guard httpx_client injection behind version check (>= 2.7.3)
- soniox: propagate audio_transcription_duration in _hidden_params for spend tracking
- soniox: give SONIOX_API_BASE env var priority over caller-supplied api_base
- mcp: replace CancelledError catch with asyncio.wait_for + TimeoutError

* chore(mcp): add migration for per-server timeout column

* fix(test): add tool_use_system_prompt_tokens to model prices schema validator

* fix: mcp timeout test uses real asyncio.wait_for timeout; you_com get_complete_url respects resolved api_key

* fix: forward resolved api_key into you_com endpoint selection and apply timeout to soniox polling GETs

The search flow resolves api_key in validate_environment but never passed it
into get_complete_url, so a programmatic api_key (with no YOUCOM_API_KEY in the
env) set the X-API-Key header yet still selected the keyless free-tier endpoint.
Forward api_key through both the search entrypoint and the http handler so the
keyed endpoint is chosen.

HTTPHandler.get/AsyncHTTPHandler.get had no timeout parameter, so the Soniox
poll and transcript-fetch GETs silently used the client global default instead
of the caller timeout. Add a per-request timeout to get() and forward the
configured timeout from the Soniox handler.

* fix(soniox): price stt-async-v4 per second so transcriptions are billed

The handler stores audio_transcription_duration in _hidden_params, but the
model carried only token cost fields and the response has no token usage, so
the transcription cost path fell through to cost_per_second and returned $0.
An authenticated caller could transcribe Soniox audio without decrementing
their budget. Switch the entry to output_cost_per_second at Soniox's published
$0.10/hour async rate so the stored duration produces a real charge.

* fix(langfuse): use a dedicated httpx client for the SDK injection

The httpx_client handed to the Langfuse SDK came from _get_httpx_client(),
which returns LiteLLM's globally cached HTTPHandler. If Langfuse closed that
client on teardown it would invalidate the shared client used by every other
LiteLLM HTTP call. Build a dedicated httpx.Client instead, still resolving SSL
verification and client certificate from LiteLLM's configuration.

* fix(soniox): prefer caller-supplied api_base over SONIOX_API_BASE env var

* fix(cohere): support max_completion_tokens on cohere v2 chat (default route) (#29779)

* fix(cohere): support max_completion_tokens on cohere v2 chat

The default cohere_chat route resolves to CohereV2ChatConfig, which did not
list or map max_completion_tokens, so get_optional_params raised
UnsupportedParamsError for the standard OpenAI parameter (the modern
replacement for the deprecated max_tokens). The v1 config already maps it to
cohere's max_tokens; mirror that in v2 and add v2 regression tests.

* fix(cohere): make max_completion_tokens take precedence over max_tokens on v2

When both max_tokens and max_completion_tokens are supplied, prefer
max_completion_tokens explicitly rather than relying on dict iteration order,
and cover both orderings with a regression test.

---------

Co-authored-by: Daniel Yudelevich <4537920+yudelevi@users.noreply.github.com>
Co-authored-by: hectorc98 <hector.chamorroalvarez@adyen.com>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>
Co-authored-by: Terrajlz <info@jouleselectrictech.com>
Co-authored-by: Bruno Devaux <devaux.br@gmail.com>
Co-authored-by: Dan Lemon <dan@danlemon.com>
Co-authored-by: Saswat <saswatds@users.noreply.github.com>
Co-authored-by: Brian Sparker <brainsparker@users.noreply.github.com>
Co-authored-by: Zhao73 <156770117+Zhao73@users.noreply.github.com>
Co-authored-by: Urain Ahmad Shah <60431964+urainshah@users.noreply.github.com>
Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: kape <168134658+kapelame@users.noreply.github.com>
Co-authored-by: danisalvaa <159898202+danisalvaa@users.noreply.github.com>
Co-authored-by: Just R <remixingmagelang@gmail.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>
Co-authored-by: abhay23-AI <abhaytrivedi22@gmail.com>
2026-06-05 13:51:51 -07:00
Mateo Wang
84247d954d
test(ci): record/replay OpenAI image gen so the spend E2E isn't outage-bound (#29787)
* test(ci): record/replay OpenAI image gen so the spend E2E isn't outage-bound

The dockerized spend test test_key_info_spend_values_image_generation curls
the proxy for a gpt-image-1 image, which wildcard-routes to real api.openai.com
on every commit; an OpenAI outage then reddens unrelated PRs and each run pays
for an image.

Add an in-repo record/replay reverse proxy (tests/_openai_record_replay_proxy.py)
that sits between the proxy and OpenAI. The first run, and the first after the
recording lapses, records live; subsequent runs replay from the shared Redis
cassette store. The proxy keeps its real separate-process HTTP topology; only
the image model's api_base is pointed at the recorder in CI via
IMAGE_GEN_RECORDER_BASE_URL, which is unset elsewhere so it falls back to
api.openai.com.

Recordings lapse 24h after write and are never refreshed on read, matching the
VCR persister contract, so provider drift is still caught. Replayed responses
drop upstream framing/server headers (content-length, transfer-encoding,
content-encoding, date, server) so the re-serving layer recomputes them,
honoring the Bedrock content-length lesson.

* test(ci): close recorder http client on app shutdown

Add a Starlette lifespan that closes the self-created httpx.AsyncClient on
teardown, and leave caller-injected clients untouched so reuse across
create_app calls is not broken. Covers the unclosed-client ResourceWarning
raised in review.
2026-06-05 10:27:23 -07:00
Mateo Wang
939cff0455
test(vcr): stop refreshing cassette TTL on read so cassettes lapse after 24h (#29784)
The Redis cassette persister slid the 24h TTL forward on every successful
read, so any cassette replayed at least once per day never expired. With CI
running more than once a day that means a recorded response is replayed
forever and the suite never re-hits the provider, so a changed request or
response contract goes undetected indefinitely.

Drop the refresh-on-read. The TTL now counts down from the last write, so a
cassette lapses 24h after it was recorded and the next run past that point
re-records live and catches provider drift. Per-commit runs in between still
replay from cache; only the one boundary-crossing run goes live.
2026-06-05 10:22:41 -07:00
Sameer Kankute
074455c138
fix(auth): expand all-team-models sentinel in can_key_call_model for batch validation (#29746)
* fix(auth): expand all-team-models sentinel in can_key_call_model

Keys with models=["all-team-models"] were denied during batch JSONL
model validation because can_key_call_model matched the literal string
against the model name. Add _resolve_key_models_for_auth_check to
expand the sentinel to team_models before the check, consistent with
get_key_models in model_checks.py and the completion-route bypass.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(auth): document empty team_models unrestricted access behavior; add regression test

Adds a docstring note to _resolve_key_models_for_auth_check explaining that
when team_models is empty, all-team-models resolves to [] which is treated as
unrestricted access (consistent with get_key_models behavior on other auth
paths). Adds a test to lock in this behavior.

* fix(auth): deny all-team-models access when key has no team_id

A key configured with models=["all-team-models"] but no team_id could
previously resolve to an empty allowlist, which _check_model_access_helper
treats as unrestricted access. Now the sentinel is only expanded when
team_id is set; otherwise the unresolved sentinel stays in the model list
and causes a deny (no real model name matches it). Same fix applied to
get_key_models in model_checks.py for consistency across batch and
non-batch auth paths.

* style: black format model_checks.py

* Fix batch all-team-models auth

* style: black format batch_rate_limiter.py

* fix(test): add tool_use_system_prompt_tokens to model prices schema validator

* fix(batch): catch get_team_object errors to avoid 404 escaping batch auth

* fix(batch): apply per-member model scope check after team auth in batch validation

* Fail closed on batch team auth fetch errors

* test(batch): cover team_object grant and member-scope denial in batch auth

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>
2026-06-05 09:04:45 -07:00
Sameer Kankute
89f177b7b6
fix(galileo): use ingest traces API and standard logging payload (#29651)
* fix(galileo): use ingest traces API and standard logging payload

Switch hosted Galileo logging to /ingest/traces with nested trace/span payloads, read metrics from standard_logging_object, and include cost and total tokens on trace metrics.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(galileo): route username/password auth to v2 traces ingest

Hosted Galileo no longer serves /observe/ingest; JWT login should post the same trace payload to /v2/projects/{project_id}/traces.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(galileo): address Greptile review on logging and timestamps

Use debug-level logs for per-request Galileo callback messages and fall back to start_time/end_time when standard_logging_object omits startTime/endTime.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(galileo): add Galileo to proxy UI callback configuration

Expose Galileo in the admin callback selector and config APIs so credentials can be configured through the dashboard instead of YAML only.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(galileo): align response type logging with Langfuse

Mirror Langfuse input/output handling for rerank, speech, transcription,
realtime, pass-through, and other response types so Galileo ingest no longer
skips supported call types.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(galileo): redact trace payload in debug logs and format with black

Avoid logging prompts and model responses in flush debug output while
keeping structural metadata for troubleshooting.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(galileo): stop logging full trace payload in debug output

Log only flush URL and trace count so prompts and model responses are not
written to application logs when debug logging is enabled.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix Galileo token totals and prompt messages

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-05 09:03:17 -07:00
Mateo Wang
ffd0e9fa7f
[internal copy of #27491] fix(realtime): Fix Realtime Audio Token Cost Tracking (#29722)
* Normalize Realtime usage dict keys before ResponseAPIUsage transform

* Test usage transform for Realtime versus tokens_details keys

* Avoid usage_input dict in-place

* Fix audio cost calculation

* fix(responses): forward output audio_tokens into completion usage details

Pass audio_tokens from output_tokens_details into CompletionTokensDetailsWrapper
so cost can use output_cost_per_audio_token. Support dict output details like
prompt path. Extend tests for Realtime and mixed completion audio.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix audio token usage formatting

* style: Black-format Realtime usage and completion usage merge

Resolve combine_usage_objects and responses/utils wrapping for CI black --check.
Restore model_fields comments above completion_tokens_details merge loop.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add test to cover combined usage objects

* Fix merge conflict with test cases

Removed unnecessary import statement and cleaned up assertions in test.

* fix(cost_calculator): remove dead None guard in completion_tokens_details combiner

---------

Co-authored-by: Liam McDonald <lmcdonald@godaddy.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-05 18:53:17 +05:30
michelligabriele
3f79222350
fix(proxy): persist oauth2_flow on MCP server registration (#29690) 2026-06-05 18:52:52 +05:30
Mateo Wang
1c741b91c0
fix(anthropic): route Claude Opus 4.8 through adaptive thinking (#29702)
* fix(anthropic): route Claude Opus 4.8 through adaptive thinking

Opus 4.8 uses the same adaptive thinking contract as 4.6/4.7
(thinking.type=adaptive plus output_config.effort), but
_is_adaptive_thinking_model only recognized 4.6/4.7 by name and otherwise
leaned on the supports_adaptive_thinking cost-map flag. The Bedrock,
Vertex, and Azure 4.8 entries don't carry that flag, so a
bedrock/us.anthropic.claude-opus-4-8 request fell back to the legacy
thinking.type=enabled shape and Bedrock rejected it with "thinking.type.enabled
is not supported for this model".

Add _is_claude_4_8_model and wire it in next to the existing 4.6/4.7
matchers in the adaptive-thinking detection, the effort=max gate, and the
supported-params check, so every provider path treats 4.8 as adaptive
regardless of whether its cost-map entry advertises the flag.

* refactor(anthropic): drive Opus 4.8 adaptive thinking from the cost map

Replace the _is_claude_4_8_model name matcher with cost-map data. Add
supports_adaptive_thinking to every Opus 4.8 provider variant (Bedrock
regional/global, Vertex, Azure) in both the root and bundled cost maps, and
move the prefix-resolving capability lookup (_supports_model_capability) down
to AnthropicModelInfo so _is_adaptive_thinking_model reads the flag through the
bedrock/invoke/, bedrock/, and vertex_ai/ prefixes. The 4.6/4.7 name checks
stay as a fallback since their provider entries don't carry the flag yet.

A pure data fix is not enough on its own: _supports_factory doesn't strip the
us.anthropic./invoke/ prefixes, so bedrock/invoke/us.anthropic.claude-opus-4-8
would still miss the flag without the resolver change.

Add a cost-map guardrail test asserting every claude-opus-4-8 variant carries
the flag, so a future variant added without it fails CI instead of silently
sending the legacy thinking.type=enabled shape that the provider rejects.
2026-06-05 16:19:01 +05:30
Mateo Wang
778a7f752d
Support OAuth M2M for Databricks Apps A2A agents (#29586)
* Add OAuth M2M support for A2A agents targeting Databricks Apps

Databricks App endpoints reject static bearer tokens and require a
short-lived OAuth token minted via the workspace OIDC token endpoint.
A2A agents could previously only authenticate outbound with static_headers
or client header passthrough, so Databricks App agents could not be
registered.

Agents configured with a databricks_oauth block in litellm_params now mint
and cache a client_credentials token and attach it as the outbound
Authorization header on both message/send and message/stream calls,
overriding any statically configured Authorization.

* Add tests covering Databricks App OAuth token error paths

Cover the HTTP status error, transport error, non-object JSON body, and
invalid expires_in fallback branches in the token cache so the failure
handling is locked in by regression tests.

* Harden Databricks App OAuth token cache

Cap the cache TTL at the token's own lifetime so a token whose validity is
shorter than the refresh buffer is never cached and served stale; include a
digest of client_secret in the cache key so a rotated secret mints a fresh
token instead of reusing the old one; and prune the per-key lock when its
cached token is evicted so the lock map stays bounded by the live key set.

* Clear per-key locks on Databricks OAuth cache flush

* fix(a2a/databricks): mint OAuth token via Basic auth header, not unsupported auth= kwarg

litellm's AsyncHTTPHandler.post (what get_async_httpx_client returns) has no
auth parameter, so minting a Databricks App OAuth token raised
"AsyncHTTPHandler.post() got an unexpected keyword argument 'auth'" before any
network call ever left the proxy, breaking the feature end to end. The handler
also calls raise_for_status() internally and re-raises a MaskedHTTPStatusError
(a subclass of httpx.HTTPStatusError), so the explicit raise_for_status() after
post() was dead code.

Build the HTTP Basic Authorization header by hand and pass it via headers, which
is what the Databricks workspace OIDC token endpoint documents for client
authentication. The token-cache tests now model the real handler contract with
create_autospec so the rejected auth= signature is enforced; the previous mocks
accepted any kwargs and silently hid the bug.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* Prune Databricks OAuth lock on the short-lived-token path

When expires_in is below the refresh buffer the token is intentionally
not cached, so _remove_key never runs for that key and the per-key lock
created by _get_lock leaked permanently. Drop the lock in that branch so
_locks stays bounded by the live key set, and assert the cleanup in the
short-lived-token test

* Gate A2A Databricks OAuth on the databricks_oauth block at the call site

Make the gating explicit where the header is applied so it is clear that only
agents configured with a databricks_oauth block enter the OAuth path; every
other agent is left untouched. Add a regression test asserting a non-Databricks
agent never invokes the token resolver.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
2026-06-04 23:03:37 -07:00
Sameer Kankute
2b7c97bff6
fix(vertex/anthropic): handle namespace tools and strip client_metadata for codex compatibility (#29489)
* fix(vertex/anthropic): handle namespace tools and strip client_metadata for codex compatibility

* fix(anthropic): cast nested namespace tools to fix mypy error, skip nameless flat tools
2026-06-04 22:57:16 -07:00
Mateo Wang
df704d9016
fix(proxy/hooks): populate llm_provider on internal rate-limit errors (#27707)
* feat(proxy/hooks): add ProxyHTTPRateLimitError + provider resolver

Introduces a small helper layer used by every proxy-side rate-limit
hook so that the 429 they raise carries a populated llm_provider /
model — instead of an empty exception.llm_provider that downstream
loggers (Prometheus failure metric, observability callbacks) read as
'no provider attribution'.

ProxyHTTPRateLimitError inherits from both fastapi.HTTPException
(so the proxy server still renders it as a 429) and
litellm.exceptions.RateLimitError (so isinstance checks and
PrometheusLogger._get_exception_class_name pick up llm_provider).
We deliberately don't call RateLimitError.__init__ — it constructs
an httpx.Response we don't need and would just add failure surface;
attribute parity is what downstream consumers care about.

resolve_llm_provider_for_rate_limit() wraps litellm.get_llm_provider
defensively. Internal limiter hooks fire from async_pre_call_hook —
well before get_llm_provider runs anywhere else in the request
lifecycle — so we have to call it ourselves at raise time. If the
model is missing or unparseable (alias, router-only model) we fall
back to llm_provider='litellm_proxy' rather than letting a second
exception leak out and break the request path.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy/hooks): populate llm_provider on parallel-request 429s

Both v1 and v3 parallel-request limiters fired bare HTTPException(429)
from inside async_pre_call_hook. The downstream Prometheus failure
metric reads exception.llm_provider via _get_exception_class_name —
the empty value showed up as exception_class='HTTPException' and
left model_id='None' on the time series.

Threads requested_model through every raise site in:

* parallel_request_limiter.py:
  - check_key_in_limits (the per-key/per-model/per-user/per-team/
    per-customer over-limit path)
  - raise_rate_limit_error (zero-limit + global_max_parallel_requests
    paths) — now takes an optional requested_model kwarg
* parallel_request_limiter_v3.py:
  - _handle_rate_limit_error (the OVER_LIMIT translator), called
    from both the should_rate_limit pre-check and the TPM
    reservation path

Resolved via resolve_llm_provider_for_rate_limit so unknown / missing
models silently fall back to llm_provider='litellm_proxy' instead of
breaking the request path with a second exception.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy/hooks): populate llm_provider on dynamic-rate-limit 429s

Same plumbing change as the parallel limiters, applied to both
dynamic_rate_limiter (v1) and dynamic_rate_limiter_v3:

* v1: TPM-zero and RPM-zero paths in async_pre_call_hook now resolve
  data['model'] -> (model, llm_provider) once and pass it into both
  raises.
* v3: All three raise sites in _check_rate_limits — the
  model_saturation_check enforced raise, the priority_model
  enforced raise, and the fail-closed unknown-descriptor branch —
  now attribute the 429 to the actual provider.

Falls back to llm_provider='litellm_proxy' when the model can't be
resolved.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy/hooks): populate llm_provider on batch-rate-limit 429s

batch_rate_limiter._raise_rate_limit_error now takes a
requested_model kwarg threaded from data['model'] in
_check_and_increment_batch_counters. The batch-creation 429 is what
gets raised when the input file's tokens/requests count would push
the per-key TPM/RPM window over its limit.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy/hooks): populate llm_provider on budget/iterations 429s

Final batch of internal raise sites — the user/session-budget and
max-iterations hooks. Same pattern: resolve data['model'] once at
raise time, attach to ProxyHTTPRateLimitError so Prometheus and
observability callbacks can attribute the 429.

Hooks updated:
* max_budget_limiter (per-user max_budget exceeded)
* max_iterations_limiter (per-session agent iteration cap)
* max_budget_per_session_limiter (per-session dollar cap)

All three fall back to llm_provider='litellm_proxy' when data['model']
is missing or unparseable. Drops the now-unused HTTPException import
from each module.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(proxy/hooks): pin provider field on internal rate-limit 429s

Regression coverage for the 'provider field missing' bug across every
proxy-side rate-limit hook + the helper layer:

* ProxyHTTPRateLimitError class shape (HTTPException + RateLimitError,
  dict-detail stringification, None-provider normalization).
* resolve_llm_provider_for_rate_limit happy paths
  (gpt-4o-mini, anthropic/..., bedrock/...) plus all three fallback
  branches (None, '', unknown name) plus a 'get_llm_provider raises'
  case that asserts we swallow the secondary exception.
* For each limiter (parallel v1/v3, dynamic v1/v3, batch,
  max_budget, max_iterations, max_budget_per_session): assert the
  raised exception is a RateLimitError carrying the resolved
  model + llm_provider, and a sibling test that asserts the
  fallback path returns 'litellm_proxy' without leaking a second
  exception.
* Two PrometheusLogger._get_exception_class_name pins so the
  Prometheus failure metric label flips from 'HTTPException' to
  'Openai.ProxyHTTPRateLimitError' (or 'Litellm_proxy.*' on
  fallback) — that's what dashboards consume.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* perf(proxy/hooks): defer provider resolution to over-limit branches

* fix: use error_message in raise_rate_limit_error to avoid literal 'None' in detail

* Consolidate rate_limiter_utils imports in dynamic_rate_limiter

* fix(proxy): set num_retries/max_retries on ProxyHTTPRateLimitError

ProxyHTTPRateLimitError inherits from RateLimitError but did not call
RateLimitError.__init__, so num_retries/max_retries were never set.
When Starlette's HTTPException lacks __str__, MRO falls through to
RateLimitError.__str__, which unconditionally reads these attributes
and raises AttributeError during logging/traceback formatting.
Initialize them to None defensively.

* fix(mypy): silence base-class status_code conflict on ProxyHTTPRateLimitError

HTTPException declares 'status_code: int' while openai.RateLimitError
(via APIStatusError) declares 'status_code: Literal[429] = 429'. Mypy
flags the multi-base override as [misc] in CI lint. The runtime semantics
are fine (we set self.status_code in __init__), so silence the
class-level annotation conflict with a targeted ignore.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
2026-06-04 22:46:08 -07:00
Mateo Wang
812a2217ca
[internal copy of #29511] feat(guardrails): add sensitive data routing to on-premise models (#29531)
* feat(guardrails): add sensitive data routing to on-premise models

When a guardrail detects sensitive data, route to an on-premise model
instead of blocking or redacting. All subsequent requests in that
session continue routing to the same model (sticky routing).

New config options for guardrails:
- on_sensitive_data: 'block' (default) or 'route'
- sensitive_data_route_to_model: target model for rerouting
- sticky_session_routing: persist routing for session (default: true)

New exception SensitiveDataRouteException triggers rerouting when raised
by guardrails. The proxy catches it, stores the routing decision in
cache, and modifies the request's model field.

New hook _PROXY_SensitiveDataRoutingHandler checks incoming requests
against cached routing decisions and applies sticky routing.

https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK

* fix: black formatting for custom_guardrail.py

https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK

* test: improve test coverage for sensitive data routing feature

Add additional tests for:
- Cache key format and TTL constants
- Session ID extraction from multiple locations
- Custom guardrail initialization with routing config
- Exception string representation and custom messages
- Redis cache paths including fallback behavior
- Edge cases in pre-call hook

https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK

* fix: use correct GuardrailRaisedException parameters

Replace invalid 'source' parameter with 'guardrail_name' to match
the exception's actual signature.

https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK

* test: move sensitive data routing tests to hooks directory

Move test file to align with source code structure.

https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK

* fix(guardrails): honor sticky_session_routing flag and scope session routing per API key

Propagate sticky_session_routing through SensitiveDataRouteException so a
guardrail configured with sticky_session_routing=False reroutes only the
triggering request without persisting a session override. Scope the routing
cache key to the requesting API key so sessions from different tenants cannot
collide, and warn when sticky routing is requested but the hook is not
registered.

* refactor(guardrails): dedupe session-id extraction and drop redundant import

Extract the shared session-id lookup into get_session_id_from_request_data
so the sensitive-data routing hook and CustomGuardrail no longer keep two
identical copies of the logic. Remove the redundant local import of
GuardrailRaisedException in handle_sensitive_data_detection, and document
that detection_info is surfaced in request metadata and logs so it must not
carry raw sensitive values.

* fix(guardrails): guard None user_api_key_dict in sensitive data route handler

* fix(responses): send application/json Content-Type on responses DELETE

OpenAI's responses DELETE endpoint now rejects requests that arrive without
a Content-Type header, defaulting them to application/octet-stream and
returning 'Unsupported content type: application/octet-stream'. The delete
handler sent no body and therefore no Content-Type, so the request failed.
Declare application/json on the delete request, matching the OpenAI SDK.

* fix(guardrails): backfill in-memory cache after redis hit in sensitive data routing

When _get_routed_model resolves a routing override from Redis it now also
populates the local in-memory cache. Without the write-back, a non-writing
instance that only ever reads from Redis would lose the sticky routing
decision the moment Redis became unavailable, silently reverting sensitive
sessions to the default model.

* fix(guardrails): scope sticky sensitive-data routing to JWT principal

Keyless auth (JWT and similar) has no api_key, so every such caller shared
the "default" cache namespace. One authenticated user could reuse another
user's session_id, trip the guardrail, and silently force the other user's
subsequent requests onto the cached on-prem model for the TTL.

Resolve the routing tenant from the api_key when present, otherwise from a
stable principal built from the user/team/org identity, before reading or
writing the session route.

* fix(guardrails): require route target model when on_sensitive_data='route'

* fix(guardrails): mark user_api_key_dict Optional in sensitive-data route handler

* fix(guardrails): use remaining redis ttl for local backfill and str env default

* fix(guardrails): graceful block when routing configured but no session_id

handle_sensitive_data_detection promised to raise only SensitiveDataRouteException
or GuardrailRaisedException, but when routing was configured and the request had no
session_id it let a ValueError from raise_sensitive_data_route_exception propagate,
surfacing as an HTTP 500 instead of a block. Fall back to a graceful block in that
case so the documented contract holds.

* fix(guardrails): run remaining guardrails after sensitive-data reroute

Defer the SensitiveDataRouteException until every guardrail in the
pre-call loop has run, so downstream security guardrails are no longer
skipped when an earlier guardrail triggers routing. The first reroute
wins and a later guardrail that blocks still propagates.

Also normalize on_sensitive_data to lowercase like sibling on_* config
fields so case-insensitive values are accepted.

* fix(guardrails): classify sensitive-data reroute as guardrail intervention

* fix(guardrails): record sensitive-data reroute as prometheus intervention not error

* fix(guardrails): record service span for routing guardrail and move case-normalizer to base params

Drop the early continue so a guardrail that signals sensitive-data routing still
emits its PROXY_PRE_CALL service span like every other callback.

Move the lowercase normalizer onto BaseLitellmParams so on_sensitive_data is
normalized consistently when BaseLitellmParams is constructed directly, matching
the cross-field route->model validator that already lives on the base.
2026-06-04 22:22:28 -07:00
yuneng-jiang
56aa55b991
fix(proxy): stop team BYOK model name corruption on model edit (#29731)
* fix(proxy): stop team model name corruption on edit (#28382) (#29001)

Team-scoped ("Team-BYOK") models store an internal routing key
model_name_{team_id}_{uuid} in the model_name column and the user-facing
name in model_info.team_public_model_name. The internal name leaked into
/v1, /v2, and /model/info responses; the dashboard bound its edit form to
it, so any non-rename save (e.g. a TPM tweak) PATCHed the internal name
back. The update path then treated it as a rename, overwriting
team_public_model_name and rewriting the team's models[] ACL with the
mangled string -- breaking team key calls with team_model_access_denied.

Two-layer fix:

- Read path (root cause): add _translate_model_name_for_response and apply
  it in model_info_v2 and _get_proxy_model_info so /v1, /v2, and
  /model/info surface the public name for team-scoped rows. The DB column
  and router index keep the internal name as the routing key; this is a
  presentation-layer swap on a shallow copy (never mutates input).

- Write path (defense in depth): harden _get_public_model_name so a value
  matching the internal shape, or a no-op against the current DB column,
  is never treated as a rename -- for both the top-level model_name and an
  explicit model_info.team_public_model_name.

Tests: regression for the reported scenario, full branch coverage of
_get_public_model_name, two internal-shape guard cases, an end-to-end
PATCH through _update_team_model_in_db (asserts the team ACL is untouched),
and four response-translation cases. 60 passed (model management),
181 passed (proxy server).

* fix(ui): key Agent Builder agent selection on model_info.id (#29729)

* fix(ui): key Agent Builder agent selection on model_info.id

Once team-scoped BYOK models can share a public name (the backend now
returns the public name on /model/info instead of the internal routing
key), selecting agents by model_name collides. Key selection, create,
update and delete on the stable model_info.id instead, falling back to
model_name only for config-defined agents that have no id.

* fix(ui): add name-match fallback to post-create agent selection

If the just-created agent's id is not yet present in the re-fetched
list, try matching by name before falling back to the first agent.
Addresses greptile review on #29729.

---------

Co-authored-by: tushar8408 <32977767+tushar8408@users.noreply.github.com>
2026-06-04 20:40:40 -07:00
Shivam Rawat
3bd89f209e
Litellm jwt mapping virtualkeys (#28510)
* restore an explicit no-match policy

* fix(jwt): fix AUTO_REGISTER sentinel bypass, race condition, and inline import comment

- AUTO_REGISTER now evicts stale __NO_MAPPING__ sentinel instead of silently
  returning None when cached under a prior fallback_team_mapping config
- Race condition in _auto_register_jwt_mapping: catch P2002 unique-constraint
  violation on concurrent creates, fetch the winning mapping, proceed cleanly
- Added comment on inline generate_key_helper_fn import explaining the circular
  dependency (key_management_endpoints imports user_api_key_auth at line 51)
- 3 new tests: stale sentinel eviction, race condition winner fallback, and the
  existing auto_register happy path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(jwt): cache __NO_MAPPING__ sentinel before raising 403 in REJECT mode

REJECT mode was raising HTTPException immediately on a DB miss without writing
the __NO_MAPPING__ sentinel, causing every subsequent rejected request to
re-query the DB. Write the sentinel first so repeated rejections are served
from cache within virtual_key_mapping_cache_ttl.

Adds test asserting DB is not hit on the second reject after a cache-warm miss.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(jwt): enforce no-match policy when prisma_client is None

The early `if prisma_client is None: return None` guard ran before the
no-match policy check, silently bypassing REJECT and AUTO_REGISTER — every
JWT client fell through to team auth regardless of configuration.

Fix: treat prisma_client=None as a definitive DB miss and fall through to the
same policy block as a real miss. REJECT now raises 403, AUTO_REGISTER raises
500 with a clear message (can't create keys without a DB), FALLBACK_TEAM_MAPPING
returns None unchanged.

Adds three tests: REJECT/403 with no DB, FALLBACK returns None with no DB,
AUTO_REGISTER/500 with no DB.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(jwt): consistent AUTO_REGISTER on cached sentinel; clean up race orphans

Addresses Greptile review on PR #25570 cherry-pick.

1. Inconsistent AUTO_REGISTER when __NO_MAPPING__ sentinel is cached:
   The cached-sentinel branch silently returned None when prisma_client was
   None, while the fresh path raised HTTP 500 under the same config. Same
   request, different access-control outcome depending on cache state. Both
   paths now raise the same 500.

2. Orphaned virtual keys from race-condition losers:
   On unique-constraint conflict, generate_key_helper_fn had already persisted
   an unrestricted virtual key in LiteLLM_VerificationToken with the cleartext
   in request memory. Under sustained concurrency these accumulated
   indefinitely. The loser now deletes its orphan before falling back to the
   winner's mapping; failure to delete is logged but does not fail the request.

Also corrects a latent FK bug surfaced while fixing #2: the mapping row was
storing the plaintext key in LiteLLM_JWTKeyMapping.token, but that column FKs
to the hashed LiteLLM_VerificationToken.token — now hashed at the call site.

Tests:
- updated test_auto_register_creates_key_and_mapping to assert the hashed
  token is stored, not the plaintext
- updated test_auto_register_race_condition_unique_conflict to assert the
  orphan is deleted with the correct hashed token
- added test_auto_register_raises_500_when_sentinel_cached_and_no_db
- added test_auto_register_race_conflict_tolerates_delete_failure

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jwt): close REJECT bypass when JWT omits the configured claim field

A JWT presented without the configured `virtual_key_claim_field` previously
returned None at the `claim_value is None` guard before the
`unregistered_jwt_client_behavior` check ran. A caller who knows the configured
claim-field name could bypass REJECT by simply omitting that field and falling
through to team-based JWT auth.

Apply the no-match policy on a missing claim:
  - REJECT          → 403
  - AUTO_REGISTER   → 403 (no stable identity to map; refuse rather than
                     create a sentinel-keyed record)
  - FALLBACK_TEAM_MAPPING → return None (unchanged, backward-compatible)

Adds three tests covering each branch of the missing-claim path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jwt): AUTO_REGISTER inherits team_id so keys are bounded by team limits

Auto-registered virtual keys were created with no team, model, route, rate, or
budget constraints — broader access than the standard team-based JWT auth path
the same client would have taken. Under AUTO_REGISTER, resolve the team_id
from the JWT (via the operator-configured team_id_jwt_field / team_id_default)
and stamp it on the new key. Downstream auth then applies the team's
budget/models/tpm/rpm/allowed_routes via the existing virtual-key flow.

Policy when team_id_jwt_field is configured:
  - JWT carries team claim → stamp resolved team_id
  - JWT lacks claim + team_id_default set → stamp default
  - JWT lacks claim + no default → 403 (refuse to create an unbounded key)

When neither team_id_jwt_field nor team_id_default is configured, the
operator has explicitly opted out of team-based limits — the auto-created
key has no team_id (matches what team-auth would do in the same config).

Adds 4 tests covering each branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jwt): make AUTO_REGISTER functional in prod; raise on missing winner

Two correctness fixes flagged by Greptile on the AUTO_REGISTER path:

1. generate_key_helper_fn was called without table_name="key". Without that,
   the helper falls into the user-upsert branch (table_name in (None, "user"))
   and tries to insert into LiteLLM_UserTable with user_id=None, which hits
   the NOT NULL @id constraint. AUTO_REGISTER would never have succeeded in
   production. Now passes table_name="key" explicitly, matching the
   /key/generate caller.

2. When the race loser refetches the winner's mapping and gets None (winner
   row concurrently deleted), the previous code returned None — and the
   caller in _resolve_jwt_to_virtual_key then fell through to less-
   restrictive team-based JWT auth, silently bypassing the configured
   AUTO_REGISTER policy. Now raises HTTP 503 so the caller retries against
   a stable state rather than getting unintended fallback access.

Adds one test for the 503 winner-vanishes path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jwt): defer AUTO_REGISTER until JWT policy is enforced by auth_builder

Closes the JWT policy bypass on the AUTO_REGISTER path flagged by veria-ai.

Before: when unregistered_jwt_client_behavior=auto_register and the JWT's
claim was unmapped, _resolve_jwt_to_virtual_key validated the JWT signature
and then immediately created a virtual key + mapping. JWTAuthManager.auth_builder
never ran for the first request (the new key short-circuited the team-auth
path), and every subsequent request hit the cached mapping — so custom_validate,
RBAC, scope_mappings, and user_allowed_email_domain were never enforced for
auto-registered clients.

After: _resolve_jwt_to_virtual_key returns a _PendingAutoRegister signal
instead of creating the key. The caller in _user_api_key_auth_builder runs
JWTAuthManager.auth_builder, then — only on a validated, policy-passing
result — calls _auto_register_jwt_mapping with the team_id / user_id from
that result. The created key inherits team + user limits from the validated
identity, and future cache hits load that already-policy-checked key.

Also drops the interim _resolve_inherited_team_id helper that pulled team_id
from raw JWT claims — same bypass risk; team_id now comes exclusively from
auth_builder.

Tests:
  - Rewrote two existing tests to assert _resolve_jwt_to_virtual_key returns
    _PendingAutoRegister (no key created yet) for both the fresh-DB-miss
    and stale-sentinel branches
  - Added a contract test that _auto_register_jwt_mapping stamps the
    validated team_id/user_id onto generate_key_helper_fn
  - Removed four stale team-binding tests that exercised the prior
    raw-claim helper

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update user_api_key_auth.py

* fix(jwt): cache proxy-admin AUTO_REGISTER path to avoid repeated DB lookups

Cache-miss regression introduced by the deferred-auto-register refactor:
when a JWT under AUTO_REGISTER resolved to a proxy admin, the is_proxy_admin
early-return in _user_api_key_auth_builder ran *before* the pending
auto-register cache-write block. Result: no cache entry, so every
subsequent proxy-admin request re-queried get_jwt_key_mapping_object
indefinitely.

Fix: write a __JWT_PROXY_ADMIN__ sentinel to user_api_key_cache before the
early return when a pending auto-register existed. _resolve_jwt_to_virtual_key
treats that sentinel as "skip mapping, fall through to auth_builder", so
future requests from the same JWT identity hit the cache instead of the DB.
auth_builder still runs full JWT policy on every request — only the
mapping DB lookup is short-circuited.

Adds one test asserting the sentinel cache-hit returns None without
hitting prisma_client.db.litellm_jwtkeymapping.find_first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(proxy): stamp org context on JWT auto-registered keys

AUTO_REGISTER keys were created with team_id and user_id only, so org budget checks were skipped after switching to the key-scoped path.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-04 19:00:36 -07:00
ryan-crabbe-berri
770fff7058
test(proxy): stop running real-DB tests in GitHub Actions unit jobs (#29700)
* test(proxy): stop running real-DB tests in GitHub Actions unit jobs

GitHub Actions unit jobs were spinning up a Postgres service container, but
the only active tests that touched it either used the DB incidentally (a
cargo-culted prisma_client.connect()) or were genuine integration tests
mislabeled as unit. Mock the incidental ones so the proxy-db job needs no
container, and move the tests that genuinely need a database (proxy
management behavior, master-key-not-persisted, schema-migration sync) to
CircleCI, which is already the real-infrastructure lane.

* test(proxy): restore no-unexpected-startup-writes canary in master-key test

Greptile noted the hash-match assertion no longer catches other unexpected
startup writes (a default key, a rotation artifact). The CircleCI job gives
each run a fresh DB, so a clean startup must leave the table empty; add that
canary back alongside the precise master-key assertion.
2026-06-04 14:56:02 -07:00
yuneng-jiang
1dbf46665e
test: make custom_tokenizer proxy tests hermetic (#29643)
test_custom_tokenizer_bug.py loaded Xenova/llama-3-tokenizer from
HuggingFace Hub at test time, so it flaked on shared CI runners whenever
HF returned 429 Too Many Requests; the surfaced LocalEntryNotFoundError
made it look like a connectivity bug.

Rewrite the suite to mock the one network boundary
(litellm.utils.Tokenizer.from_pretrained) while running the proxy's real
extraction-and-selection path. The regression test now asserts the
configured identifier from model_info.custom_tokenizer actually reaches
from_pretrained and that the response reports the huggingface tokenizer,
which the previous llama-3-named test could not distinguish from the
default path. A control test pins the no-custom-tokenizer case to the
OpenAI tokenizer with from_pretrained asserted unused.

Verified by reintroducing the original bug (model_info left unpopulated
from the deployment): the regression test fails (from_pretrained called 0
times) while the control stays green.
2026-06-04 12:51:37 -07:00
Mateo Wang
9344f205a8
fix(proxy): add default=None to LiteLLM_TeamMembership.litellm_budget_table (#29684)
In Pydantic v2, Optional[T] without a default is a required field. Any
row with budget_id=null triggered a validation error and returned 401.

Co-authored-by: Florent Chenebault <florent.chenebault@lifen.fr>
2026-06-04 12:13:11 -07:00
Sameer Kankute
cb041966bf
Litellm oss staging 040626 (#29671)
* fix(azure): apply api_version fallback chain to image edit URL

`AzureImageEditConfig.get_complete_url` only read `api_version` from
`litellm_params`. When callers configured it via `litellm.api_version`
or `AZURE_API_VERSION`, the constructed URL had no `?api-version=` and
Azure responded `404 Resource not found`.

Apply the same fallback chain the Azure chat path already uses in
`common_utils.py`:

    litellm_params > litellm.api_version > AZURE_API_VERSION env >
    litellm.AZURE_DEFAULT_API_VERSION

Adds 5 unit tests pinning each layer of the chain plus a regression
guard for `api_base` that already carries `?api-version=`.

* feat(mcp): core sampling and elicitation flow with security hardening

- Add sampling_handler.py: full MCP sampling/createMessage flow with
  model selection (hint-based + priority-based), auth enforcement,
  budget checks, route restriction gates, and tag policy pre-auth
- Add elicitation_handler.py: MCP elicitation/create relay with
  downstream client capability detection
- Wire sampling/elicitation callbacks in mcp_server_manager.py
  gated behind allow_sampling/allow_elicitation config flags
- Add allow_sampling/allow_elicitation fields to MCPServer type
- Fix session lock deadlock: skip lock for JSON-RPC response POSTs
  (elicitation/sampling replies) with truncated-body heuristic
- Extend client.py with sampling_callback and elicitation_callback
- Security: RouteChecks gate, tag-budget bypass fix, x-forwarded-for
  spoofing fix, Latin-1 header encoding guard
- Add 4 new test modules (model access, priority selection, request
  builder, tool conversion) + update existing MCP tests

* fix(security): run pre-call guardrails before MCP sampling acompletion

Without this, an upstream MCP server with allow_sampling enabled could
send prompts that bypass every guardrail (content filtering, PII
redaction, prompt-injection detection) configured on /chat/completions.

- Call proxy_logging_obj.pre_call_hook(call_type='acompletion') before
  llm_router.acompletion so guardrails fire for sampling sub-calls
- Add HTTPException to the re-raise list so guardrail rejections
  propagate correctly instead of being swallowed as generic errors

* feat(bedrock_mantle): add Responses API support (/openai/v1/responses) (#29490)

* feat(bedrock_mantle): add Responses API transformation config

* test(bedrock_mantle): cover trailing-slash api_base normalization

* feat(bedrock_mantle): export BedrockMantleResponsesAPIConfig

* feat(bedrock_mantle): register gpt-5.x Responses config (gpt-oss unchanged)

* feat(bedrock_mantle): add gpt-5.5/gpt-5.4 Responses price-map entries

* refactor(bedrock_mantle): exclude gpt-oss instead of allow-listing gpt-5 for Responses routing

Frontier OpenAI models on Bedrock Mantle are Responses-only on /openai/v1/responses;
gpt-oss is the legacy family that also speaks chat-completions. Gate by excluding
gpt-oss (which keeps its chat-completions emulation) and defaulting everything else
to the native Responses config, so future frontier models (gpt-6, etc.) route
correctly without a code change. Verified against the live us-east-2 Mantle endpoint:
gpt-oss 400s on /openai/v1/responses while gpt-5.5 400s on both standard paths.

* test(bedrock_mantle): cover supports_native_websocket opt-out

Closes the one uncovered line flagged by codecov on the Responses config.
The assertion documents that Mantle Responses has no realtime/websocket
transport, so realtime routing must not attempt a socket it cannot serve.

* fix(bedrock_mantle): route file_search through emulation instead of forwarding to Mantle

BedrockMantleResponsesAPIConfig inherited supports_native_file_search()
-> True from OpenAIResponsesAPIConfig but never overrode it. Mantle has no
OpenAI vector stores, so a forwarded file_search tool is rejected with a
400 (verified upstream: Tool type 'file_search' is not supported). Opting
out, like the existing supports_native_websocket override, routes the tool
through LiteLLM's file_search emulation instead.

* fix(bedrock_mantle): only route openai.gpt frontier models to Responses

The previous gate excluded gpt-oss and routed every other model to the
native Responses config. But on Mantle only the OpenAI gpt frontier models
(gpt-5.x) are served on /openai/v1/responses; gpt-oss and the non-OpenAI
families (nvidia, mistral, google, zai, ...) are chat-completions only and
400 on that path. Allow-list the openai.gpt- family (excluding gpt-oss)
instead, so chat-only models fall through to the chat-completions emulation.
Verified against the live us-east-2 endpoint: nvidia.nemotron-nano-9b-v2
returns 400 on /openai/v1/responses and 200 on /v1/chat/completions.

* feat(custom_llm): allow streaming/astreaming to yield ModelResponseStream (#27580)

* fix(custom_llm): allow streaming/astreaming to yield ModelResponseStream directly

* fix(streaming): enhance ModelResponseStream handling for custom LLM providers

* fix(streaming): strip finish_reason from content chunks and ensure tool_calls are preserved

* fix(streaming): add type ignore for finish_reason assignment in CustomStreamWrapper

* fix(proxy): strip stack trace from HTTP 503 responses (CWE-209) (#28330)

* fix(proxy/cwe-209): strip Python traceback from HTTP 503 error responses

The /cache/ping endpoint included a full Python traceback in its 503 error
response body (inside the ProxyException message), leaking internal file
paths, line numbers, and call stacks to any caller. Two MCP route handlers
in proxy_server.py similarly interpolated str(e) into "Internal server
error" detail strings.

Fix: log the traceback server-side via verbose_proxy_logger.exception()
and omit it from the ProxyException payload / HTTPException detail returned
to clients. Tests updated to assert no "traceback" keyword or frame paths
appear in the 503 body, with a new dedicated regression test.

CWE-209: Generation of Error Message Containing Sensitive Information.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(proxy/cwe-209): apply Greptile P2 fixes and add MCP exception-path tests

Greptile 4/5 review identified two remaining gaps and Codecov reported
0% coverage on the two MCP handler exception branches:

1. caching_routes.py — str(e) in "Service Unhealthy ({str(e)})" could
   still leak Redis hostnames/IPs; replaced with static "Service Unhealthy".
   HTTPException is now re-raised before the generic handler so the
   "cache not initialized" 503 still reaches callers with its detail.
   Removed the redundant str(e) arg from verbose_proxy_logger.exception()
   (exception() already appends the traceback automatically).

2. tests — two new unit tests cover the exception paths in
   dynamic_mcp_route and toolset_mcp_route that were previously at 0%:
   - test_dynamic_mcp_route_unexpected_exception_returns_500_without_traceback
   - test_toolset_mcp_route_unexpected_exception_returns_500_without_traceback

All 25 tests pass (9 caching + 16 MCP).

CWE-209: Generation of Error Message Containing Sensitive Information.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(caching_routes): restore precise assertion in test_cache_ping_no_cache_initialized

The assertion was weakened to `"Cache not initialized" in str(data)`, which
matches the raw string of the entire response dict and would pass even if the
error moved to an unexpected field or changed structure.

Restore a targeted check on the parsed response: assert the exact string in
the correct field `data["detail"]`, matching FastAPI's HTTPException
serialisation format {"detail": "<message>"}.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(caching_routes): restore precise assertion and add CWE-209 no-cache path test

The assertion in test_cache_ping_no_cache_initialized was weakened to
`"Cache not initialized" in str(data)`, which matched against the raw string
representation of the entire response dict. This would pass silently even if
the error message moved to an unexpected field or the structure changed.

Restore a targeted assertion on the parsed field:
  assert data["detail"] == "Cache not initialized. litellm.cache is None"
matching FastAPI's HTTPException serialisation format exactly.

Add test_cache_ping_no_cache_does_not_expose_internals to show the code path
is still working correctly after the CWE-209 fix: verifies that the HTTPException
is re-raised as-is (no traceback, no source paths), and asserts the complete
response structure is exactly {"detail": "Cache not initialized. litellm.cache is None"}.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(caching_routes): restore ProxyException envelope for null-cache 503

The except HTTPException: raise guard (added in the CWE-209 fix) caused
the null-cache HTTPException to escape as FastAPI's {"detail": "..."} shape
instead of the {"error": {...}} ProxyException envelope that callers expect.

Move the null-cache guard before the try block and raise ProxyException
directly so the response structure is consistent with all other /cache/ping
503s, and the except HTTPException: raise guard is only reachable by
unexpected downstream HTTPExceptions.

Update the two no-cache tests to assert the correct ProxyException envelope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update utils.py (#26609)

* feat(pricing): add Snowflake Cortex REST API model pricing (#26612)

* feat(pricing): add Snowflake Cortex REST API model pricing

## Summary

Adds pricing and context window information for 20+ Snowflake Cortex REST API models to `model_prices_and_context_window.json`.

## What's included

- **7 Claude models** (sonnet-4-5, sonnet-4-6, 4-sonnet, 4-opus, haiku-4-5, 3-7-sonnet, 3-5-sonnet) — with prompt caching rates
- **4 OpenAI models** (gpt-4.1, gpt-5, gpt-5-mini, gpt-5-nano) — with prompt caching rates  
- **5 Llama models** (3.1-8b, 3.1-70b, 3.1-405b, 3.3-70b, 4-maverick)
- **1 DeepSeek model** (deepseek-r1)
- **1 Mistral model** (mistral-large2)
- **1 Snowflake model** (snowflake-llama-3.3-70b)
- **2 Embedding models** (arctic-embed-l-v2.0, arctic-embed-m-v2.0)

Each entry includes `input_cost_per_token`, `output_cost_per_token`, `cache_read_input_token_cost` (where applicable), `max_input_tokens`, `max_output_tokens`, and capability flags (`supports_function_calling`, `supports_vision`, `supports_prompt_caching`, `supports_reasoning`).

## Pricing source

All prices are in USD per token, sourced from the official [Snowflake Service Consumption Table](https://www.snowflake.com/legal-files/CreditConsumptionTable.pdf) — Tables 6(b) (REST API with Prompt Caching) and 6(c) (REST API).

## Context

The existing `snowflake/` provider has zero model entries in the pricing JSON, which means LiteLLM cannot track costs for Snowflake Cortex calls. This PR fills that gap.

## Related

- Existing provider: `litellm/llms/snowflake/`
- Cortex REST API docs: https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-rest-api

* Update model_prices_and_context_window.json

Fix the JSON parsing error

* Update model_prices_and_context_window.json

Removed the duplicate entry

* fix(utils): copy extra_body before adding unknown params to prevent model config mutation (#29620)

Fixes #29615. In add_provider_specific_params_to_optional_params, the line:

    extra_body = passed_params.pop("extra_body", None) or {}

returns the original dict reference when extra_body is non-empty (truthy).
Subsequent writes like extra_body[k] = passed_params[k] then mutate the
shared model config object held by the router, poisoning /model/info and
all subsequent requests for that deployment.

The or {} short-circuit creates a new dict only when extra_body is falsy
(None or {}), which is why the bug does not reproduce with extra_body: {}.

Fix: wrap in dict() so we always work on a fresh shallow copy.

* fix(vertex_ai): Bake tool_choice into Gemini CachedContent body to prevent silent drop (#29097)

* fix(vertex_ai): bake tool_choice into Gemini CachedContent body to prevent silent drop

* address greptile feedback on tool_choice cache test

* adds test that uses ToolConfig(functionCallingConfig=FunctionCallingConfig(mode=ANY)) instead of a dict literal, mirroring what map_tool_choice_values actually produce

* fix(gemini/veo): move image from parameters into instances[0] (#29501)

* fix(gemini/veo): move image from parameters into instances[0]

Veo's predictLongRunning schema puts image (and prompt) on the
instances element; parameters is for aspectRatio/durationSeconds/etc.
The Gemini path was leaving image in params_copy, so it ended up
nested under parameters and the API silently ignored it.

The Vertex path already builds the instance dict explicitly, so this
just aligns the Gemini path with it.

Fixes #29498

* address greptile: unconditional pop + BytesIO test

- Pop `image` from params_copy unconditionally so it never reaches
  GeminiVideoGenerationParameters even when None, removing implicit
  reliance on Pydantic's extra-field-ignore.
- Add test_transform_video_create_request_image_filelike_goes_to_instance
  covering the BytesIO path (_convert_image_to_gemini_format) — round-trips
  the base64 to confirm encoding.
- Add test_transform_video_create_request_image_none_is_dropped covering
  the new None branch.

* fix(huggingface): handle special token text in embedding usage (#29660)

* fix(guardrails): recompile ToolPermissionGuardrail rules on update_in_memory_litellm_params (#29655)

* fix(guardrails): recompile ToolPermissionGuardrail rules on update_in_memory_litellm_params

ToolPermissionGuardrail builds self.rules and the compiled target/pattern
maps only in __init__. The base update_in_memory_litellm_params re-sets raw
attributes via setattr but never rebuilds those maps, so a guardrail updated
in place (PUT /guardrails, or the immediate in-memory sync) keeps enforcing
the construction-time rules until it is reinitialized (PATCH path, periodic
DB poll, or restart).

Extract the compile step into _load_rules and override
update_in_memory_litellm_params to rebuild from it (dict- and model-safe),
re-normalizing default_action / on_disallowed_action. Mirrors the existing
PresidioGuardrail override of the same method. Adds regression tests.

Fixes #29592.

* fix(guardrails): handle dict params in ToolPermissionGuardrail in-memory update

Delegate to super() only for LitellmParams input (the base setattr loop is
model-only); apply the raw-dict case inline. Fixes the mypy arg-type error
and makes the recompile work when the proxy passes the raw DB dict.

* fix(guardrails): preserve tool-permission rules on a partial in-memory update

A partial update (e.g. a LitellmParams whose rules field is None) ran through
the generic setattr, which set self.rules to None, and the recompile was
skipped, leaving the guardrail with no rules. Snapshot the previous rules and
restore them when the update carries no rules; an explicit empty list still
clears them. Adds a regression test for the rules-absent case.

Addresses the Greptile review note on #29655.

* fix(bedrock): stop base_model label from stripping tools/tool_choice (#29621)

* fix(bedrock): stop base_model label from stripping tools/tool_choice

A Router/proxy Bedrock deployment whose model_info.base_model is a friendly
label (e.g. claude-haiku-4-5) silently lost tools/tool_choice: the outgoing
Converse request was built without toolConfig, so the model behaved as if no
tools were provided. Worked in v1.84.0, regressed in v1.85.0, and with
drop_params=true it failed silently.

Two changes compound into the bug. completion() passed model_info.base_model
as the model argument to get_optional_params, so the real Bedrock model id
never reached supported-param resolution; and get_supported_openai_params
resolved the provider config's params from base_model or model, letting the
label fully replace the real model. For Bedrock the label resolves to no tool
support, so tools/tool_choice were dropped before transformation.

completion() now keeps model as the real deployment model and threads the
resolved base_model (kwarg or model_info) through separately, and
get_supported_openai_params treats base_model as additive: it returns the
union of the params supported by model and by base_model. A hint can only add
capabilities, never strip ones the real model already exposes, which also
preserves the original base_model behavior from #27717 and Azure's base_model
driven model-type detection.

Fixes #29618

* test(main): make base_model param test robust to new parametrize cases

Restore an explicit per-case expected_model_param literal instead of
hardcoding the gemini id, so a future case with a different model can't
produce a misleading assertion failure.

* fix(fireworks_ai): pass response_format json_schema through unchanged (#29606)

FireworksAIConfig.map_openai_params was rewriting the OpenAI strict
`{type: json_schema, json_schema: {name, strict, schema}}` shape into
`{type: json_object, schema: ...}` before sending to Fireworks, dropping
`strict` and `name` and changing the `type`. Per Fireworks' docs json_object
means "force any valid JSON output (no specific schema)", so the schema
constraint was effectively dropped and grammar-guided decoding never ran;
model output silently violated the schema.

The rewrite landed in #7085 (Dec 2024) when Fireworks did not yet accept
native json_schema. Fireworks accepts the OpenAI strict shape natively now,
so the rewrite has become a regression.

Removes the rewrite. Passes response_format through unchanged. Updates the
existing test_map_response_format to assert pass-through. Adds focused
regression tests in tests/test_litellm/ covering preservation of type,
strict, name, and schema body, plus that json_object alone still works.

* fix(types): import Required from typing_extensions in gemini types

* style: reformat sampling_handler.py for py312 black compat

* refactor(mcp-sampling): extract helpers to fix PLR0915 too-many-statements in handle_sampling_create_message

* fix(proxy-server): add explicit ProxyLogging type annotation to proxy_logging_obj to fix mypy inference

* fix(mcp-sampling): suppress mypy assignment error on ImportError fallback for proxy_logging_obj

* fix(test): use .value when comparing LlmProviders enum against string in test_default_api_base

* fix(test): iterate LlmProviders enum in test_default_api_base to avoid str pollution from custom provider registration

litellm.provider_list is a mutable global initialized to list(LlmProviders) but custom_llm_setup() appends plain provider strings to it. When a test_custom_llm.py test runs first in the same xdist worker, provider_list contains a str and calling .value on it raises AttributeError. Iterate the immutable LlmProviders enum instead, which is deterministic and what the check intends.

* fix(mcp): depth-aware JSON-RPC response detection and neutral speed-priority fallback

Replace the flat substring check in the truncated-body routing path with a
top-level-key scan so a JSON-RPC response whose result payload nests a
"method" field is still detected as a response and skips the session lock,
removing a deadlock against the in-flight tool call awaiting it.

Drop the inverse max_output_tokens speed proxy when no model exposes
output_tokens_per_second; context-window size does not track latency, so a
neutral score avoids biasing speedPriority toward the smallest-context model.

* fix(guardrails): make ToolPermission rule reload atomic on invalid regex

_load_rules appended each rule to self.rules before compiling its regex, so an
invalid pattern raised mid-loop after the bad rule was already live but without
a _compiled_rule_targets entry. _matches_regex reads a missing compiled target
as a None pattern and returns True, turning the bad rule into a match-all that
silently applies its decision to every tool. Via update_in_memory_litellm_params
(PUT /guardrails) this corrupted the live guardrail.

Build the parsed rules and compiled maps into locals and swap them in only after
every regex compiles, and restore the previous ruleset if a live update is
rejected, so an invalid regex now fails the update without leaving the guardrail
enforcing a broken policy.

* test(mcp): cover sampling conversion, model resolution, and elicitation relay paths

The MCP sampling and elicitation handlers shipped with partial test
coverage, leaving the response-to-MCP conversion, the model resolution
fallback chain, completion-kwargs assembly, guardrail routing, and the
entire elicitation relay untested. That pulled the PR's diff (patch)
coverage below the codecov threshold even though overall project
coverage rose.

Add focused unit tests for _convert_openai_response_to_mcp_result,
_convert_mcp_tools_to_openai, _convert_mcp_tool_choice_to_openai, image
and audio content conversion, the hint-matching and fallback branches of
_resolve_model_from_preferences, _build_completion_kwargs, the router and
guardrail-rejection paths of _run_guardrails_and_call_llm, the
handle_sampling_create_message success and error-propagation flows, the
marker-hoisting fallback for tool content on unexpected roles, and the
elicitation form/url/generic relay together with its decline paths

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: lengkejun <lengkejun@xd.com>
Co-authored-by: Yug <yugborana000@gmail.com>
Co-authored-by: Kent <72616338+kingdoooo@users.noreply.github.com>
Co-authored-by: tanmay958 <53569547+tanmay958@users.noreply.github.com>
Co-authored-by: DrishnaTrivedi <142084770+DrishnaTrivedi@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Navnit Shukla <Navnit.shukla25@gmail.com>
Co-authored-by: PRABHU KIRAN VANDRANKI <72809214+VANDRANKI@users.noreply.github.com>
Co-authored-by: Adrian Lopez <109683617+adriangomez24@users.noreply.github.com>
Co-authored-by: hcl <chenglunhu@gmail.com>
Co-authored-by: JooHo Lee <96564470+BWAAEEEK@users.noreply.github.com>
Co-authored-by: Dinesh Girbide <85330597+Dinesh-Girbide@users.noreply.github.com>
Co-authored-by: cloudwiz <22098246+andrey-dubnik@users.noreply.github.com>
Co-authored-by: Ahmad Khan <ahmadkhan2508@gmail.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>
2026-06-04 11:07:20 -07:00
Sameer Kankute
ed073d382d
fix(gemini-realtime): use GA event names for Pipecat 1.3.x compatibility (#29662)
* fix(gemini-realtime): use GA event names for Pipecat 1.3.x compatibility

Pipecat v1.3.0 adopted the OpenAI Realtime API GA event naming:
  response.audio.delta          -> response.output_audio.delta
  response.text.delta           -> response.output_text.delta
  response.audio.done           -> response.output_audio.done
  response.text.done            -> response.output_text.done

The proxy was still emitting the old beta names; Pipecat's
`parse_server_event` raises "Unimplemented server event type" for any
unknown type, which killed the receive task handler and broke audio
playback and tool-call delivery.

Also:
- conversation.item.created -> conversation.item.added (already handled)
- client audio is buffered until backend setupComplete in deferred mode
- call_id fallback UUID when Gemini returns empty id
- status_details / token detail fields added to Pydantic-strict events

The _GA_TO_BETA_EVENT_TYPES map in RealTimeStreaming already translates
GA names back to beta for clients that opt in with the openai-beta
header, so legacy clients are unaffected.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(gemini-realtime): address greptile review comments

- emit outputTranscription as response.output_audio_transcript.delta
  instead of suppressing it; GA_TO_BETA map handles translation for
  legacy clients
- cap pre-setup audio buffer at 200 frames to prevent memory exhaustion;
  log a warning when the limit is hit and additional frames are dropped
- log remaining dropped message count on flush error

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(gemini-realtime): address veria review comments

- remove unused OpenAIRealtimeConversationItemCreated import
- fix guardrail bypass: semantic_vad early-return now preserves
  create_response when set so a guardrail-injected create_response:false
  is not silently dropped
- add per-connection 10 MB byte cap alongside the 200-frame count cap
  for the pre-setup audio buffer to prevent memory exhaustion

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(gemini-realtime): fix mypy arg-type on _finalize_gemini_live_setup

setup parameter typed as BidiGenerateContentSetup to match the TypedDict
passed at both call sites; was dict which mypy rejected.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(gemini-realtime): widen _finalize_gemini_live_setup to Dict[str, Any]

BidiGenerateContentSetup (TypedDict) is a subtype of Dict[str,Any] so
both call sites (one passing a plain dict, one passing the TypedDict)
satisfy mypy.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(gemini-realtime): cast BidiGenerateContentSetup to Dict at _finalize call site

mypy rejects TypedDict as dict[str, Any] argument; cast at the call site
where follow_up_setup is BidiGenerateContentSetup to satisfy the checker.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix Gemini realtime beta compatibility

* Fix deferred Gemini setup audio ordering

* fix: preserve Gemini audio transcript ids

* fix(realtime): cap pre-setup client buffer on all append paths

Route every append to the deferred-setup pending buffer through the
per-connection message/byte caps. Previously only the audio-buffer
fast path enforced the caps; once one frame was buffered, a client
that withheld session.update could stream arbitrary frames into
_pending_messages_until_setup unbounded and exhaust proxy memory.

* style(gemini-realtime): apply black formatting to transformation.py

* fix(gemini-realtime): log beta-translation fallback and name native-audio marker

Surface the previously swallowed exception in _send_event_to_client so a
failed GA->beta translation is observable instead of silently forwarding the
untranslated event. Extract the native-audio model substring used by
_finalize_gemini_live_setup into a named constant documenting why speechConfig
is dropped on those setups.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>
2026-06-04 08:03:24 -07:00
Sameer Kankute
20dc6dffa4
fix(proxy): passthrough 404 when SERVER_ROOT_PATH is set (#29658)
* fix(proxy): match passthrough registry routes bare-to-bare with SERVER_ROOT_PATH

After #28547, get_request_route strips the deployment prefix while registry
lookup still re-inflated stored paths via SERVER_ROOT_PATH, causing 404s
under paths like /llmproxy/ml. Compare normalized bare routes in both
is_registered_pass_through_route and get_registered_pass_through_route.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(proxy): patch utils.get_server_root_path in passthrough auth tests

After removing get_server_root_path from pass_through_endpoints, route
and JWT tests must mock litellm.proxy.utils where normalization reads it.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-04 07:44:51 -07:00
Sameer Kankute
216c68db04
fix(gemini): googleSearch + server-side tools and googleMaps JSON schema (#29582)
* fix(gemini): keep googleSearch with server-side tools and googleMaps JSON schema

Wire include_server_side_tool_invocations through completion() so mixed
google_search and function tools are not dropped on Gemini 3+. Rewrite
generationConfig to responseFormat when googleMaps is used with JSON schema.

Fixes #27479
Fixes #29451

Co-authored-by: Cursor <cursoragent@cursor.com>

* address greptile review feedback (greploop iteration 1)

* style: fix black formatting in main.py for py312 compat

* Fix Gemini Google Maps extra_body JSON rewrite

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-04 07:43:30 -07:00