4ec4ab99d0
39572 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
4ec4ab99d0
|
feat(mcp): per-server env vars with global + per-user scopes (#28917) | ||
|
|
53cf3d8416
|
fix(proxy): drop deleted team BYOK model name from team.models (#29820)
Deleting a team-scoped BYOK model left its public name in team.models, so /models with a team key kept listing the now-deleted "ghost" model. delete_model stripped team.models using only litellm_modeltable alias lookups, but models added via /model/new with a team_id never create an alias row; their public name lives only in team.models and model_info.team_public_model_name, so it was never removed. The team cache was also left stale because the delete path skipped _refresh_cached_team. The cleanup now keys off team_public_model_name (falling back to alias keys), runs after the deployment row is deleted, and strips a public name only when no remaining team deployment still backs it, so a load-balanced replica is not revoked and concurrent deletes cannot leave a ghost. The updated team row is refreshed in cache so /models reflects the change immediately. |
||
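The "strip only when no remaining deployment backs it" rule from the commit above can be sketched as follows. This is a minimal illustration, not the proxy's delete_model code: the function name and the plain-dict deployment shape are assumptions made for the example.

```python
from typing import Dict, List


def strip_unbacked_public_name(
    team_models: List[str],
    public_name: str,
    remaining_deployments: List[Dict[str, str]],
) -> List[str]:
    """Drop public_name from team_models unless another deployment still backs it.

    Hypothetical helper: remaining_deployments stands in for the team's
    deployments that are left after the delete has been applied.
    """
    still_backed = any(
        d.get("team_public_model_name") == public_name
        for d in remaining_deployments
    )
    if still_backed:
        # A load-balanced replica still serves this name, so keep it listed.
        return list(team_models)
    return [m for m in team_models if m != public_name]
```

Running the check after the deployment row is gone is what lets the remaining-deployments list be trusted.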
|
|
e53bd7cbd1
|
feat(ui): generate dashboard API types from the proxy OpenAPI spec (#29816)
* feat(ui): generate dashboard API types from the proxy OpenAPI spec
Introduces the shared type foundation for the dashboard without touching any runtime code. The proxy's FastAPI app is the source of truth; app.openapi() emits the spec and openapi-typescript turns it into src/lib/http/schema.d.ts. Adds an npm run gen:api script (a Python spec dump piped into openapi-typescript) and a Check UI API Types Sync CI job that regenerates the file from the live spec and fails if it drifts, so the committed types can never silently fall out of step with the backend. The generated file is pinned to openapi-typescript 7.13.0 and excluded from prettier, eslint, and knip, and marked linguist-generated so it collapses in diffs. No openapi-fetch and no call-site changes yet; this only makes the types exist.
* chore(ui): tidy gen-api-types script per review
Write the spec dump inside a with-block and clean up the temp dir in a finally, so repeated local runs don't leave stray ~MB JSON files behind. |
||
|
|
b7f47a3b52
|
fix(jwt): use resolved DB user_id for spend on legacy email match (#29217)
* fix(jwt): attribute spend to resolved DB user_id on email/sso fuzzy match
When user_id_upsert is enabled with JWT auth and a pre-migration user row exists whose user_email matches the JWT email but whose user_id is a UUID, get_user_object resolves the legacy row via fuzzy lookup, but the JWT-claim user_id (the email) still flowed into team-membership lookup, JWTAuthBuilderResult.user_id, UserAPIKeyAuth and the spend tables. Spend was orphaned under a phantom email id; /user/info and the Usage page showed $0 for the legacy user (GH #26789). Treat the resolved user_object as the source of truth: add _canonical_user_id_from_db, rebind inside get_objects, and return effective_user_id so auth_builder unpacks it without adding statements.
Fixes #26789
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(jwt): log user_id rebind at DEBUG to avoid email PII in INFO streams
Greptile review on #29217: rebinding often logs JWT email claims at INFO.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(jwt): update passthrough allowlist mock for 5-tuple get_objects
Staging #29256 added a test that still mocked get_objects with a 4-tuple; our PR expanded the return to 5 values (effective_user_id).
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com> |
||
|
|
95e3d136e1
|
test(google): add google-genai SDK proxy integration tests (#29781)
* test(google): add google-genai SDK proxy integration tests for Gemini and Vertex
Pin google-genai in the CI dependency group and exercise streaming/non-streaming generate_content through the LiteLLM proxy in the existing unified_google_tests suite.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(test): address Greptile review for google-genai proxy SDK tests
Restore GOOGLE_APPLICATION_CREDENTIALS after the module proxy fixture tears down, initialize temp-file tracking on the proxy SDK base class, and skip litellm reload for proxy_genai_sdk tests so the module-scoped proxy server stays consistent.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(test): only load Vertex credentials when keys exist for proxy SDK tests
Avoid writing empty GOOGLE_APPLICATION_CREDENTIALS temp files so Vertex tests skip cleanly without credentials, use a session-scoped proxy fixture, and clean up per-test credential temp files.
Co-authored-by: Cursor <cursoragent@cursor.com>
* chore(test): scope google-genai pin to unified_google_tests only
Remove google-genai from the ci dependency group and pin it in tests/unified_google_tests/requirements.txt for local test installs.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(google): tie litellm reload skip to proxy fixture dependency
Replace the name-based reload guard with a check on whether the test requests the google_genai_proxy_url fixture, so the skip stays correct if the proxy SDK tests are renamed.
* fix(test): stop DatabaseURLSettings tests leaking DATABASE_URL into os.environ
The autouse env scrubber relied on monkeypatch.delenv, but apply_to_env writes DATABASE_URL straight into os.environ, which monkeypatch never tracks and therefore never undoes. The synthesized writer.example.com URL leaked past the last test in this module and into proxy-infra tests that read DATABASE_URL to decide whether to hit a real database, e.g. test_deprecated_key_grace_period_cache_hit_path, turning an intended skip into a ConnectError.
Snapshot and restore the managed vars directly so the original environment is reinstated regardless of how it was mutated.
* test(google): drop redundant per-test vertex credential setup
The session-scoped google_genai_proxy_url fixture already configures GOOGLE_APPLICATION_CREDENTIALS before the proxy starts, and _require_proxy_sdk skips when credentials are missing, so the per-test _setup_vertex_credentials_if_needed helper and its temp-file tracking never did any work. Remove it to keep the ABC self-contained.
* test(google): declare model_config contract on proxy SDK ABC
_skip_reason_if_credentials_missing reads self.model_config to pick the provider, but that property was only declared on the sibling BaseGoogleGenAITest. Make the dependency explicit by adding model_config as an abstract property on BaseGoogleGenAIProxySDKTest so the ABC is self-contained and a standalone subclass fails fast instead of hitting an AttributeError.
* test(google): narrow streaming error catch to Exception
Catching BaseException in the streaming assertion swallowed KeyboardInterrupt and SystemExit, turning a Ctrl-C into a test failure message instead of letting pytest interrupt cleanly. Only genuine runtime errors should be recorded as stream failures, so catch Exception.
* test(google): initialize proxy on the same loop that serves it
The proxy was initialized via asyncio.run() on the main thread, which creates and tears down a throwaway event loop, while requests were served on a separate loop in the worker thread. Any asyncio primitive bound to the init loop would be unusable once serving started. Run initialize() on the worker thread's loop right before server.serve() so setup and request handling share a single event loop.
* test(google): drop redundant google-genai requirements pin
google-genai>=1.37.0,<2.0 is already declared in the proxy-runtime extra, which the google_generate_content_endpoint_testing CI job installs via uv sync --all-extras.
The standalone tests/unified_google_tests/requirements.txt duplicated that pin with a narrower ==1.37.0 specifier and was never installed by CI, so it added a second source of truth without changing what gets installed. Drop it and rely on the proxy-runtime extra.
* chore: revert incidental uv.lock exclude-newer bump
The google-genai ci pin was added and then dropped (it is already provided by the proxy-runtime group), but each uv lock recomputed the relative exclude-newer span, leaving only a timestamp bump in uv.lock. Restore it to the base value so this test-only PR carries no lockfile change.
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com> |
||
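The DATABASE_URL leak fix in the commit above comes down to snapshotting and restoring the variables directly, so that writes made straight to os.environ are undone as well. A minimal sketch of that pattern follows; preserved_env is a hypothetical name for illustration, not the helper used in the test suite.

```python
import os
from contextlib import contextmanager
from typing import Dict, Iterator, Optional


@contextmanager
def preserved_env(*names: str) -> Iterator[None]:
    """Restore the named environment variables on exit, however they were mutated."""
    # Record the current value of each variable (None means "was unset").
    snapshot: Dict[str, Optional[str]] = {n: os.environ.get(n) for n in names}
    try:
        yield
    finally:
        for name, value in snapshot.items():
            if value is None:
                os.environ.pop(name, None)  # variable did not exist before
            else:
                os.environ[name] = value  # reinstate the original value
```

Unlike monkeypatch.delenv, this does not depend on the mutation having gone through a tracked API.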
|
|
d671a09c20
|
Litellm oss staging 050626 (#29774)
* Mark xAI models retiring on 2026-05-15 (#28788) Per https://docs.x.ai/developers/migration/may-15-retirement, xAI is retiring the following slugs on 2026-05-15 (auto-redirect to grok-4.3 with various reasoning efforts; callers continuing to use the old slugs will be billed at grok-4.3 pricing): grok-4-1-fast-reasoning{,-latest} -> grok-4.3 (low effort) grok-4-1-fast-non-reasoning{,-latest} -> grok-4.3 (none) grok-4-fast-reasoning -> grok-4.3 (low effort) grok-4-fast-non-reasoning -> grok-4.3 (none) grok-4-0709 -> grok-4.3 (low effort) grok-code-fast-1{,-0825} -> grok-build-0.1 grok-3 -> grok-4.3 (none) Only the direct xai/ slugs are tagged; third-party hosts (azure_ai, oci, vercel_ai_gateway, perplexity/xai) run their own schedules. The grok-3 retirement list explicitly names only the base grok-3 slug — the -mini / -fast / -beta / -latest variants are not listed, so they remain untouched. * feat(moonshot): advertise json_schema response support on live models (#29683) litellm.responses() already routes Moonshot through the responses->chat-completions bridge, and Moonshot honors response_format json_schema on chat completions. The cost-map entries left supports_response_schema unset, so discovery layers that gate on that flag dropped Moonshot from structured-output / responses listings even though the capability works end to end. Set supports_response_schema on the nine models currently live on api.moonshot.ai: kimi-k2.5, kimi-k2.6, the moonshot-v1 8k/32k/128k text and vision-preview variants, and moonshot-v1-auto. Verified against the live API that each honors json_schema and that litellm.responses() returns schema-valid structured output through the bridge. * chore(moonshot): mark models retired from api.moonshot.ai as deprecated (#29685) Thirteen Moonshot/Kimi models in the cost map no longer resolve on api.moonshot.ai (all return 404). 
Stamp each with its deprecation_date from platform.kimi.ai/docs/models rather than deleting the entries, so historical cost calculation keeps resolving the names while tooling can surface the retirement. Dates: kimi-thinking-preview 2025-11-11; kimi-latest and its 8k/32k/128k context variants 2026-01-28; the kimi-k2 preview/turbo/thinking series 2026-05-25; the moonshot-v1 -0430 snapshots use their own 2024-04-30 snapshot date (Moonshot publishes no discontinuation date for them). * fix(moonshot): drop temperature for reasoning models (kimi-k2.5/k2.6) (#29687) Kimi reasoning models reject every temperature except 1; a request with temperature=0.2 returns "invalid temperature: only 1 is allowed for this model". litellm only clamped temperature into [0.3, 1], so any value below 1 still 400'd. Drop the temperature param entirely for reasoning models (gated on supports_reasoning, the same signal transform_request already uses) so the model default is used; the non-reasoning moonshot-v1 models keep the existing clamp. Co-authored-by: Sameer Kankute <sameer@berri.ai> * feat(mcp): add per-server timeout configuration (#29672) * feat(mcp): add per-server timeout configuration * fix(mcp): address timeout field review comments - use is not None guard instead of or for 0.0 edge case - copy timeout in both LiteLLM_MCPServerTable constructions (health check path + _build_mcp_server_table) - add timeout Float? column to all three schema.prisma files - extend round-trip test to cover _build_mcp_server_table direction - add test for zero timeout not treated as falsy * fix(mcp): forward timeout in _build_temporary_mcp_server_record * fix(mcp): return 504 instead of 500 when per-server timeout fires * test(mcp): add 504 timeout regression test; fix black formatting * Add jp. Bedrock cross-region inference profile for claude-opus-4-7 (#28567) * fix(thinking): handle None thinking param in is_thinking_enabled (#28598) Squash-merged by litellm-agent from Terrajlz's PR. 
* feat(helm): support tpl rendering in podAnnotations (#28609) Squash-merged by litellm-agent from devauxbr's PR. * Forward custom_llm_provider through the Responses API bridge (Fixes #28505) (#28575) * Forward custom_llm_provider through the Responses API bridge (Fixes #28505) When a Chat Completions request to a GPT-5.4+ model contains both `tools` and `reasoning_effort`, `completion()` auto-routes through `responses_api_bridge`. The bridge handler called `litellm.responses()` / `litellm.aresponses()` without forwarding the already-resolved `custom_llm_provider`, so the downstream call re-invoked `get_llm_provider()` with `custom_llm_provider=None` and stripped a second provider prefix from a `provider/provider/model` deployment string. For a deployment configured as `openai/openai/openai/gpt-5.5`, the bridge flow sent `openai/gpt-5.5` to the upstream API instead of the correct `openai/openai/gpt-5.5`. Upstream APIs that enforce model-name allow-lists rejected this as `key_model_access_denied`. Fix: pass the locally-resolved `custom_llm_provider` into both the sync `responses()` and async `aresponses()` calls so the downstream `_resolve_model_provider_for_responses` sees an explicit provider and skips the second prefix-strip. New regression test `tests/test_litellm/completion_extras/test_responses_bridge_provider_propagation.py` pins both call sites: each must forward `custom_llm_provider`. * fix(28505): set custom_llm_provider on request_data instead of as duplicate kwarg Greptile flagged that the previous patch passed custom_llm_provider as an explicit kwarg to responses()/aresponses() while request_data already carried it via the spread of sanitized_litellm_params, which would raise TypeError: got multiple values for keyword argument on every real bridge call. Switches to assigning request_data['custom_llm_provider'] before the call so the resolved provider wins over whatever sanitized_litellm_params spread in, without duplicating the kwarg. 
Updates the regression test to seed request_data with a sentinel custom_llm_provider so it actually exercises the overwrite path (the previous test mocked transform_request with a minimal dict and never hit the conflict). * chore: trigger shin-agent re-eval on retargeted staging base * chore: trigger shin-agent re-eval against updated Greptile state * Add jp. Bedrock cross-region inference profile for claude-opus-4-7 AWS Bedrock documents jp.anthropic.claude-opus-4-7 alongside the existing us./eu./au./global. profiles for Claude Opus 4.7 (ap-northeast-1 Tokyo / ap-northeast-3 Osaka), but the entry is missing from model_prices_and_context_window.json. Tokyo-region users currently get an "unknown model" error when routing through the JP geo profile. Adds the entry to both the canonical file and the bundled backup, mirroring the recent pattern for sonnet-4-6 (#27831). Pricing matches the other regional profiles (10% premium over base/global). Regression test pins all six documented profiles (base, global, us, eu, au, jp) and asserts pricing parity between jp. and au. variants. Source: https://docs.aws.amazon.com/bedrock/latest/userguide/model-card-anthropic-claude-opus-4-7.html --------- Co-authored-by: Terrajlz <info@jouleselectrictech.com> Co-authored-by: Bruno Devaux <devaux.br@gmail.com> Co-authored-by: Sameer Kankute <sameer@berri.ai> * feat(soniox): add soniox audio transcription integration (#29508) * feat(openmeter): add OPENMETER_TRUST_REQUEST_USER to prevent forged attribution (#29650) The OpenMeter callback resolves the CloudEvent subject from kwargs["user"] first, then falls back to the key-bound user_api_key_user_id. For multi-tenant proxy deployments, a client can set `"user": "..."` in the request body and cause their usage to be attributed to that arbitrary string — a billing-attribution forgery risk. Adds OPENMETER_TRUST_REQUEST_USER env var (default "true" for backward compatibility). 
When set to "false", the request-supplied `user` field is ignored and the subject is resolved solely from user_api_key_user_id. Matches the existing env-var-driven config pattern in this file (OPENMETER_API_KEY, OPENMETER_API_ENDPOINT, OPENMETER_EVENT_TYPE). * feat(search): add you_com as a search provider (#28370) * feat(search): add you_com as a search provider Registers You.com Search API as a first-class `search_provider` in the `search_tools` registry, alongside Tavily, Exa, Perplexity, etc. - New adapter: litellm/llms/you_com/search/transformation.py - POSTs to https://ydc-index.io/v1/search - Auth: X-API-Key from YOUCOM_API_KEY (or explicit api_key) - Maps Perplexity unified spec: max_results -> count, search_domain_filter -> include_domains, country -> country - Flattens results.web + results.news into a single SearchResult list; snippet prefers snippets[0], falls back to description; page_age -> date - Registry: SearchProviders.YOU_COM in litellm/types/utils.py and wired into ProviderConfigManager.get_provider_search_config() - Pricing entry: model_prices_and_context_window.json (placeholder $0.0; happy to adjust to maintainers' preferred public number) - Docs: example router config snippet and example proxy yaml updated - Tests: tests/search_tests/test_you_com_search.py - 5 mocked tests (payload shape, domain filter mapping, snippet fallback, news flattening, missing-api-key error) Refs upstream expansion signal: #15942 * review fixups: normalize api_base, lowercase country, scope env-var to test Addresses Greptile inline review comments on #28370: - get_complete_url: strip trailing slashes from api_base *before* the endswith("/v1/search") check, so a custom base like ".../v1/search/" doesn't become ".../v1/search/v1/search". - transform_search_request: .lower() country before sending, matching Tavily's convention so callers using the unified spec form ("US") get consistent behavior across providers. 
- Tests: replace direct os.environ writes with an autouse monkeypatch fixture so YOUCOM_API_KEY is set per-test and removed afterwards. The missing-key test now uses monkeypatch.delenv. New test asserts the trailing-slash normalization above. Reverts the ARCHITECTURE.md / example yaml edits per the reviewer note that documentation changes belong in the litellm-docs repo. * support keyless free tier (api.you.com/v1/agents/search) as default You.com offers an IP-throttled keyless endpoint that returns the same response shape as the keyed one (~100 queries/day, no signup). This is a significant onboarding lever - mirrors the keyless DuckDuckGo/SearXNG providers already in the search_tools registry. Behavior: - YOUCOM_API_KEY set -> keyed: POST https://ydc-index.io/v1/search (X-API-Key header) - no key -> free: POST https://api.you.com/v1/agents/search (no auth) - YOUCOM_API_BASE override -> honored as-is Tests: - New: test_you_com_search_keyless_free_tier - asserts URL + absence of X-API-Key when no key is configured. - New: test_you_com_search_validate_environment_keyless - asserts the config no longer raises when the key is absent. - Removed: test_you_com_search_raises_without_api_key (the precondition no longer holds). - Existing payload/domain-filter/etc tests still cover keyed mode via the autouse YOUCOM_API_KEY fixture. Verified both endpoints accept POST + return identical JSON shape: results.web[] / results.news[] with title, url, snippets, description, page_age. * register you_com in provider_endpoints_support.json Adding `litellm/llms/you_com/` requires a corresponding entry in provider_endpoints_support.json or the code-quality/check_provider_folders_documented CI check fails. Follows the compact tavily/serper pattern - endpoints: { search: true }. Local run of the check now reports "All 114 provider folders are documented". 
* move tests under tests/test_litellm/llms/ so CI exercises them The litellm CI workflows scope unit tests to `tests/test_litellm/...` (see test-unit-llm-providers.yml: `tests/test_litellm/llms` path), so tests living under `tests/search_tests/` are never run in CI - which is why codecov reports 0% patch coverage for the new adapter even though the unit tests exist and pass locally. Move test_you_com_search.py into `tests/test_litellm/llms/you_com/` so the test-unit-llm-providers job picks it up. 7/7 tests still pass at the new location. (Sibling search-only providers - tavily, exa_ai, brave, etc. - still live only in `tests/search_tests/` and would benefit from the same move, but that is out of scope for this PR.) * fix(you_com): pin Accept-Encoding: identity to dodge keyless gzip bug The keyless free-tier endpoint (api.you.com/v1/agents/search) advertises Content-Encoding: gzip but returns a body that httpx's decoder rejects with `zlib.error: Error -3 while decompressing data: incorrect header check`, surfacing as litellm.APIConnectionError in user code. curl works because it doesn't request compression by default. Pin Accept-Encoding: identity in validate_environment so the upstream server skips compression entirely. Harmless on the keyed endpoint (ydc-index.io/v1/search) which negotiates content-encoding correctly. The header uses setdefault so a caller-supplied Accept-Encoding still takes precedence. (Server-side bug has been flagged to the You.com team separately - once fixed there, this workaround can be removed.) New unit test: test_you_com_search_pins_identity_accept_encoding. --------- Co-authored-by: Sameer Kankute <sameer@berri.ai> * docs: fix README typo (#29419) Correct clear spelling mistakes in documentation without changing behavior. 
Confidence: high Scope-risk: narrow Tested: git diff --check; uvx codespell on changed files Not-tested: Full docs build not run; text-only changes * Fix(langfuse): pass httpx_client to Langfuse in langfuse_prompt_management to respect SSL_VERIFY (#29480) * fix(langfuse): pass ssl_verify to Langfuse httpx client * fix_langfuse_ * add unit tests * addressed comments --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * feat(models): add minimax/MiniMax-M3 to model cost map (#29412) Add MiniMax's new flagship MiniMax-M3 to the native minimax provider: 512K context, 128K max output, native multimodal (supports_vision), reasoning, prompt caching. Pricing (USD/M tokens): input 0.6 / output 2.4 / cache read 0.12. M3 has no active prompt-cache-write tier, so cache_creation_input_token_cost is omitted. Updated both the root model_prices_and_context_window.json (remote source) and the bundled litellm/model_prices_and_context_window_backup.json (local fallback), keeping them in sync. 
* fix(logging): handle ResponseCompletedEvent in anthropic_messages streaming spend log (#29394) * fix(logging): handle ResponseCompletedEvent in anthropic_messages streaming spend log * fix(logging): extend terminal event handling to ResponseIncompleteEvent and ResponseFailedEvent; fix return type annotation * feat(provider): Add Neosantara provider as OpenAI Compatible (#29646) * Add Neosantara provider * Register Neosantara provider enum * Address Neosantara provider review feedback * Add Neosantara packaged endpoint support --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * fix: address greptile and veria review feedback - langfuse: guard httpx_client injection behind version check (>= 2.7.3) - soniox: propagate audio_transcription_duration in _hidden_params for spend tracking - soniox: give SONIOX_API_BASE env var priority over caller-supplied api_base - mcp: replace CancelledError catch with asyncio.wait_for + TimeoutError * chore(mcp): add migration for per-server timeout column * fix(test): add tool_use_system_prompt_tokens to model prices schema validator * fix: mcp timeout test uses real asyncio.wait_for timeout; you_com get_complete_url respects resolved api_key * fix: forward resolved api_key into you_com endpoint selection and apply timeout to soniox polling GETs The search flow resolves api_key in validate_environment but never passed it into get_complete_url, so a programmatic api_key (with no YOUCOM_API_KEY in the env) set the X-API-Key header yet still selected the keyless free-tier endpoint. Forward api_key through both the search entrypoint and the http handler so the keyed endpoint is chosen. HTTPHandler.get/AsyncHTTPHandler.get had no timeout parameter, so the Soniox poll and transcript-fetch GETs silently used the client global default instead of the caller timeout. Add a per-request timeout to get() and forward the configured timeout from the Soniox handler. 
* fix(soniox): price stt-async-v4 per second so transcriptions are billed The handler stores audio_transcription_duration in _hidden_params, but the model carried only token cost fields and the response has no token usage, so the transcription cost path fell through to cost_per_second and returned $0. An authenticated caller could transcribe Soniox audio without decrementing their budget. Switch the entry to output_cost_per_second at Soniox's published $0.10/hour async rate so the stored duration produces a real charge. * fix(langfuse): use a dedicated httpx client for the SDK injection The httpx_client handed to the Langfuse SDK came from _get_httpx_client(), which returns LiteLLM's globally cached HTTPHandler. If Langfuse closed that client on teardown it would invalidate the shared client used by every other LiteLLM HTTP call. Build a dedicated httpx.Client instead, still resolving SSL verification and client certificate from LiteLLM's configuration. * fix(soniox): prefer caller-supplied api_base over SONIOX_API_BASE env var * fix(cohere): support max_completion_tokens on cohere v2 chat (default route) (#29779) * fix(cohere): support max_completion_tokens on cohere v2 chat The default cohere_chat route resolves to CohereV2ChatConfig, which did not list or map max_completion_tokens, so get_optional_params raised UnsupportedParamsError for the standard OpenAI parameter (the modern replacement for the deprecated max_tokens). The v1 config already maps it to cohere's max_tokens; mirror that in v2 and add v2 regression tests. * fix(cohere): make max_completion_tokens take precedence over max_tokens on v2 When both max_tokens and max_completion_tokens are supplied, prefer max_completion_tokens explicitly rather than relying on dict iteration order, and cover both orderings with a regression test. 
--------- Co-authored-by: Daniel Yudelevich <4537920+yudelevi@users.noreply.github.com> Co-authored-by: hectorc98 <hector.chamorroalvarez@adyen.com> Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com> Co-authored-by: Terrajlz <info@jouleselectrictech.com> Co-authored-by: Bruno Devaux <devaux.br@gmail.com> Co-authored-by: Dan Lemon <dan@danlemon.com> Co-authored-by: Saswat <saswatds@users.noreply.github.com> Co-authored-by: Brian Sparker <brainsparker@users.noreply.github.com> Co-authored-by: Zhao73 <156770117+Zhao73@users.noreply.github.com> Co-authored-by: Urain Ahmad Shah <60431964+urainshah@users.noreply.github.com> Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: kape <168134658+kapelame@users.noreply.github.com> Co-authored-by: danisalvaa <159898202+danisalvaa@users.noreply.github.com> Co-authored-by: Just R <remixingmagelang@gmail.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> Co-authored-by: abhay23-AI <abhaytrivedi22@gmail.com> |
||
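For the Cohere v2 change in the staging commit above, the stated rule is that max_completion_tokens wins over max_tokens regardless of the order in which the two were supplied. A minimal sketch of such a mapping, using a hypothetical standalone function rather than the actual CohereV2ChatConfig method:

```python
from typing import Any, Dict


def map_max_tokens(params: Dict[str, Any]) -> Dict[str, Any]:
    """Map OpenAI-style token limits onto a single provider-side max_tokens.

    Hypothetical helper: precedence is decided explicitly, so it does not
    depend on dict iteration order.
    """
    mapped: Dict[str, Any] = {}
    if params.get("max_completion_tokens") is not None:
        mapped["max_tokens"] = params["max_completion_tokens"]
    elif params.get("max_tokens") is not None:
        mapped["max_tokens"] = params["max_tokens"]
    return mapped
```

Checking `is not None` rather than truthiness mirrors the same commit's MCP timeout fix, where a value of 0.0 must not be treated as absent.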
|
|
4a5644d51e
|
refactor(ui): centralize proxy base URL resolution into tested resolver (#29793)
* refactor(ui): centralize proxy base URL resolution into tested resolver
The API base URL join logic was hand-rolled inside networking.tsx and re-derived inline at hundreds of call sites, with no test coverage and a latent double-slash bug when the base carried a trailing slash. This pulls the join into a single pure resolveApiBase() with full unit coverage and routes the existing resolution through it, also de-duplicating the env precedence ladder that was copied in two places.
* test(ui): assert root-path redirect joins prefix exactly once
The existing toContain check accepts a doubled separator; tighten it to a strict prefix match plus a no-double-slash assertion so a regression in the resolveApiBase origin+SERVER_ROOT_PATH join is caught end-to-end. |
||
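The double-slash bug described in the commit above is easy to reproduce with naive string concatenation. The sketch below shows the join idea in Python for consistency with the rest of the repository; it is not the dashboard's TypeScript resolveApiBase(), and join_base is a hypothetical name.

```python
def join_base(origin: str, root_path: str = "") -> str:
    """Join an origin and an optional root path with exactly one separator."""
    origin = origin.rstrip("/")  # tolerate a trailing slash on the base
    path = root_path.strip("/")  # tolerate leading/trailing slashes on the prefix
    return f"{origin}/{path}" if path else origin
```

For example, `join_base("http://localhost:4000/", "/litellm/")` yields `http://localhost:4000/litellm`, where plain concatenation would have produced a doubled separator.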
|
|
a4f57032e0
|
fix(ui): route MCP playground auth by oauth2 mode instead of token_url (#29714)
Interactive PKCE and OBO servers were mislabeled as M2M, so passthrough never showed the Authorize gate; classify by oauth2_flow + delegate_auth_to_upstream instead. |
||
|
|
84247d954d
|
test(ci): record/replay OpenAI image gen so the spend E2E isn't outage-bound (#29787)
* test(ci): record/replay OpenAI image gen so the spend E2E isn't outage-bound
The dockerized spend test test_key_info_spend_values_image_generation curls the proxy for a gpt-image-1 image, which wildcard-routes to real api.openai.com on every commit; an OpenAI outage then reddens unrelated PRs and each run pays for an image. Add an in-repo record/replay reverse proxy (tests/_openai_record_replay_proxy.py) that sits between the proxy and OpenAI. The first run, and the first after the recording lapses, records live; subsequent runs replay from the shared Redis cassette store. The proxy keeps its real separate-process HTTP topology; only the image model's api_base is pointed at the recorder in CI via IMAGE_GEN_RECORDER_BASE_URL, which is unset elsewhere so it falls back to api.openai.com. Recordings lapse 24h after write and are never refreshed on read, matching the VCR persister contract, so provider drift is still caught. Replayed responses drop upstream framing/server headers (content-length, transfer-encoding, content-encoding, date, server) so the re-serving layer recomputes them, honoring the Bedrock content-length lesson.
* test(ci): close recorder http client on app shutdown
Add a Starlette lifespan that closes the self-created httpx.AsyncClient on teardown, and leave caller-injected clients untouched so reuse across create_app calls is not broken. Covers the unclosed-client ResourceWarning raised in review. |
||
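The header handling on replay described in the commit above can be sketched as a simple case-insensitive filter. The header names come from the commit message; the function name and the plain-dict header shape are assumptions for illustration.

```python
from typing import Dict

# Framing/server headers named in the commit message; the re-serving
# layer is expected to recompute these for the replayed body.
DROPPED_ON_REPLAY = frozenset(
    {"content-length", "transfer-encoding", "content-encoding", "date", "server"}
)


def replay_headers(recorded: Dict[str, str]) -> Dict[str, str]:
    """Return the recorded headers minus the ones that must be recomputed."""
    return {
        name: value
        for name, value in recorded.items()
        if name.lower() not in DROPPED_ON_REPLAY
    }
```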
|
|
939cff0455
|
test(vcr): stop refreshing cassette TTL on read so cassettes lapse after 24h (#29784)
The Redis cassette persister slid the 24h TTL forward on every successful read, so any cassette replayed at least once per day never expired. With CI running more than once a day that means a recorded response is replayed forever and the suite never re-hits the provider, so a changed request or response contract goes undetected indefinitely. Drop the refresh-on-read. The TTL now counts down from the last write, so a cassette lapses 24h after it was recorded and the next run past that point re-records live and catches provider drift. Per-commit runs in between still replay from cache; only the one boundary-crossing run goes live. |
||
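The difference between a sliding TTL and a TTL that counts down from the last write, as described in the commit above, can be shown with a small in-memory store. This is a sketch under stated assumptions: the real persister is Redis-backed, and the class name and injectable clock here are hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CassetteStore:
    """In-memory stand-in for a cassette store whose TTL starts at write time."""

    def __init__(
        self,
        ttl_seconds: float = 24 * 60 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def write(self, key: str, body: bytes) -> None:
        # The expiry is fixed when the cassette is recorded.
        self._entries[key] = (self._clock() + self._ttl, body)

    def read(self, key: str) -> Optional[bytes]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, body = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lapsed: the caller should re-record live
            return None
        return body  # deliberately no TTL refresh on a successful read
```

With a refresh in `read`, any cassette replayed at least once per TTL window would never expire, which is the behavior the commit removes.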
|
|
074455c138
|
fix(auth): expand all-team-models sentinel in can_key_call_model for batch validation (#29746)
* fix(auth): expand all-team-models sentinel in can_key_call_model
Keys with models=["all-team-models"] were denied during batch JSONL model validation because can_key_call_model matched the literal string against the model name. Add _resolve_key_models_for_auth_check to expand the sentinel to team_models before the check, consistent with get_key_models in model_checks.py and the completion-route bypass.
Co-authored-by: Cursor <cursoragent@cursor.com>
* docs(auth): document empty team_models unrestricted access behavior; add regression test
Adds a docstring note to _resolve_key_models_for_auth_check explaining that when team_models is empty, all-team-models resolves to [] which is treated as unrestricted access (consistent with get_key_models behavior on other auth paths). Adds a test to lock in this behavior.
* fix(auth): deny all-team-models access when key has no team_id
A key configured with models=["all-team-models"] but no team_id could previously resolve to an empty allowlist, which _check_model_access_helper treats as unrestricted access. Now the sentinel is only expanded when team_id is set; otherwise the unresolved sentinel stays in the model list and causes a deny (no real model name matches it). Same fix applied to get_key_models in model_checks.py for consistency across batch and non-batch auth paths.
* style: black format model_checks.py
* Fix batch all-team-models auth
* style: black format batch_rate_limiter.py
* fix(test): add tool_use_system_prompt_tokens to model prices schema validator
* fix(batch): catch get_team_object errors to avoid 404 escaping batch auth
* fix(batch): apply per-member model scope check after team auth in batch validation
* Fail closed on batch team auth fetch errors
* test(batch): cover team_object grant and member-scope denial in batch auth
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> |
||
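The sentinel handling described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `resolve_key_models` and `can_call` are hypothetical stand-ins for `_resolve_key_models_for_auth_check` and the allowlist check, and the "empty allowlist means unrestricted" rule is taken from the commit message.

```python
from typing import List, Optional

ALL_TEAM_MODELS = "all-team-models"  # sentinel string named in the commit message


def resolve_key_models(
    key_models: List[str], team_id: Optional[str], team_models: List[str]
) -> List[str]:
    """Hypothetical stand-in for _resolve_key_models_for_auth_check."""
    if ALL_TEAM_MODELS not in key_models:
        return list(key_models)
    if team_id is None:
        # No team to expand against: keep the unresolved sentinel so that
        # no real model name can match it and the request is denied.
        return list(key_models)
    return list(team_models)


def can_call(model: str, allowed: List[str]) -> bool:
    # Per the commit message, an empty allowlist is treated as unrestricted.
    return not allowed or model in allowed
```

With these assumptions, a key carrying only the sentinel and no team_id is denied for every model, while the same key on a team with an empty team_models list stays unrestricted.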
|
|
89f177b7b6
|
fix(galileo): use ingest traces API and standard logging payload (#29651)
* fix(galileo): use ingest traces API and standard logging payload
Switch hosted Galileo logging to /ingest/traces with nested trace/span payloads, read metrics from standard_logging_object, and include cost and total tokens on trace metrics.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(galileo): route username/password auth to v2 traces ingest
Hosted Galileo no longer serves /observe/ingest; JWT login should post the same trace payload to /v2/projects/{project_id}/traces.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(galileo): address Greptile review on logging and timestamps
Use debug-level logs for per-request Galileo callback messages and fall back to start_time/end_time when standard_logging_object omits startTime/endTime.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(galileo): add Galileo to proxy UI callback configuration
Expose Galileo in the admin callback selector and config APIs so credentials can be configured through the dashboard instead of YAML only.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(galileo): align response type logging with Langfuse
Mirror Langfuse input/output handling for rerank, speech, transcription,
realtime, pass-through, and other response types so Galileo ingest no longer
skips supported call types.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(galileo): redact trace payload in debug logs and format with black
Avoid logging prompts and model responses in flush debug output while
keeping structural metadata for troubleshooting.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(galileo): stop logging full trace payload in debug output
Log only flush URL and trace count so prompts and model responses are not
written to application logs when debug logging is enabled.
Co-authored-by: Cursor <cursoragent@cursor.com>
* Fix Galileo token totals and prompt messages
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
|
||
|
|
ffd0e9fa7f
|
[internal copy of #27491] fix(realtime): Fix Realtime Audio Token Cost Tracking (#29722)
* Normalize Realtime usage dict keys before ResponseAPIUsage transform
* Test usage transform for Realtime versus tokens_details keys
* Avoid usage_input dict in-place
* Fix audio cost calculation
* fix(responses): forward output audio_tokens into completion usage details
Pass audio_tokens from output_tokens_details into CompletionTokensDetailsWrapper so cost can use output_cost_per_audio_token. Support dict output details like prompt path. Extend tests for Realtime and mixed completion audio.
Co-authored-by: Cursor <cursoragent@cursor.com>
* Fix audio token usage formatting
* style: Black-format Realtime usage and completion usage merge
Resolve combine_usage_objects and responses/utils wrapping for CI black --check. Restore model_fields comments above completion_tokens_details merge loop.
Co-authored-by: Cursor <cursoragent@cursor.com>
* Add test to cover combined usage objects
* Fix merge conflict with test cases
Removed unnecessary import statement and cleaned up assertions in test.
* fix(cost_calculator): remove dead None guard in completion_tokens_details combiner
---------
Co-authored-by: Liam McDonald <lmcdonald@godaddy.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
|
||
|
|
3f79222350
|
fix(proxy): persist oauth2_flow on MCP server registration (#29690) | ||
|
|
1c741b91c0
|
fix(anthropic): route Claude Opus 4.8 through adaptive thinking (#29702)
* fix(anthropic): route Claude Opus 4.8 through adaptive thinking
Opus 4.8 uses the same adaptive thinking contract as 4.6/4.7 (thinking.type=adaptive plus output_config.effort), but _is_adaptive_thinking_model only recognized 4.6/4.7 by name and otherwise leaned on the supports_adaptive_thinking cost-map flag. The Bedrock, Vertex, and Azure 4.8 entries don't carry that flag, so a bedrock/us.anthropic.claude-opus-4-8 request fell back to the legacy thinking.type=enabled shape and Bedrock rejected it with "thinking.type.enabled is not supported for this model".
Add _is_claude_4_8_model and wire it in next to the existing 4.6/4.7 matchers in the adaptive-thinking detection, the effort=max gate, and the supported-params check, so every provider path treats 4.8 as adaptive regardless of whether its cost-map entry advertises the flag.
* refactor(anthropic): drive Opus 4.8 adaptive thinking from the cost map
Replace the _is_claude_4_8_model name matcher with cost-map data. Add supports_adaptive_thinking to every Opus 4.8 provider variant (Bedrock regional/global, Vertex, Azure) in both the root and bundled cost maps, and move the prefix-resolving capability lookup (_supports_model_capability) down to AnthropicModelInfo so _is_adaptive_thinking_model reads the flag through the bedrock/invoke/, bedrock/, and vertex_ai/ prefixes. The 4.6/4.7 name checks stay as a fallback since their provider entries don't carry the flag yet.
A pure data fix is not enough on its own: _supports_factory doesn't strip the us.anthropic./invoke/ prefixes, so bedrock/invoke/us.anthropic.claude-opus-4-8 would still miss the flag without the resolver change.
Add a cost-map guardrail test asserting every claude-opus-4-8 variant carries the flag, so a future variant added without it fails CI instead of silently sending the legacy thinking.type=enabled shape that the provider rejects.
|
||
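To illustrate why the data fix needs the resolver change, here is a rough sketch of a prefix-resolving capability lookup. Everything in it is an assumption made for illustration: the function names, the region prefix list, and the shape and keys of the cost map are hypothetical and are not taken from `_supports_model_capability`.

```python
from typing import Dict, List

# Route prefixes named in the commit message; longest first so that
# "bedrock/invoke/" wins over "bedrock/".
ROUTE_PREFIXES = ("bedrock/invoke/", "bedrock/", "vertex_ai/")
# Assumed region prefixes, for illustration only.
REGION_PREFIXES = ("us.", "eu.", "global.")


def candidate_keys(model: str) -> List[str]:
    """Return the model name plus progressively stripped variants."""
    keys = [model]
    for prefix in ROUTE_PREFIXES:
        if model.startswith(prefix):
            model = model[len(prefix):]
            keys.append(model)
            break
    for prefix in REGION_PREFIXES:
        if model.startswith(prefix):
            keys.append(model[len(prefix):])
            break
    return keys


def supports_capability(model: str, capability: str, cost_map: Dict[str, dict]) -> bool:
    """True when any candidate key carries the capability flag."""
    return any(cost_map.get(key, {}).get(capability) is True for key in candidate_keys(model))


# Hypothetical cost-map entry keyed by the fully stripped name.
COST_MAP = {"anthropic.claude-opus-4-8": {"supports_adaptive_thinking": True}}
```

Under these assumptions, `supports_capability("bedrock/invoke/us.anthropic.claude-opus-4-8", "supports_adaptive_thinking", COST_MAP)` finds the flag, whereas a lookup on the unstripped name alone would miss it.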
|
|
8259d6cd85
|
fix: small CLAUDE.md nit (#29749) | ||
|
|
778a7f752d
|
Support OAuth M2M for Databricks Apps A2A agents (#29586)
* Add OAuth M2M support for A2A agents targeting Databricks Apps
Databricks App endpoints reject static bearer tokens and require a short-lived OAuth token minted via the workspace OIDC token endpoint. A2A agents could previously only authenticate outbound with static_headers or client header passthrough, so Databricks App agents could not be registered. Agents configured with a databricks_oauth block in litellm_params now mint and cache a client_credentials token and attach it as the outbound Authorization header on both message/send and message/stream calls, overriding any statically configured Authorization.
* Add tests covering Databricks App OAuth token error paths
Cover the HTTP status error, transport error, non-object JSON body, and invalid expires_in fallback branches in the token cache so the failure handling is locked in by regression tests.
* Harden Databricks App OAuth token cache
Cap the cache TTL at the token's own lifetime so a token whose validity is shorter than the refresh buffer is never cached and served stale; include a digest of client_secret in the cache key so a rotated secret mints a fresh token instead of reusing the old one; and prune the per-key lock when its cached token is evicted so the lock map stays bounded by the live key set.
* Clear per-key locks on Databricks OAuth cache flush
* fix(a2a/databricks): mint OAuth token via Basic auth header, not unsupported auth= kwarg
litellm's AsyncHTTPHandler.post (what get_async_httpx_client returns) has no auth parameter, so minting a Databricks App OAuth token raised "AsyncHTTPHandler.post() got an unexpected keyword argument 'auth'" before any network call ever left the proxy, breaking the feature end to end. The handler also calls raise_for_status() internally and re-raises a MaskedHTTPStatusError (a subclass of httpx.HTTPStatusError), so the explicit raise_for_status() after post() was dead code.
Build the HTTP Basic Authorization header by hand and pass it via headers, which is what the Databricks workspace OIDC token endpoint documents for client authentication. The token-cache tests now model the real handler contract with create_autospec so the rejected auth= signature is enforced; the previous mocks accepted any kwargs and silently hid the bug.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* Prune Databricks OAuth lock on the short-lived-token path
When expires_in is below the refresh buffer the token is intentionally not cached, so _remove_key never runs for that key and the per-key lock created by _get_lock leaked permanently. Drop the lock in that branch so _locks stays bounded by the live key set, and assert the cleanup in the short-lived-token test.
* Gate A2A Databricks OAuth on the databricks_oauth block at the call site
Make the gating explicit where the header is applied so it is clear that only agents configured with a databricks_oauth block enter the OAuth path; every other agent is left untouched. Add a regression test asserting a non-Databricks agent never invokes the token resolver.
---------
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
|
||
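Three of the ideas in the commit above (a hand-built Basic header, a cache key that includes a digest of the secret, and a TTL that never outlives the token) can be sketched with the standard library alone. The helper names, the 60-second refresh buffer, and the cache-key layout are assumptions for illustration, not the values used in the PR.

```python
import base64
import hashlib
from typing import Optional

REFRESH_BUFFER_SECONDS = 60  # assumed value; the real buffer is not stated in the log


def basic_auth_header(client_id: str, client_secret: str) -> str:
    """Build an HTTP Basic Authorization header value by hand."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def token_cache_key(token_url: str, client_id: str, client_secret: str) -> str:
    """Include a digest of the secret so a rotated secret misses the old entry."""
    digest = hashlib.sha256(client_secret.encode("utf-8")).hexdigest()[:16]
    return f"{token_url}|{client_id}|{digest}"


def cache_ttl_seconds(expires_in: int) -> Optional[int]:
    """Seconds to cache the token for, or None when it should not be cached.

    A token whose lifetime is at or below the refresh buffer is never cached,
    so it cannot be served after it has expired.
    """
    ttl = expires_in - REFRESH_BUFFER_SECONDS
    return ttl if ttl > 0 else None
```

The header value would then be passed through the request's headers mapping rather than an auth= keyword, which is the change the commit describes.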
|
|
2b7c97bff6
|
fix(vertex/anthropic): handle namespace tools and strip client_metadata for codex compatibility (#29489)
* fix(vertex/anthropic): handle namespace tools and strip client_metadata for codex compatibility
* fix(anthropic): cast nested namespace tools to fix mypy error, skip nameless flat tools
|
||
|
|
df704d9016
|
fix(proxy/hooks): populate llm_provider on internal rate-limit errors (#27707)
* feat(proxy/hooks): add ProxyHTTPRateLimitError + provider resolver
Introduces a small helper layer used by every proxy-side rate-limit
hook so that the 429 they raise carries a populated llm_provider /
model — instead of an empty exception.llm_provider that downstream
loggers (Prometheus failure metric, observability callbacks) read as
'no provider attribution'.
ProxyHTTPRateLimitError inherits from both fastapi.HTTPException
(so the proxy server still renders it as a 429) and
litellm.exceptions.RateLimitError (so isinstance checks and
PrometheusLogger._get_exception_class_name pick up llm_provider).
We deliberately don't call RateLimitError.__init__ — it constructs
an httpx.Response we don't need and would just add failure surface;
attribute parity is what downstream consumers care about.
resolve_llm_provider_for_rate_limit() wraps litellm.get_llm_provider
defensively. Internal limiter hooks fire from async_pre_call_hook —
well before get_llm_provider runs anywhere else in the request
lifecycle — so we have to call it ourselves at raise time. If the
model is missing or unparseable (alias, router-only model) we fall
back to llm_provider='litellm_proxy' rather than letting a second
exception leak out and break the request path.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(proxy/hooks): populate llm_provider on parallel-request 429s
Both v1 and v3 parallel-request limiters fired bare HTTPException(429)
from inside async_pre_call_hook. The downstream Prometheus failure
metric reads exception.llm_provider via _get_exception_class_name —
the empty value showed up as exception_class='HTTPException' and
left model_id='None' on the time series.
Threads requested_model through every raise site in:
* parallel_request_limiter.py:
- check_key_in_limits (the per-key/per-model/per-user/per-team/
per-customer over-limit path)
- raise_rate_limit_error (zero-limit + global_max_parallel_requests
paths) — now takes an optional requested_model kwarg
* parallel_request_limiter_v3.py:
- _handle_rate_limit_error (the OVER_LIMIT translator), called
from both the should_rate_limit pre-check and the TPM
reservation path
Resolved via resolve_llm_provider_for_rate_limit so unknown / missing
models silently fall back to llm_provider='litellm_proxy' instead of
breaking the request path with a second exception.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(proxy/hooks): populate llm_provider on dynamic-rate-limit 429s
Same plumbing change as the parallel limiters, applied to both
dynamic_rate_limiter (v1) and dynamic_rate_limiter_v3:
* v1: TPM-zero and RPM-zero paths in async_pre_call_hook now resolve
data['model'] -> (model, llm_provider) once and pass it into both
raises.
* v3: All three raise sites in _check_rate_limits — the
model_saturation_check enforced raise, the priority_model
enforced raise, and the fail-closed unknown-descriptor branch —
now attribute the 429 to the actual provider.
Falls back to llm_provider='litellm_proxy' when the model can't be
resolved.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(proxy/hooks): populate llm_provider on batch-rate-limit 429s
batch_rate_limiter._raise_rate_limit_error now takes a
requested_model kwarg threaded from data['model'] in
_check_and_increment_batch_counters. The batch-creation 429 is what
gets raised when the input file's tokens/requests count would push
the per-key TPM/RPM window over its limit.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(proxy/hooks): populate llm_provider on budget/iterations 429s
Final batch of internal raise sites — the user/session-budget and
max-iterations hooks. Same pattern: resolve data['model'] once at
raise time, attach to ProxyHTTPRateLimitError so Prometheus and
observability callbacks can attribute the 429.
Hooks updated:
* max_budget_limiter (per-user max_budget exceeded)
* max_iterations_limiter (per-session agent iteration cap)
* max_budget_per_session_limiter (per-session dollar cap)
All three fall back to llm_provider='litellm_proxy' when data['model']
is missing or unparseable. Drops the now-unused HTTPException import
from each module.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* test(proxy/hooks): pin provider field on internal rate-limit 429s
Regression coverage for the 'provider field missing' bug across every
proxy-side rate-limit hook + the helper layer:
* ProxyHTTPRateLimitError class shape (HTTPException + RateLimitError,
dict-detail stringification, None-provider normalization).
* resolve_llm_provider_for_rate_limit happy paths
(gpt-4o-mini, anthropic/..., bedrock/...) plus all three fallback
branches (None, '', unknown name) plus a 'get_llm_provider raises'
case that asserts we swallow the secondary exception.
* For each limiter (parallel v1/v3, dynamic v1/v3, batch,
max_budget, max_iterations, max_budget_per_session): assert the
raised exception is a RateLimitError carrying the resolved
model + llm_provider, and a sibling test that asserts the
fallback path returns 'litellm_proxy' without leaking a second
exception.
* Two PrometheusLogger._get_exception_class_name pins so the
Prometheus failure metric label flips from 'HTTPException' to
'Openai.ProxyHTTPRateLimitError' (or 'Litellm_proxy.*' on
fallback) — that's what dashboards consume.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* perf(proxy/hooks): defer provider resolution to over-limit branches
* fix: use error_message in raise_rate_limit_error to avoid literal 'None' in detail
* Consolidate rate_limiter_utils imports in dynamic_rate_limiter
* fix(proxy): set num_retries/max_retries on ProxyHTTPRateLimitError
ProxyHTTPRateLimitError inherits from RateLimitError but did not call
RateLimitError.__init__, so num_retries/max_retries were never set.
When Starlette's HTTPException lacks __str__, MRO falls through to
RateLimitError.__str__, which unconditionally reads these attributes
and raises AttributeError during logging/traceback formatting.
Initialize them to None defensively.
* fix(mypy): silence base-class status_code conflict on ProxyHTTPRateLimitError
HTTPException declares 'status_code: int' while openai.RateLimitError
(via APIStatusError) declares 'status_code: Literal[429] = 429'. Mypy
flags the multi-base override as [misc] in CI lint. The runtime semantics
are fine (we set self.status_code in __init__), so silence the
class-level annotation conflict with a targeted ignore.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
---------
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
|
||
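The helper layer described in the commit above can be sketched without importing FastAPI or LiteLLM. In this sketch the two base classes are local stand-ins for fastapi.HTTPException and litellm.exceptions.RateLimitError, the provider lookup is injected as a plain callable, and all names other than the 'litellm_proxy' fallback are hypothetical.

```python
from typing import Callable, Optional, Tuple

FALLBACK_PROVIDER = "litellm_proxy"  # fallback value named in the commit message


class HTTPErrorStandIn(Exception):
    """Local stand-in for fastapi.HTTPException."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


class RateLimitStandIn(Exception):
    """Local stand-in for litellm.exceptions.RateLimitError."""


class ProxyRateLimitSketch(HTTPErrorStandIn, RateLimitStandIn):
    """Renders as a 429 and still passes rate-limit isinstance checks."""

    def __init__(self, detail: str, model: Optional[str], llm_provider: Optional[str]) -> None:
        HTTPErrorStandIn.__init__(self, 429, detail)
        self.model = model
        self.llm_provider = llm_provider or FALLBACK_PROVIDER
        # Set defensively so string formatting on the rate-limit side never
        # hits a missing attribute.
        self.num_retries = None
        self.max_retries = None


def resolve_provider_for_rate_limit(
    model: Optional[str], get_provider: Callable[[str], Tuple[str, str]]
) -> Tuple[Optional[str], str]:
    """Resolve (model, provider) and never let a second exception escape."""
    if not model:
        return None, FALLBACK_PROVIDER
    try:
        resolved_model, provider = get_provider(model)
    except Exception:
        return model, FALLBACK_PROVIDER
    return resolved_model, provider or FALLBACK_PROVIDER


def fake_get_provider(model: str) -> Tuple[str, str]:
    """Toy lookup: 'provider/name' resolves, anything else raises."""
    if "/" not in model:
        raise ValueError(f"unknown model: {model}")
    provider, name = model.split("/", 1)
    return name, provider
```

A raise site would then look like `raise ProxyRateLimitSketch("over limit", *resolve_provider_for_rate_limit(data.get("model"), fake_get_provider))`, with the resolution done only on the over-limit branch.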
|
|
812a2217ca
|
[internal copy of #29511] feat(guardrails): add sensitive data routing to on-premise models (#29531)
* feat(guardrails): add sensitive data routing to on-premise models
When a guardrail detects sensitive data, route to an on-premise model instead of blocking or redacting. All subsequent requests in that session continue routing to the same model (sticky routing).
New config options for guardrails:
- on_sensitive_data: 'block' (default) or 'route'
- sensitive_data_route_to_model: target model for rerouting
- sticky_session_routing: persist routing for session (default: true)
New exception SensitiveDataRouteException triggers rerouting when raised by guardrails. The proxy catches it, stores the routing decision in cache, and modifies the request's model field.
New hook _PROXY_SensitiveDataRoutingHandler checks incoming requests against cached routing decisions and applies sticky routing.
https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK
* fix: black formatting for custom_guardrail.py
https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK
* test: improve test coverage for sensitive data routing feature
Add additional tests for:
- Cache key format and TTL constants
- Session ID extraction from multiple locations
- Custom guardrail initialization with routing config
- Exception string representation and custom messages
- Redis cache paths including fallback behavior
- Edge cases in pre-call hook
https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK
* fix: use correct GuardrailRaisedException parameters
Replace invalid 'source' parameter with 'guardrail_name' to match the exception's actual signature.
https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK
* test: move sensitive data routing tests to hooks directory
Move test file to align with source code structure.
https://claude.ai/code/session_01SQd4isBa3UyouRoGVou9dK
* fix(guardrails): honor sticky_session_routing flag and scope session routing per API key
Propagate sticky_session_routing through SensitiveDataRouteException so a guardrail configured with sticky_session_routing=False reroutes only the triggering request without persisting a session override. Scope the routing cache key to the requesting API key so sessions from different tenants cannot collide, and warn when sticky routing is requested but the hook is not registered.
* refactor(guardrails): dedupe session-id extraction and drop redundant import
Extract the shared session-id lookup into get_session_id_from_request_data so the sensitive-data routing hook and CustomGuardrail no longer keep two identical copies of the logic. Remove the redundant local import of GuardrailRaisedException in handle_sensitive_data_detection, and document that detection_info is surfaced in request metadata and logs so it must not carry raw sensitive values.
* fix(guardrails): guard None user_api_key_dict in sensitive data route handler
* fix(responses): send application/json Content-Type on responses DELETE
OpenAI's responses DELETE endpoint now rejects requests that arrive without a Content-Type header, defaulting them to application/octet-stream and returning 'Unsupported content type: application/octet-stream'. The delete handler sent no body and therefore no Content-Type, so the request failed. Declare application/json on the delete request, matching the OpenAI SDK.
* fix(guardrails): backfill in-memory cache after redis hit in sensitive data routing
When _get_routed_model resolves a routing override from Redis it now also populates the local in-memory cache. Without the write-back, a non-writing instance that only ever reads from Redis would lose the sticky routing decision the moment Redis became unavailable, silently reverting sensitive sessions to the default model.
* fix(guardrails): scope sticky sensitive-data routing to JWT principal
Keyless auth (JWT and similar) has no api_key, so every such caller shared the "default" cache namespace. One authenticated user could reuse another user's session_id, trip the guardrail, and silently force the other user's subsequent requests onto the cached on-prem model for the TTL. Resolve the routing tenant from the api_key when present, otherwise from a stable principal built from the user/team/org identity, before reading or writing the session route.
* fix(guardrails): require route target model when on_sensitive_data='route'
* fix(guardrails): mark user_api_key_dict Optional in sensitive-data route handler
* fix(guardrails): use remaining redis ttl for local backfill and str env default
* fix(guardrails): graceful block when routing configured but no session_id
handle_sensitive_data_detection promised to raise only SensitiveDataRouteException or GuardrailRaisedException, but when routing was configured and the request had no session_id it let a ValueError from raise_sensitive_data_route_exception propagate, surfacing as an HTTP 500 instead of a block. Fall back to a graceful block in that case so the documented contract holds.
* fix(guardrails): run remaining guardrails after sensitive-data reroute
Defer the SensitiveDataRouteException until every guardrail in the pre-call loop has run, so downstream security guardrails are no longer skipped when an earlier guardrail triggers routing. The first reroute wins and a later guardrail that blocks still propagates. Also normalize on_sensitive_data to lowercase like sibling on_* config fields so case-insensitive values are accepted.
* fix(guardrails): classify sensitive-data reroute as guardrail intervention
* fix(guardrails): record sensitive-data reroute as prometheus intervention not error
* fix(guardrails): record service span for routing guardrail and move case-normalizer to base params
Drop the early continue so a guardrail that signals sensitive-data routing still emits its PROXY_PRE_CALL service span like every other callback. Move the lowercase normalizer onto BaseLitellmParams so on_sensitive_data is normalized consistently when BaseLitellmParams is constructed directly, matching the cross-field route->model validator that already lives on the base.
|
||
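The tenant scoping described in the commit above can be sketched as a small key-building helper. The function names and the key layout are hypothetical, and a real implementation would presumably work with a hashed API key rather than the raw value; only the idea (api_key first, otherwise a principal built from user/team/org) comes from the commit message.

```python
from typing import Optional


def routing_tenant(
    api_key: Optional[str],
    user_id: Optional[str] = None,
    team_id: Optional[str] = None,
    org_id: Optional[str] = None,
) -> str:
    """Pick the namespace that owns a sticky routing decision."""
    if api_key:
        return f"key:{api_key}"
    # Keyless auth (for example JWT): build a stable principal instead of
    # letting every caller share one namespace.
    parts = [
        f"{label}:{value}"
        for label, value in (("user", user_id), ("team", team_id), ("org", org_id))
        if value
    ]
    return ("principal:" + "|".join(parts)) if parts else "default"


def routing_cache_key(session_id: str, tenant: str) -> str:
    """Cache key for a session's routing override, scoped to its tenant."""
    return f"sensitive_data_route:{tenant}:{session_id}"
```

With this scoping, two JWT users who present the same session_id read and write different cache entries, so one cannot force the other's requests onto the rerouted model.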
|
|
56aa55b991
|
fix(proxy): stop team BYOK model name corruption on model edit (#29731)
* fix(proxy): stop team model name corruption on edit (#28382) (#29001)
Team-scoped ("Team-BYOK") models store an internal routing key model_name_{team_id}_{uuid} in the model_name column and the user-facing name in model_info.team_public_model_name. The internal name leaked into /v1, /v2, and /model/info responses; the dashboard bound its edit form to it, so any non-rename save (e.g. a TPM tweak) PATCHed the internal name back. The update path then treated it as a rename, overwriting team_public_model_name and rewriting the team's models[] ACL with the mangled string -- breaking team key calls with team_model_access_denied.
Two-layer fix:
- Read path (root cause): add _translate_model_name_for_response and apply it in model_info_v2 and _get_proxy_model_info so /v1, /v2, and /model/info surface the public name for team-scoped rows. The DB column and router index keep the internal name as the routing key; this is a presentation-layer swap on a shallow copy (never mutates input).
- Write path (defense in depth): harden _get_public_model_name so a value matching the internal shape, or a no-op against the current DB column, is never treated as a rename -- for both the top-level model_name and an explicit model_info.team_public_model_name.
Tests: regression for the reported scenario, full branch coverage of _get_public_model_name, two internal-shape guard cases, an end-to-end PATCH through _update_team_model_in_db (asserts the team ACL is untouched), and four response-translation cases. 60 passed (model management), 181 passed (proxy server).
* fix(ui): key Agent Builder agent selection on model_info.id (#29729)
* fix(ui): key Agent Builder agent selection on model_info.id
Once team-scoped BYOK models can share a public name (the backend now returns the public name on /model/info instead of the internal routing key), selecting agents by model_name collides. Key selection, create, update and delete on the stable model_info.id instead, falling back to model_name only for config-defined agents that have no id.
* fix(ui): add name-match fallback to post-create agent selection
If the just-created agent's id is not yet present in the re-fetched list, try matching by name before falling back to the first agent. Addresses greptile review on #29729.
---------
Co-authored-by: tushar8408 <32977767+tushar8408@users.noreply.github.com>
|
||
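The read-path half of the fix above, a presentation-layer swap on a shallow copy, can be sketched as follows. The function name and the row shape are illustrative assumptions; only the field names model_name and model_info.team_public_model_name come from the commit message.

```python
from typing import Any, Dict


def translate_model_name_for_response(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row that shows the public name, if there is one.

    The stored row keeps the internal routing key; only the copy handed to
    the response is changed.
    """
    public_name = (row.get("model_info") or {}).get("team_public_model_name")
    if not public_name:
        return row
    translated = dict(row)  # shallow copy: the input dict is never mutated
    translated["model_name"] = public_name
    return translated


stored = {
    "model_name": "model_name_team-1_1234",  # internal routing key (made-up value)
    "model_info": {"id": "abc", "team_public_model_name": "my-gpt"},
}
shown = translate_model_name_for_response(stored)
```

Here `shown["model_name"]` is the public name while `stored["model_name"]` still holds the internal routing key, which is the property the commit's "never mutates input" note relies on.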
|
|
f3811ce63b
|
refactor(ui): shared HTTP client + location-pinned fetch() lint rule (#29723)
* refactor(ui): add shared HTTP client and pin raw fetch() to one file
Introduce src/lib/http/client.ts, a single typed wrapper that owns the only fetch() in the dashboard. It centralizes the base URL, the auth header, error parsing (deriveErrorMessage), non-2xx -> thrown ApiError, and JSON parsing, and is framework-agnostic (no React) so it can run from client and, later, server components. The base URL, auth header name and the logout side effect are injected through createApiClient.
networking.tsx builds one configured apiClient and the 29 functions whose boilerplate maps exactly to the client's default behavior (canonical deriveErrorMessage + handleError + res.json() template) now call it instead of hand-rolling fetch. Names, signatures, return types and error behavior are unchanged; this is a pure refactor that drops ~440 lines.
The no-restricted-syntax fetch rule now points at the client and a files: ["src/lib/http/**"] override makes that the only place fetch() is allowed. Re-baselined eslint-suppressions.json: networking.tsx fetch suppressions drop 270 -> 241; no other rule's counts change. The remaining networking.tsx fetches and the ~61 scattered component/hook fetches diverge from the default client behavior (text() error bodies, no res.ok check, no handleError side effect) and stay grandfathered for a follow-up burndown.
* fix(ui): make the HTTP client tolerate non-JSON error bodies
The non-2xx branch parsed the error body with response.json(), so a gateway returning HTML (502/503 from a reverse proxy) threw a SyntaxError before onError fired or ApiError was built, dropping the user-facing notification. This matched the old per-function behavior, but the client is now the single error path so it is the right place to harden. Read the body as text once, try JSON.parse for the existing deriveErrorMessage path, and fall back to the raw text (or the HTTP status) otherwise. The success path stays strict json() so return types are unchanged.
* fix(ui): await the returned apiClient promise in 6 migrated functions
The codemod rendered the `return response.json()` tail as `return apiClient.x()` without `await`. Inside the surrounding try/catch that returns an unawaited promise, so the catch never runs and its console.error log is dropped on failure; 4 of the 6 were `return await response.json()` originally, so this restores their exact behavior. Use `return await apiClient.x()` in all six.
* refactor(ui): widen onError type and handle empty success bodies
Address review notes on the shared client. Type onError as (message: string) => void | Promise<void> so the fire-and-forget async contract (networking passes the async handleError) is explicit rather than silently discarded by void. On the success path, read the body as text and return undefined for an empty body (e.g. a 204 No Content) instead of throwing a SyntaxError, while still parsing non-empty bodies strictly so a malformed JSON response surfaces rather than being masked. Add tests for the 204 case.
|
||
|
|
3bd89f209e
|
Litellm jwt mapping virtualkeys (#28510)
* restore an explicit no-match policy * fix(jwt): fix AUTO_REGISTER sentinel bypass, race condition, and inline import comment - AUTO_REGISTER now evicts stale __NO_MAPPING__ sentinel instead of silently returning None when cached under a prior fallback_team_mapping config - Race condition in _auto_register_jwt_mapping: catch P2002 unique-constraint violation on concurrent creates, fetch the winning mapping, proceed cleanly - Added comment on inline generate_key_helper_fn import explaining the circular dependency (key_management_endpoints imports user_api_key_auth at line 51) - 3 new tests: stale sentinel eviction, race condition winner fallback, and the existing auto_register happy path Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(jwt): cache __NO_MAPPING__ sentinel before raising 403 in REJECT mode REJECT mode was raising HTTPException immediately on a DB miss without writing the __NO_MAPPING__ sentinel, causing every subsequent rejected request to re-query the DB. Write the sentinel first so repeated rejections are served from cache within virtual_key_mapping_cache_ttl. Adds test asserting DB is not hit on the second reject after a cache-warm miss. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(jwt): enforce no-match policy when prisma_client is None The early `if prisma_client is None: return None` guard ran before the no-match policy check, silently bypassing REJECT and AUTO_REGISTER — every JWT client fell through to team auth regardless of configuration. Fix: treat prisma_client=None as a definitive DB miss and fall through to the same policy block as a real miss. REJECT now raises 403, AUTO_REGISTER raises 500 with a clear message (can't create keys without a DB), FALLBACK_TEAM_MAPPING returns None unchanged. Adds three tests: REJECT/403 with no DB, FALLBACK returns None with no DB, AUTO_REGISTER/500 with no DB. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(jwt): consistent AUTO_REGISTER on cached sentinel; clean up race orphans Addresses Greptile review on PR #25570 cherry-pick. 1. Inconsistent AUTO_REGISTER when __NO_MAPPING__ sentinel is cached: The cached-sentinel branch silently returned None when prisma_client was None, while the fresh path raised HTTP 500 under the same config. Same request, different access-control outcome depending on cache state. Both paths now raise the same 500. 2. Orphaned virtual keys from race-condition losers: On unique-constraint conflict, generate_key_helper_fn had already persisted an unrestricted virtual key in LiteLLM_VerificationToken with the cleartext in request memory. Under sustained concurrency these accumulated indefinitely. The loser now deletes its orphan before falling back to the winner's mapping; failure to delete is logged but does not fail the request. Also corrects a latent FK bug surfaced while fixing #2: the mapping row was storing the plaintext key in LiteLLM_JWTKeyMapping.token, but that column FKs to the hashed LiteLLM_VerificationToken.token — now hashed at the call site. Tests: - updated test_auto_register_creates_key_and_mapping to assert the hashed token is stored, not the plaintext - updated test_auto_register_race_condition_unique_conflict to assert the orphan is deleted with the correct hashed token - added test_auto_register_raises_500_when_sentinel_cached_and_no_db - added test_auto_register_race_conflict_tolerates_delete_failure Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jwt): close REJECT bypass when JWT omits the configured claim field A JWT presented without the configured `virtual_key_claim_field` previously returned None at the `claim_value is None` guard before the `unregistered_jwt_client_behavior` check ran. 
A caller who knows the configured claim-field name could bypass REJECT by simply omitting that field and falling through to team-based JWT auth. Apply the no-match policy on a missing claim: - REJECT → 403 - AUTO_REGISTER → 403 (no stable identity to map; refuse rather than create a sentinel-keyed record) - FALLBACK_TEAM_MAPPING → return None (unchanged, backward-compatible) Adds three tests covering each branch of the missing-claim path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jwt): AUTO_REGISTER inherits team_id so keys are bounded by team limits Auto-registered virtual keys were created with no team, model, route, rate, or budget constraints — broader access than the standard team-based JWT auth path the same client would have taken. Under AUTO_REGISTER, resolve the team_id from the JWT (via the operator-configured team_id_jwt_field / team_id_default) and stamp it on the new key. Downstream auth then applies the team's budget/models/tpm/rpm/allowed_routes via the existing virtual-key flow. Policy when team_id_jwt_field is configured: - JWT carries team claim → stamp resolved team_id - JWT lacks claim + team_id_default set → stamp default - JWT lacks claim + no default → 403 (refuse to create an unbounded key) When neither team_id_jwt_field nor team_id_default is configured, the operator has explicitly opted out of team-based limits — the auto-created key has no team_id (matches what team-auth would do in the same config). Adds 4 tests covering each branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jwt): make AUTO_REGISTER functional in prod; raise on missing winner Two correctness fixes flagged by Greptile on the AUTO_REGISTER path: 1. generate_key_helper_fn was called without table_name="key". Without that, the helper falls into the user-upsert branch (table_name in (None, "user")) and tries to insert into LiteLLM_UserTable with user_id=None, which hits the NOT NULL @id constraint. 
AUTO_REGISTER would never have succeeded in production. Now passes table_name="key" explicitly, matching the /key/generate caller. 2. When the race loser refetches the winner's mapping and gets None (winner row concurrently deleted), the previous code returned None, and the caller in _resolve_jwt_to_virtual_key then fell through to less-restrictive team-based JWT auth, silently bypassing the configured AUTO_REGISTER policy. Now raises HTTP 503 so the caller retries against a stable state rather than getting unintended fallback access. Adds one test for the 503 winner-vanishes path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jwt): defer AUTO_REGISTER until JWT policy is enforced by auth_builder Closes the JWT policy bypass on the AUTO_REGISTER path flagged by veria-ai. Before: when unregistered_jwt_client_behavior=auto_register and the JWT's claim was unmapped, _resolve_jwt_to_virtual_key validated the JWT signature and then immediately created a virtual key + mapping. JWTAuthManager.auth_builder never ran for the first request (the new key short-circuited the team-auth path), and every subsequent request hit the cached mapping, so custom_validate, RBAC, scope_mappings, and user_allowed_email_domain were never enforced for auto-registered clients. After: _resolve_jwt_to_virtual_key returns a _PendingAutoRegister signal instead of creating the key. The caller in _user_api_key_auth_builder runs JWTAuthManager.auth_builder, then, only on a validated, policy-passing result, calls _auto_register_jwt_mapping with the team_id / user_id from that result. The created key inherits team + user limits from the validated identity, and future cache hits load that already-policy-checked key. Also drops the interim _resolve_inherited_team_id helper that pulled team_id from raw JWT claims (same bypass risk); team_id now comes exclusively from auth_builder.
Tests: - Rewrote two existing tests to assert _resolve_jwt_to_virtual_key returns _PendingAutoRegister (no key created yet) for both the fresh-DB-miss and stale-sentinel branches - Added a contract test that _auto_register_jwt_mapping stamps the validated team_id/user_id onto generate_key_helper_fn - Removed four stale team-binding tests that exercised the prior raw-claim helper Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Update user_api_key_auth.py * fix(jwt): cache proxy-admin AUTO_REGISTER path to avoid repeated DB lookups Cache-miss regression introduced by the deferred-auto-register refactor: when a JWT under AUTO_REGISTER resolved to a proxy admin, the is_proxy_admin early-return in _user_api_key_auth_builder ran *before* the pending auto-register cache-write block. Result: no cache entry, so every subsequent proxy-admin request re-queried get_jwt_key_mapping_object indefinitely. Fix: write a __JWT_PROXY_ADMIN__ sentinel to user_api_key_cache before the early return when a pending auto-register existed. _resolve_jwt_to_virtual_key treats that sentinel as "skip mapping, fall through to auth_builder", so future requests from the same JWT identity hit the cache instead of the DB. auth_builder still runs full JWT policy on every request — only the mapping DB lookup is short-circuited. Adds one test asserting the sentinel cache-hit returns None without hitting prisma_client.db.litellm_jwtkeymapping.find_first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(proxy): stamp org context on JWT auto-registered keys AUTO_REGISTER keys were created with team_id and user_id only, so org budget checks were skipped after switching to the key-scoped path. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com> |
||
|
|
41e90a6ada
|
chore(ui): remove the bare-fetch lint rule (#29712)
* fix(ui): only flag bare fetch() outside React Query queryFn/mutationFn The frontend lint rule banned every fetch() call by static AST name match, so a fetch wrapped in a React Query queryFn/mutationFn tripped it just like a loose fetch in a component. esquery (no-restricted-syntax) can't express "has ancestor", so this replaces that selector with a small custom rule (local/no-bare-fetch) that exempts a fetch lexically inside a queryFn or mutationFn and reports everything else. Re-baselined eslint-suppressions.json under the new rule id (same 44 files / 331 violations) so existing code keeps its grandfathered suppressions. Adds a RuleTester suite covering wrapped (valid) vs unwrapped, the standalone *Api.ts function pattern, queryKey, and computed-key cases. * chore(ui): remove the bare-fetch lint rule Drop the fetch lint gate (and its 331 grandfathered suppressions) ahead of the networking refactor. The plan is to centralize all fetching in a single shared http client and enforce that with a location-based rule, so keeping a fetch rule in place now would only block CI while functions are routed through the new client. Removing it unblocks that work; the location-based rule lands with the client in a follow-up. |
||
|
|
770fff7058
|
test(proxy): stop running real-DB tests in GitHub Actions unit jobs (#29700)
* test(proxy): stop running real-DB tests in GitHub Actions unit jobs GitHub Actions unit jobs were spinning up a Postgres service container, but the only active tests that touched it either used the DB incidentally (a cargo-culted prisma_client.connect()) or were genuine integration tests mislabeled as unit. Mock the incidental ones so the proxy-db job needs no container, and move the tests that genuinely need a database (proxy management behavior, master-key-not-persisted, schema-migration sync) to CircleCI, which is already the real-infrastructure lane. * test(proxy): restore no-unexpected-startup-writes canary in master-key test Greptile noted the hash-match assertion no longer catches other unexpected startup writes (a default key, a rotation artifact). The CircleCI job gives each run a fresh DB, so a clean startup must leave the table empty; add that canary back alongside the precise master-key assertion. |
||
|
|
1dbf46665e
|
test: make custom_tokenizer proxy tests hermetic (#29643)
test_custom_tokenizer_bug.py loaded Xenova/llama-3-tokenizer from HuggingFace Hub at test time, so it flaked on shared CI runners whenever HF returned 429 Too Many Requests; the surfaced LocalEntryNotFoundError made it look like a connectivity bug. Rewrite the suite to mock the one network boundary (litellm.utils.Tokenizer.from_pretrained) while running the proxy's real extraction-and-selection path. The regression test now asserts the configured identifier from model_info.custom_tokenizer actually reaches from_pretrained and that the response reports the huggingface tokenizer, which the previous llama-3-named test could not distinguish from the default path. A control test pins the no-custom-tokenizer case to the OpenAI tokenizer with from_pretrained asserted unused. Verified by reintroducing the original bug (model_info left unpopulated from the deployment): the regression test fails (from_pretrained called 0 times) while the control stays green. |
||
|
|
9344f205a8
|
fix(proxy): add default=None to LiteLLM_TeamMembership.litellm_budget_table (#29684)
In Pydantic v2, Optional[T] without a default is a required field. Any row with budget_id=null triggered a validation error and returned 401. Co-authored-by: Florent Chenebault <florent.chenebault@lifen.fr> |
||
|
|
f9142d7961
|
fix(helm): Enable Backend Deployment to mount Gateway config.yaml (#29605)
* change deployment configs to include a litellm.cache for litellm-backend pod mirroring litellm-gateway pod * omit backend annotations block when config and podAnnotations are both empty * reuse gateway config/configmap for backend instead of separate backend config --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: Tin Chi Lo <tin@Tins-MBP.localdomain> Co-authored-by: Tin Chi Lo <tin@Tins-MacBook-Pro.local> |
||
|
|
568d291b99
|
chore: ignore prettier dashboard reformat in git blame (#29695)
Add the squash-merged SHA of #29622 (style(ui): run prettier --write across the dashboard) to .git-blame-ignore-revs so the bulk reformat stops masking the real authors of those lines in git blame and the GitHub blame UI |
||
|
|
7edf3a9cb5
|
style(ui): run prettier --write across the dashboard (#29622)
Formatting-only pass; no logic changes. Brings the UI into compliance with .prettierrc so the new format-check CI job passes |
||
|
|
cb041966bf
|
Litellm oss staging 040626 (#29671)
* fix(azure): apply api_version fallback chain to image edit URL
`AzureImageEditConfig.get_complete_url` only read `api_version` from
`litellm_params`. When callers configured it via `litellm.api_version`
or `AZURE_API_VERSION`, the constructed URL had no `?api-version=` and
Azure responded `404 Resource not found`.
Apply the same fallback chain the Azure chat path already uses in
`common_utils.py`:
litellm_params > litellm.api_version > AZURE_API_VERSION env >
litellm.AZURE_DEFAULT_API_VERSION
Adds 5 unit tests pinning each layer of the chain plus a regression
guard for `api_base` that already carries `?api-version=`.
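The fallback chain described above can be sketched roughly as follows. This is a minimal illustration, not the code in `AzureImageEditConfig`: the function names, the `module_api_version` parameter (standing in for `litellm.api_version`), and the placeholder default value are all assumptions made for the example.

```python
import os
from typing import Mapping, Optional

# Hypothetical placeholder standing in for litellm.AZURE_DEFAULT_API_VERSION.
DEFAULT_API_VERSION = "default-version-placeholder"


def resolve_api_version(
    litellm_params: Mapping[str, Optional[str]],
    module_api_version: Optional[str] = None,  # stands in for litellm.api_version
    env: Optional[Mapping[str, str]] = None,
    default: str = DEFAULT_API_VERSION,
) -> str:
    """Pick the first configured value: params > module > env > default."""
    env = os.environ if env is None else env
    return (
        litellm_params.get("api_version")
        or module_api_version
        or env.get("AZURE_API_VERSION")
        or default
    )


def append_api_version(url: str, api_version: str) -> str:
    """Add ?api-version= unless the URL (e.g. api_base) already carries one."""
    if "api-version=" in url:
        return url
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}api-version={api_version}"
```

Each layer is consulted only when the one before it is unset, and a URL that already has `?api-version=` is left untouched, which is the case the regression guard covers.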
* feat(mcp): core sampling and elicitation flow with security hardening
- Add sampling_handler.py: full MCP sampling/createMessage flow with
model selection (hint-based + priority-based), auth enforcement,
budget checks, route restriction gates, and tag policy pre-auth
- Add elicitation_handler.py: MCP elicitation/create relay with
downstream client capability detection
- Wire sampling/elicitation callbacks in mcp_server_manager.py
gated behind allow_sampling/allow_elicitation config flags
- Add allow_sampling/allow_elicitation fields to MCPServer type
- Fix session lock deadlock: skip lock for JSON-RPC response POSTs
(elicitation/sampling replies) with truncated-body heuristic
- Extend client.py with sampling_callback and elicitation_callback
- Security: RouteChecks gate, tag-budget bypass fix, x-forwarded-for
spoofing fix, Latin-1 header encoding guard
- Add 4 new test modules (model access, priority selection, request
builder, tool conversion) + update existing MCP tests
* fix(security): run pre-call guardrails before MCP sampling acompletion
Without this, an upstream MCP server with allow_sampling enabled could
send prompts that bypass every guardrail (content filtering, PII
redaction, prompt-injection detection) configured on /chat/completions.
- Call proxy_logging_obj.pre_call_hook(call_type='acompletion') before
llm_router.acompletion so guardrails fire for sampling sub-calls
- Add HTTPException to the re-raise list so guardrail rejections
propagate correctly instead of being swallowed as generic errors
* feat(bedrock_mantle): add Responses API support (/openai/v1/responses) (#29490)
* feat(bedrock_mantle): add Responses API transformation config
* test(bedrock_mantle): cover trailing-slash api_base normalization
* feat(bedrock_mantle): export BedrockMantleResponsesAPIConfig
* feat(bedrock_mantle): register gpt-5.x Responses config (gpt-oss unchanged)
* feat(bedrock_mantle): add gpt-5.5/gpt-5.4 Responses price-map entries
* refactor(bedrock_mantle): exclude gpt-oss instead of allow-listing gpt-5 for Responses routing
Frontier OpenAI models on Bedrock Mantle are Responses-only on /openai/v1/responses;
gpt-oss is the legacy family that also speaks chat-completions. Gate by excluding
gpt-oss (which keeps its chat-completions emulation) and defaulting everything else
to the native Responses config, so future frontier models (gpt-6, etc.) route
correctly without a code change. Verified against the live us-east-2 Mantle endpoint:
gpt-oss 400s on /openai/v1/responses while gpt-5.5 400s on both standard paths.
* test(bedrock_mantle): cover supports_native_websocket opt-out
Closes the one uncovered line flagged by codecov on the Responses config.
The assertion documents that Mantle Responses has no realtime/websocket
transport, so realtime routing must not attempt a socket it cannot serve.
* fix(bedrock_mantle): route file_search through emulation instead of forwarding to Mantle
BedrockMantleResponsesAPIConfig inherited supports_native_file_search()
-> True from OpenAIResponsesAPIConfig but never overrode it. Mantle has no
OpenAI vector stores, so a forwarded file_search tool is rejected with a
400 (verified upstream: Tool type 'file_search' is not supported). Opting
out, like the existing supports_native_websocket override, routes the tool
through LiteLLM's file_search emulation instead.
* fix(bedrock_mantle): only route openai.gpt frontier models to Responses
The previous gate excluded gpt-oss and routed every other model to the
native Responses config. But on Mantle only the OpenAI gpt frontier models
(gpt-5.x) are served on /openai/v1/responses; gpt-oss and the non-OpenAI
families (nvidia, mistral, google, zai, ...) are chat-completions only and
400 on that path. Allow-list the openai.gpt- family (excluding gpt-oss)
instead, so chat-only models fall through to the chat-completions emulation.
Verified against the live us-east-2 endpoint: nvidia.nemotron-nano-9b-v2
returns 400 on /openai/v1/responses and 200 on /v1/chat/completions.
* feat(custom_llm): allow streaming/astreaming to yield ModelResponseStream (#27580)
* fix(custom_llm): allow streaming/astreaming to yield ModelResponseStream directly
* fix(streaming): enhance ModelResponseStream handling for custom LLM providers
* fix(streaming): strip finish_reason from content chunks and ensure tool_calls are preserved
* fix(streaming): add type ignore for finish_reason assignment in CustomStreamWrapper
* fix(proxy): strip stack trace from HTTP 503 responses (CWE-209) (#28330)
* fix(proxy/cwe-209): strip Python traceback from HTTP 503 error responses
The /cache/ping endpoint included a full Python traceback in its 503 error
response body (inside the ProxyException message), leaking internal file
paths, line numbers, and call stacks to any caller. Two MCP route handlers
in proxy_server.py similarly interpolated str(e) into "Internal server
error" detail strings.
Fix: log the traceback server-side via verbose_proxy_logger.exception()
and omit it from the ProxyException payload / HTTPException detail returned
to clients. Tests updated to assert no "traceback" keyword or frame paths
appear in the 503 body, with a new dedicated regression test.
CWE-209: Generation of Error Message Containing Sensitive Information.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(proxy/cwe-209): apply Greptile P2 fixes and add MCP exception-path tests
Greptile 4/5 review identified two remaining gaps and Codecov reported
0% coverage on the two MCP handler exception branches:
1. caching_routes.py — str(e) in "Service Unhealthy ({str(e)})" could
still leak Redis hostnames/IPs; replaced with static "Service Unhealthy".
HTTPException is now re-raised before the generic handler so the
"cache not initialized" 503 still reaches callers with its detail.
Removed the redundant str(e) arg from verbose_proxy_logger.exception()
(exception() already appends the traceback automatically).
2. tests — two new unit tests cover the exception paths in
dynamic_mcp_route and toolset_mcp_route that were previously at 0%:
- test_dynamic_mcp_route_unexpected_exception_returns_500_without_traceback
- test_toolset_mcp_route_unexpected_exception_returns_500_without_traceback
All 25 tests pass (9 caching + 16 MCP).
CWE-209: Generation of Error Message Containing Sensitive Information.
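As a rough sketch of the pattern these two commits describe (log the failure server-side, return a static message to the client), assuming a hypothetical `ping_cache` helper and a simplified `{"error": {...}}` payload rather than the real `ProxyException` class:

```python
import logging

logger = logging.getLogger("proxy_sketch")  # hypothetical logger name


def ping_cache(ping):
    """Call `ping` and return (status_code, body) without leaking internals."""
    try:
        ping()
        return 200, {"status": "healthy"}
    except Exception:
        # logger.exception records the message plus the full traceback
        # on the server; nothing from the exception reaches the client.
        logger.exception("cache ping failed")
        return 503, {"error": {"message": "Service Unhealthy", "code": 503}}
```

The response body is built only from static strings, so hostnames, file paths, and stack frames from the exception cannot end up in it.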
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test(caching_routes): restore precise assertion in test_cache_ping_no_cache_initialized
The assertion was weakened to `"Cache not initialized" in str(data)`, which
matches the raw string of the entire response dict and would pass even if the
error moved to an unexpected field or changed structure.
Restore a targeted check on the parsed response: assert the exact string in
the correct field `data["detail"]`, matching FastAPI's HTTPException
serialisation format {"detail": "<message>"}.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test(caching_routes): restore precise assertion and add CWE-209 no-cache path test
The assertion in test_cache_ping_no_cache_initialized was weakened to
`"Cache not initialized" in str(data)`, which matched against the raw string
representation of the entire response dict. This would pass silently even if
the error message moved to an unexpected field or the structure changed.
Restore a targeted assertion on the parsed field:
assert data["detail"] == "Cache not initialized. litellm.cache is None"
matching FastAPI's HTTPException serialisation format exactly.
Add test_cache_ping_no_cache_does_not_expose_internals to show the code path
is still working correctly after the CWE-209 fix: verifies that the HTTPException
is re-raised as-is (no traceback, no source paths), and asserts the complete
response structure is exactly {"detail": "Cache not initialized. litellm.cache is None"}.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(caching_routes): restore ProxyException envelope for null-cache 503
The except HTTPException: raise guard (added in the CWE-209 fix) caused
the null-cache HTTPException to escape as FastAPI's {"detail": "..."} shape
instead of the {"error": {...}} ProxyException envelope that callers expect.
Move the null-cache guard before the try block and raise ProxyException
directly so the response structure is consistent with all other /cache/ping
503s, and the except HTTPException: raise guard is only reachable by
unexpected downstream HTTPExceptions.
Update the two no-cache tests to assert the correct ProxyException envelope.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Update utils.py (#26609)
* feat(pricing): add Snowflake Cortex REST API model pricing (#26612)
* feat(pricing): add Snowflake Cortex REST API model pricing
## Summary
Adds pricing and context window information for 20+ Snowflake Cortex REST API models to `model_prices_and_context_window.json`.
## What's included
- **7 Claude models** (sonnet-4-5, sonnet-4-6, 4-sonnet, 4-opus, haiku-4-5, 3-7-sonnet, 3-5-sonnet) — with prompt caching rates
- **4 OpenAI models** (gpt-4.1, gpt-5, gpt-5-mini, gpt-5-nano) — with prompt caching rates
- **5 Llama models** (3.1-8b, 3.1-70b, 3.1-405b, 3.3-70b, 4-maverick)
- **1 DeepSeek model** (deepseek-r1)
- **1 Mistral model** (mistral-large2)
- **1 Snowflake model** (snowflake-llama-3.3-70b)
- **2 Embedding models** (arctic-embed-l-v2.0, arctic-embed-m-v2.0)
Each entry includes `input_cost_per_token`, `output_cost_per_token`, `cache_read_input_token_cost` (where applicable), `max_input_tokens`, `max_output_tokens`, and capability flags (`supports_function_calling`, `supports_vision`, `supports_prompt_caching`, `supports_reasoning`).
## Pricing source
All prices are in USD per token, sourced from the official [Snowflake Service Consumption Table](https://www.snowflake.com/legal-files/CreditConsumptionTable.pdf) — Tables 6(b) (REST API with Prompt Caching) and 6(c) (REST API).
## Context
The existing `snowflake/` provider has zero model entries in the pricing JSON, which means LiteLLM cannot track costs for Snowflake Cortex calls. This PR fills that gap.
## Related
- Existing provider: `litellm/llms/snowflake/`
- Cortex REST API docs: https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-rest-api
* Update model_prices_and_context_window.json
Fix the JSON parsing error
* Update model_prices_and_context_window.json
Removed the duplicate entry
* fix(utils): copy extra_body before adding unknown params to prevent model config mutation (#29620)
Fixes #29615. In add_provider_specific_params_to_optional_params, the line:
extra_body = passed_params.pop("extra_body", None) or {}
returns the original dict reference when extra_body is non-empty (truthy).
Subsequent writes like extra_body[k] = passed_params[k] then mutate the
shared model config object held by the router, poisoning /model/info and
all subsequent requests for that deployment.
The or {} short-circuit creates a new dict only when extra_body is falsy
(None or {}), which is why the bug does not reproduce with extra_body: {}.
Fix: wrap in dict() so we always work on a fresh shallow copy.
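The aliasing behaviour can be reproduced in isolation. The two functions below are simplified, hypothetical stand-ins for the relevant lines of add_provider_specific_params_to_optional_params, not the function itself:

```python
def merge_unknown_params_buggy(passed_params: dict) -> dict:
    # `x or {}` returns x itself when x is a non-empty dict,
    # so the writes below land in the caller's object.
    extra_body = passed_params.pop("extra_body", None) or {}
    for key in list(passed_params):
        extra_body[key] = passed_params[key]
    return extra_body


def merge_unknown_params_fixed(passed_params: dict) -> dict:
    # dict(...) always produces a fresh shallow copy.
    extra_body = dict(passed_params.pop("extra_body", None) or {})
    for key in list(passed_params):
        extra_body[key] = passed_params[key]
    return extra_body


shared_config = {"existing": 1}          # stands in for the router's model config
merge_unknown_params_buggy({"extra_body": shared_config, "top_k": 5})
print(shared_config)                     # now also contains "top_k"

safe_config = {"existing": 1}
merge_unknown_params_fixed({"extra_body": safe_config, "top_k": 5})
print(safe_config)                       # unchanged
```

The copy is shallow, so nested values inside extra_body are still shared; the fix covers the top-level key writes described above.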
* fix(vertex_ai): Bake tool_choice into Gemini CachedContent body to prevent silent drop (#29097)
* fix(vertex_ai): bake tool_choice into Gemini CachedContent body to prevent silent drop
* address greptile feedback on tool_choice cache test
* adds test that uses ToolConfig(functionCallingConfig=FunctionCallingConfig(mode=ANY)) instead of a dict literal, mirroring what map_tool_choice_values actually produces
* fix(gemini/veo): move image from parameters into instances[0] (#29501)
* fix(gemini/veo): move image from parameters into instances[0]
Veo's predictLongRunning schema puts image (and prompt) on the
instances element; parameters is for aspectRatio/durationSeconds/etc.
The Gemini path was leaving image in params_copy, so it ended up
nested under parameters and the API silently ignored it.
The Vertex path already builds the instance dict explicitly, so this
just aligns the Gemini path with it.
Fixes #29498
* address greptile: unconditional pop + BytesIO test
- Pop `image` from params_copy unconditionally so it never reaches
GeminiVideoGenerationParameters even when None, removing implicit
reliance on Pydantic's extra-field-ignore.
- Add test_transform_video_create_request_image_filelike_goes_to_instance
covering the BytesIO path (_convert_image_to_gemini_format) — round-trips
the base64 to confirm encoding.
- Add test_transform_video_create_request_image_none_is_dropped covering
the new None branch.
* fix(huggingface): handle special token text in embedding usage (#29660)
* fix(guardrails): recompile ToolPermissionGuardrail rules on update_in_memory_litellm_params (#29655)
* fix(guardrails): recompile ToolPermissionGuardrail rules on update_in_memory_litellm_params
ToolPermissionGuardrail builds self.rules and the compiled target/pattern
maps only in __init__. The base update_in_memory_litellm_params re-sets raw
attributes via setattr but never rebuilds those maps, so a guardrail updated
in place (PUT /guardrails, or the immediate in-memory sync) keeps enforcing
the construction-time rules until it is reinitialized (PATCH path, periodic
DB poll, or restart).
Extract the compile step into _load_rules and override
update_in_memory_litellm_params to rebuild from it (dict- and model-safe),
re-normalizing default_action / on_disallowed_action. Mirrors the existing
PresidioGuardrail override of the same method. Adds regression tests.
Fixes #29592.
* fix(guardrails): handle dict params in ToolPermissionGuardrail in-memory update
Delegate to super() only for LitellmParams input (the base setattr loop is
model-only); apply the raw-dict case inline. Fixes the mypy arg-type error
and makes the recompile work when the proxy passes the raw DB dict.
* fix(guardrails): preserve tool-permission rules on a partial in-memory update
A partial update (e.g. a LitellmParams whose rules field is None) ran through
the generic setattr, which set self.rules to None, and the recompile was
skipped, leaving the guardrail with no rules. Snapshot the previous rules and
restore them when the update carries no rules; an explicit empty list still
clears them. Adds a regression test for the rules-absent case.
Addresses the Greptile review note on #29655.
* fix(bedrock): stop base_model label from stripping tools/tool_choice (#29621)
* fix(bedrock): stop base_model label from stripping tools/tool_choice
A Router/proxy Bedrock deployment whose model_info.base_model is a friendly
label (e.g. claude-haiku-4-5) silently lost tools/tool_choice: the outgoing
Converse request was built without toolConfig, so the model behaved as if no
tools were provided. Worked in v1.84.0, regressed in v1.85.0, and with
drop_params=true it failed silently.
Two changes compound into the bug. completion() passed model_info.base_model
as the model argument to get_optional_params, so the real Bedrock model id
never reached supported-param resolution; and get_supported_openai_params
resolved the provider config's params from base_model or model, letting the
label fully replace the real model. For Bedrock the label resolves to no tool
support, so tools/tool_choice were dropped before transformation.
completion() now keeps model as the real deployment model and threads the
resolved base_model (kwarg or model_info) through separately, and
get_supported_openai_params treats base_model as additive: it returns the
union of the params supported by model and by base_model. A hint can only add
capabilities, never strip ones the real model already exposes, which also
preserves the original base_model behavior from #27717 and Azure's base_model
driven model-type detection.
Fixes #29618
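The "additive" treatment of base_model can be illustrated with a small sketch. The `lookup` callable and the capability table are invented for the example; the real get_supported_openai_params resolves params through provider configs.

```python
from typing import Callable, List, Optional


def supported_params(
    model: str,
    base_model: Optional[str],
    lookup: Callable[[str], List[str]],  # hypothetical per-model capability lookup
) -> List[str]:
    """Union of params supported by the real model and by the base_model hint."""
    params = list(lookup(model))
    if base_model and base_model != model:
        for param in lookup(base_model):
            if param not in params:
                params.append(param)
    return params


# Hypothetical capability table: the friendly label resolves to no tool support.
TABLE = {
    "real-bedrock-model-id": ["temperature", "tools", "tool_choice"],
    "friendly-label": ["temperature", "max_tokens"],
}


def table_lookup(name: str) -> List[str]:
    return TABLE.get(name, [])


print(supported_params("real-bedrock-model-id", "friendly-label", table_lookup))
```

Because the result starts from the real model's params, a label with fewer capabilities can add entries but cannot remove `tools` or `tool_choice`.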
* test(main): make base_model param test robust to new parametrize cases
Restore an explicit per-case expected_model_param literal instead of
hardcoding the gemini id, so a future case with a different model can't
produce a misleading assertion failure.
* fix(fireworks_ai): pass response_format json_schema through unchanged (#29606)
FireworksAIConfig.map_openai_params was rewriting the OpenAI strict
`{type: json_schema, json_schema: {name, strict, schema}}` shape into
`{type: json_object, schema: ...}` before sending to Fireworks, dropping
`strict` and `name` and changing the `type`. Per Fireworks' docs json_object
means "force any valid JSON output (no specific schema)", so the schema
constraint was effectively dropped and grammar-guided decoding never ran;
model output silently violated the schema.
The rewrite landed in #7085 (Dec 2024) when Fireworks did not yet accept
native json_schema. Fireworks accepts the OpenAI strict shape natively now,
so the rewrite has become a regression.
Removes the rewrite. Passes response_format through unchanged. Updates the
existing test_map_response_format to assert pass-through. Adds focused
regression tests in tests/test_litellm/ covering preservation of type,
strict, name, and schema body, plus that json_object alone still works.
* fix(types): import Required from typing_extensions in gemini types
* style: reformat sampling_handler.py for py312 black compat
* refactor(mcp-sampling): extract helpers to fix PLR0915 too-many-statements in handle_sampling_create_message
* fix(proxy-server): add explicit ProxyLogging type annotation to proxy_logging_obj to fix mypy inference
* fix(mcp-sampling): suppress mypy assignment error on ImportError fallback for proxy_logging_obj
* fix(test): use .value when comparing LlmProviders enum against string in test_default_api_base
* fix(test): iterate LlmProviders enum in test_default_api_base to avoid str pollution from custom provider registration
litellm.provider_list is a mutable global initialized to list(LlmProviders) but custom_llm_setup() appends plain provider strings to it. When a test_custom_llm.py test runs first in the same xdist worker, provider_list contains a str and calling .value on it raises AttributeError. Iterate the immutable LlmProviders enum instead, which is deterministic and what the check intends.
* fix(mcp): depth-aware JSON-RPC response detection and neutral speed-priority fallback
Replace the flat substring check in the truncated-body routing path with a
top-level-key scan so a JSON-RPC response whose result payload nests a
"method" field is still detected as a response and skips the session lock,
removing a deadlock against the in-flight tool call awaiting it.
Drop the inverse max_output_tokens speed proxy when no model exposes
output_tokens_per_second; context-window size does not track latency, so a
neutral score avoids biasing speedPriority toward the smallest-context model.
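A top-level-key scan of the kind described in the first paragraph might look like the sketch below. It is an assumption-laden illustration: the function names are invented, and the rule "has `result` or `error` and no top-level `method`" is one plausible reading of JSON-RPC 2.0 response detection, not necessarily the exact check the proxy uses.

```python
from typing import Set


def top_level_keys(body: str) -> Set[str]:
    """Collect object keys at nesting depth 1, tolerating a truncated body."""
    keys: Set[str] = set()
    depth = 0
    i, n = 0, len(body)
    while i < n:
        ch = body[i]
        if ch == '"':
            # Read the whole string so braces inside it do not affect depth.
            j = i + 1
            chars = []
            closed = False
            while j < n:
                if body[j] == "\\":
                    chars.append(body[j:j + 2])  # keep escape sequences raw
                    j += 2
                    continue
                if body[j] == '"':
                    closed = True
                    break
                chars.append(body[j])
                j += 1
            if not closed:
                break  # body was cut off mid-string
            k = j + 1
            while k < n and body[k] in " \t\r\n":
                k += 1
            # A string followed by ':' is a key; record it only at depth 1.
            if depth == 1 and k < n and body[k] == ":":
                keys.add("".join(chars))
            i = j + 1
            continue
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        i += 1
    return keys


def looks_like_jsonrpc_response(body: str) -> bool:
    keys = top_level_keys(body)
    return "method" not in keys and bool(keys & {"result", "error"})
```

Unlike a flat substring check for `"method"`, this ignores a `method` field nested inside the `result` payload, which is the case the commit says caused the deadlock.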
* fix(guardrails): make ToolPermission rule reload atomic on invalid regex
_load_rules appended each rule to self.rules before compiling its regex, so an
invalid pattern raised mid-loop after the bad rule was already live but without
a _compiled_rule_targets entry. _matches_regex reads a missing compiled target
as a None pattern and returns True, turning the bad rule into a match-all that
silently applies its decision to every tool. Via update_in_memory_litellm_params
(PUT /guardrails) this corrupted the live guardrail.
Build the parsed rules and compiled maps into locals and swap them in only after
every regex compiles, and restore the previous ruleset if a live update is
rejected, so an invalid regex now fails the update without leaving the guardrail
enforcing a broken policy.
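The build-then-swap approach can be sketched with a toy class. `RuleSet`, its attribute names, and the rule dict shape are hypothetical simplifications, not the ToolPermissionGuardrail implementation:

```python
import re
from typing import Dict, List, Pattern


class RuleSet:
    """Toy stand-in for a guardrail that holds rules plus compiled patterns."""

    def __init__(self, rules: List[dict]) -> None:
        self.rules: List[dict] = []
        self.compiled: Dict[str, Pattern[str]] = {}
        self.load_rules(rules)

    def load_rules(self, rules: List[dict]) -> None:
        # Build into locals first; re.compile raises re.error on a bad
        # pattern before any live state has been touched.
        new_rules: List[dict] = []
        new_compiled: Dict[str, Pattern[str]] = {}
        for rule in rules:
            new_compiled[rule["id"]] = re.compile(rule["pattern"])
            new_rules.append(rule)
        # Swap only after every pattern compiled successfully.
        self.rules, self.compiled = new_rules, new_compiled
```

With this shape, a rejected update leaves the previous rules and compiled map in place, so there is never a rule in `rules` without a matching compiled entry.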
* test(mcp): cover sampling conversion, model resolution, and elicitation relay paths
The MCP sampling and elicitation handlers shipped with partial test
coverage, leaving the response-to-MCP conversion, the model resolution
fallback chain, completion-kwargs assembly, guardrail routing, and the
entire elicitation relay untested. That pulled the PR's diff (patch)
coverage below the codecov threshold even though overall project
coverage rose.
Add focused unit tests for _convert_openai_response_to_mcp_result,
_convert_mcp_tools_to_openai, _convert_mcp_tool_choice_to_openai, image
and audio content conversion, the hint-matching and fallback branches of
_resolve_model_from_preferences, _build_completion_kwargs, the router and
guardrail-rejection paths of _run_guardrails_and_call_llm, the
handle_sampling_create_message success and error-propagation flows, the
marker-hoisting fallback for tool content on unexpected roles, and the
elicitation form/url/generic relay together with its decline paths
---------
Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: lengkejun <lengkejun@xd.com>
Co-authored-by: Yug <yugborana000@gmail.com>
Co-authored-by: Kent <72616338+kingdoooo@users.noreply.github.com>
Co-authored-by: tanmay958 <53569547+tanmay958@users.noreply.github.com>
Co-authored-by: DrishnaTrivedi <142084770+DrishnaTrivedi@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Navnit Shukla <Navnit.shukla25@gmail.com>
Co-authored-by: PRABHU KIRAN VANDRANKI <72809214+VANDRANKI@users.noreply.github.com>
Co-authored-by: Adrian Lopez <109683617+adriangomez24@users.noreply.github.com>
Co-authored-by: hcl <chenglunhu@gmail.com>
Co-authored-by: JooHo Lee <96564470+BWAAEEEK@users.noreply.github.com>
Co-authored-by: Dinesh Girbide <85330597+Dinesh-Girbide@users.noreply.github.com>
Co-authored-by: cloudwiz <22098246+andrey-dubnik@users.noreply.github.com>
Co-authored-by: Ahmad Khan <ahmadkhan2508@gmail.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>
|
||
|
|
ed073d382d
|
fix(gemini-realtime): use GA event names for Pipecat 1.3.x compatibility (#29662)
* fix(gemini-realtime): use GA event names for Pipecat 1.3.x compatibility
Pipecat v1.3.0 adopted the OpenAI Realtime API GA event naming:
response.audio.delta -> response.output_audio.delta
response.text.delta -> response.output_text.delta
response.audio.done -> response.output_audio.done
response.text.done -> response.output_text.done
The proxy was still emitting the old beta names; Pipecat's `parse_server_event` raises "Unimplemented server event type" for any unknown type, which killed the receive task handler and broke audio playback and tool-call delivery.
Also:
- conversation.item.created -> conversation.item.added (already handled)
- client audio is buffered until backend setupComplete in deferred mode
- call_id fallback UUID when Gemini returns empty id
- status_details / token detail fields added to Pydantic-strict events
The _GA_TO_BETA_EVENT_TYPES map in RealTimeStreaming already translates GA names back to beta for clients that opt in with the openai-beta header, so legacy clients are unaffected.
Co-authored-by: Cursor <cursoragent@cursor.com> * fix(gemini-realtime): address greptile review comments - emit outputTranscription as response.output_audio_transcript.delta instead of suppressing it; GA_TO_BETA map handles translation for legacy clients - cap pre-setup audio buffer at 200 frames to prevent memory exhaustion; log a warning when the limit is hit and additional frames are dropped - log remaining dropped message count on flush error Co-authored-by: Cursor <cursoragent@cursor.com> * fix(gemini-realtime): address veria review comments - remove unused OpenAIRealtimeConversationItemCreated import - fix guardrail bypass: semantic_vad early-return now preserves create_response when set so a guardrail-injected create_response:false is not silently dropped - add per-connection 10 MB byte cap alongside the 200-frame count cap for the pre-setup audio buffer to prevent memory exhaustion Co-authored-by: Cursor <cursoragent@cursor.com> * fix(gemini-realtime): fix mypy arg-type on _finalize_gemini_live_setup setup parameter typed as BidiGenerateContentSetup to match the TypedDict passed at both call sites; was dict which mypy rejected. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(gemini-realtime): widen _finalize_gemini_live_setup to Dict[str, Any] BidiGenerateContentSetup (TypedDict) is a subtype of Dict[str,Any] so both call sites (one passing a plain dict, one passing the TypedDict) satisfy mypy. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(gemini-realtime): cast BidiGenerateContentSetup to Dict at _finalize call site mypy rejects TypedDict as dict[str, Any] argument; cast at the call site where follow_up_setup is BidiGenerateContentSetup to satisfy the checker. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Fix Gemini realtime beta compatibility * Fix deferred Gemini setup audio ordering * fix: preserve Gemini audio transcript ids * fix(realtime): cap pre-setup client buffer on all append paths Route every append to the deferred-setup pending buffer through the per-connection message/byte caps. Previously only the audio-buffer fast path enforced the caps; once one frame was buffered, a client that withheld session.update could stream arbitrary frames into _pending_messages_until_setup unbounded and exhaust proxy memory. * style(gemini-realtime): apply black formatting to transformation.py * fix(gemini-realtime): log beta-translation fallback and name native-audio marker Surface the previously swallowed exception in _send_event_to_client so a failed GA->beta translation is observable instead of silently forwarding the untranslated event. Extract the native-audio model substring used by _finalize_gemini_live_setup into a named constant documenting why speechConfig is dropped on those setups. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> |
||
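The beta/GA renaming in the commit above can be pictured as a lookup table applied on the way out to the client. The following is a minimal sketch, not the proxy's code: `BETA_TO_GA`, `GA_TO_BETA`, and `translate_event` are hypothetical names, the `client_wants_beta` flag stands in for the openai-beta header check, and only the name pairs listed in the commit message are included.

```python
# Hypothetical illustration of GA -> beta event-name translation.
# The pairs are copied from the commit message; everything else is assumed.
BETA_TO_GA = {
    "response.audio.delta": "response.output_audio.delta",
    "response.text.delta": "response.output_text.delta",
    "response.audio.done": "response.output_audio.done",
    "response.text.done": "response.output_text.done",
    "conversation.item.created": "conversation.item.added",
}

# Invert the table so GA events can be mapped back for legacy clients.
GA_TO_BETA = {ga: beta for beta, ga in BETA_TO_GA.items()}


def translate_event(event: dict, client_wants_beta: bool) -> dict:
    """Return the event unchanged for GA clients, renamed for beta clients."""
    if not client_wants_beta:
        return event
    beta_type = GA_TO_BETA.get(event.get("type"))
    if beta_type is None:
        # Event types without a beta equivalent are forwarded as-is.
        return event
    # Copy the event so the caller's dict is not mutated.
    return {**event, "type": beta_type}
```

Keeping GA names as the internal default and translating only at the client boundary means a strict parser such as Pipecat's never sees a name it does not recognise.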
|
|
20dc6dffa4
|
fix(proxy): passthrough 404 when SERVER_ROOT_PATH is set (#29658)
* fix(proxy): match passthrough registry routes bare-to-bare with SERVER_ROOT_PATH
After #28547, get_request_route strips the deployment prefix while registry lookup still re-inflated stored paths via SERVER_ROOT_PATH, causing 404s under paths like /llmproxy/ml. Compare normalized bare routes in both is_registered_pass_through_route and get_registered_pass_through_route.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(proxy): patch utils.get_server_root_path in passthrough auth tests
After removing get_server_root_path from pass_through_endpoints, route and JWT tests must mock litellm.proxy.utils where normalization reads it.
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com> |
||
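A rough picture of "bare-to-bare" matching, assuming both the incoming path and the stored route are normalized by stripping the root path before comparison. `strip_root_path` and `is_registered` are hypothetical helpers written for this note; they are not the proxy's `is_registered_pass_through_route`.

```python
from typing import Iterable


def strip_root_path(path: str, server_root_path: str) -> str:
    """Drop the deployment prefix from a path, if present (hypothetical helper)."""
    root = server_root_path.rstrip("/")
    if root and (path == root or path.startswith(root + "/")):
        path = path[len(root):]
    return path or "/"


def is_registered(request_path: str, registered: Iterable[str], server_root_path: str) -> bool:
    """Compare bare request path against bare stored routes."""
    bare_request = strip_root_path(request_path, server_root_path)
    return any(
        bare_request == strip_root_path(route, server_root_path)
        for route in registered
    )
```

Normalizing both sides the same way avoids the mismatch the commit describes, where one side had the prefix stripped and the other had it added back.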
|
|
216c68db04
|
fix(gemini): googleSearch + server-side tools and googleMaps JSON schema (#29582)
* fix(gemini): keep googleSearch with server-side tools and googleMaps JSON schema
Wire include_server_side_tool_invocations through completion() so mixed google_search and function tools are not dropped on Gemini 3+. Rewrite generationConfig to responseFormat when googleMaps is used with JSON schema.
Fixes #27479
Fixes #29451
Co-authored-by: Cursor <cursoragent@cursor.com>
* address greptile review feedback (greploop iteration 1)
* style: fix black formatting in main.py for py312 compat
* Fix Gemini Google Maps extra_body JSON rewrite
---------
Co-authored-by: Cursor <cursoragent@cursor.com> |
||
|
|
443f0ca4cd
|
ci(ui): frontend-lint job enforcing prettier + eslint on changed files (#29633)
* ci(ui): add frontend-lint job enforcing prettier and eslint on changed files
Lints only the files a PR adds or modifies under ui/litellm-dashboard,
so new and touched code must be prettier-clean and eslint-clean while the
existing tree is grandfathered. Skips cleanly when a PR touches no
lintable UI files. This lets us adopt the formatters incrementally
without a repo-wide reformat
* ci(ui): write frontend-lint file lists to $RUNNER_TEMP
Keep the prettier/eslint changed-file lists out of the checkout dir so
they cannot collide with a future source file of the same name
* lint(ui): baseline existing eslint findings so only new ones block
Capture the current error-level eslint findings (318 across 183 files)
in a committed suppressions baseline via eslint --suppress-all. Every
rule stays at its error severity, so any newly introduced violation
fails the frontend-lint gate, while the existing tree is grandfathered;
touching a legacy file never forces fixing its pre-existing issues. CI
runs eslint with --pass-on-unpruned-suppressions so that fixing a
baselined issue does not fail on a now-stale suppression, and the
generated baseline is prettier-ignored since eslint owns its format.
Burn the baseline down over time with eslint --prune-suppressions
* lint(ui): enforce a count budget for explicit any
Make @typescript-eslint/no-explicit-any a warning and cap the total
instead of hard-blocking each new one. A frontend-lint step counts the
repo-wide explicit any and fails only when it exceeds the committed
budget in eslint-any-budget.json. max starts at 2031, ten above the
current 2021, so the next ten land as warnings and the build fails once
that headroom is gone. Lower max over time toward target to ratchet the
count down. New anys still surface as warnings on changed files via the
normal eslint step
* lint(ui): enable zero-cost rules no-var, no-self-assign, react/no-danger
These have no existing violations, so they need no baseline; turning them
on purely blocks new instances. react/no-danger guards against new
dangerouslySetInnerHTML (XSS), no-var enforces let/const, and
no-self-assign catches self-assignment typos. no-debugger is already
enforced by the recommended preset
* lint(ui): add baselined complexity rules
Enable complexity:20, max-depth:4, max-params:4, max-nested-callbacks:4,
with thresholds set near the codebase p99 so only genuine outliers are
flagged. The 272 existing over-threshold functions are grandfathered in
the suppressions baseline; new over-threshold functions block. Lower the
thresholds over time to ratchet complexity down. max-lines-per-function
is intentionally left off since React components are legitimately long
* lint(ui): ban new raw fetch, standardize on React Query
Add a no-restricted-syntax rule flagging bare fetch() calls, pointing
contributors at React Query (@tanstack/react-query). The rule is not
exempted anywhere, including the already-bloated networking.tsx, so all
331 existing fetch calls are grandfathered but no new ones can be added
there or elsewhere. New data access goes through React Query, and the
networking layer can be migrated out and pruned from the baseline over
time
* lint(ui): ban new @tremor/react imports
Add a no-restricted-imports rule flagging imports from @tremor/react so
tremor is phased out rather than spread further. The 232 existing tremor
imports are grandfathered in the baseline; new ones block and point at
antd. Migrate components off tremor and prune the baseline over time
* lint(ui): widen explicit-any budget headroom to 2040
Raise max from 2031 to 2040, giving ~19 of slack over the current 2021
instead of 10
* style(ui): prettier-format eslint.config.mjs
The frontend-lint gate flagged its own config file. Format it so the
prettier check on this PR's changed files passes
* lint(ui): soften complexity and max-depth to warnings
These two are smell metrics with arbitrary thresholds where a legit new
function can trip them, so make them advisory rather than hard-blocking.
They drop out of the baseline (now 963). max-params, max-nested-callbacks,
and the react-hooks rules stay strict since those are clear-cut
* lint(ui): move complexity and max-depth to the count-budget pattern
Generalize the explicit-any budget into a shared lint-budget mechanism:
eslint-budgets.json maps a rule to {max, target} and check-lint-budgets.mjs
counts each across the repo and fails when a count exceeds its max.
complexity (129, max 140) and max-depth (61, max 70) now use the same
slack-plus-counter model as explicit-any (2021, max 2040): they warn
per-file and the build only fails if the repo-wide total crosses the
ceiling. Lower each max toward its target over time
* docs(ui): note pruning the eslint suppressions baseline when fixing lint debt
|
||
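The count-budget gate described above reduces to comparing a per-rule count with a committed ceiling. The real check is a Node script (check-lint-budgets.mjs); the Python rendition below is only a sketch of the idea. The `max` values and current counts come from the commit message, while the `target` values and the function name `over_budget` are placeholders invented for illustration.

```python
# Assumed shape of eslint-budgets.json: rule -> {"max": ..., "target": ...}.
# "max" values are from the commit message; "target" values are placeholders.
BUDGETS = {
    "@typescript-eslint/no-explicit-any": {"max": 2040, "target": 0},
    "complexity": {"max": 140, "target": 0},
    "max-depth": {"max": 70, "target": 0},
}


def over_budget(budgets: dict, counts: dict) -> list:
    """Return one message per rule whose repo-wide count exceeds its max."""
    failures = []
    for rule, limits in budgets.items():
        count = counts.get(rule, 0)
        if count > limits["max"]:
            failures.append(f"{rule}: {count} exceeds max {limits['max']}")
    return failures
```

With the counts quoted in the commit (2021, 129, 61) the list is empty, so the build passes; the gate fires only once a total crosses its ceiling, which is what lets individual new findings land as warnings.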
|
|
9196098e9e
|
fix(mcp): gate /public/mcp_hub strictly on litellm.public_mcp_servers (#27764)
* fix(mcp): gate /public/mcp_hub strictly on litellm.public_mcp_servers
* fix(mcp): add public_mcp_hub_strict_whitelist flag (default True) for migration |
||
|
|
be7b9319d2
|
fix(proxy): disable proxy buffering on streaming SSE responses (#29557)
Streaming responses from the proxy (/chat/completions, /v1/messages, /v1/responses, assistants) all return through create_response() but never sent the headers that tell an intermediary reverse proxy not to buffer the SSE stream. nginx with the default proxy_buffering, k8s ingress-nginx, and Envoy/Istio sidecars therefore hold the whole stream and release it in one batch, which looks like a broken/buffered stream to the client even though litellm is yielding chunks incrementally. Add Cache-Control: no-cache and X-Accel-Buffering: no to every StreamingResponse create_response() returns, matching what the proxy already does for its own usage/policy SSE endpoints. Fixes #28384. |
||
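The header change in the commit above amounts to merging two fixed headers into whatever a streaming response already carries. A minimal sketch follows; `NO_BUFFERING_HEADERS` and `with_no_buffering` are hypothetical names, and the real change lives inside the proxy's create_response().

```python
from typing import Dict, Optional

# The two headers named in the commit message.
NO_BUFFERING_HEADERS = {
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",
}


def with_no_buffering(headers: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the headers with the anti-buffering pair applied."""
    merged = dict(headers or {})
    # Assumption: the anti-buffering values win over any caller-supplied ones.
    merged.update(NO_BUFFERING_HEADERS)
    return merged
```

`X-Accel-Buffering: no` is the per-response switch nginx honours to disable proxy buffering, which is why it helps without any change to the intermediary's own configuration.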
|
|
e9417603a3
|
fix(key_generate): scope session-token team-key budget exemption to caller-supplied team_id (#29641)
#29612 exempts UI/CLI session tokens from the key budget ceiling when they create a team key, keyed on data.team_id. That value is read after the default_key_generate_params loop can populate team_id, so on deployments that set default_key_generate_params.team_id a request the caller did not scope to a team is treated as a team key and skips the ceiling. Capture _requested_team_id before defaults run and key the exemption off it, mirroring how _requested_max_budget is already captured. Requests the caller did not scope to a team keep the ceiling. |
||
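The ordering bug in the commit above is easier to see in miniature: the caller's own team_id has to be captured before defaults are merged in. The sketch below is illustrative only; `apply_defaults_and_check_exemption` and its dict-based request are hypothetical stand-ins for the key-generation code path.

```python
def apply_defaults_and_check_exemption(
    request: dict, defaults: dict, is_session_token: bool
) -> bool:
    """Merge defaults into the request and report whether the budget
    ceiling may be skipped (hypothetical sketch)."""
    # Capture what the caller actually asked for BEFORE defaults run.
    requested_team_id = request.get("team_id")

    # Defaults fill in anything the caller left unset, including team_id.
    for key, value in defaults.items():
        if request.get(key) is None:
            request[key] = value

    # Keying off request["team_id"] here would wrongly exempt requests
    # whose team_id came only from the defaults.
    return is_session_token and requested_team_id is not None
```

A request the caller did not scope to a team still ends up with a default team_id after the merge, but it keeps the ceiling because the exemption reads the captured value.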
|
|
c7f1bcfd0d
|
build(ui): migrate eslint to flat config and bump eslint-config-next to 16 (#29626)
ESLint 9 defaults to flat config and eslint-config-next was pinned at 15 while Next is on 16, so eslint only ran with ESLINT_USE_FLAT_CONFIG=false and next lint is gone on Next 16. Replace .eslintrc.json with a native flat eslint.config.mjs (config-next 16 ships flat configs, so no FlatCompat shim is needed), bump eslint-config-next to 16.2.6, add @eslint/js and typescript-eslint as explicit devDeps for the recommended rule sets, and point the lint script at eslint directly. This only makes eslint runnable on modern tooling; it does not wire it into CI. The same rules carry over (next/core-web-vitals, eslint and typescript-eslint recommended, prettier, unused-imports) |
||
|
|
5ee526d78e
|
fix(realtime): allow null transcripts in stream logging payloads (#29625)
Allow realtime event transcript fields to be nullable so GA conversation.item payloads with transcript=null don't fail logging normalization and suppress success callbacks. Co-authored-by: Cursor <cursoragent@cursor.com> |
||
|
|
97ba7e1a30
|
fix(key_generate): exempt UI/CLI session tokens from the budget ceiling for team keys (#29612)
Non-admin users creating a team key through the UI were rejected with "max_budget cannot exceed the caller's own max_budget (0.25)". The request is authenticated by a UI/CLI session token whose max_budget is the per-session chat spend cap (max_ui_session_budget, default $0.25), and the delegated-authority budget ceiling (GHSA-q775-qw9r-2r4g) treated that cap as a delegation limit. Skip the ceiling only when a session token creates a team key (data.team_id set); that key's spend is bounded by the team budget at request time. Personal keys and every other non-admin caller keep the ceiling, so a session token cannot mint an arbitrary-budget personal key. |
||
|
|
b4aee2c7dd
|
test(vcr): close out the remaining VCR live-call leaks (#29603)
* Fix remaining VCR live-call leaks
* test(vcr): dedupe live-test helpers and drop spurious kwargs
Extract the duplicated isVertexQuotaError/runVertexRequestOrSkip Vertex quota-skip helpers into tests/pass_through_tests/vertex_test_helpers.js and the duplicated _skip_live_prompt_caching_test guard into tests/_live_test_helpers.py so each lives in one place. In test_aarun_thread_litellm, build a separate message_data carrying role/content for add_message and a thread_data without them for run_thread/run_thread_stream/get_messages, which no longer receive the spurious message fields.
* test(overhead): assert mock transport is exercised in non-streaming and stream tests |
||
|
|
84969aaf15
|
fix(ci): keep coverage rename green when a parallel node runs no tests (#29608)
* fix(ci): keep coverage rename green when a parallel node runs no tests
local_testing_part1 and local_testing_part2 run with parallelism 4. When
CircleCI reruns only the failed tests, the failed test lands on a single
node and the other nodes receive an empty bucket, so pytest never writes
coverage.xml or .coverage. The unguarded "mv coverage.xml ..." then exits
1 and turns the whole job red even though the rerun passed; the next
persist_to_workspace step would fail the same way on the missing paths.
Guard the rename so a node with no coverage emits empty placeholders
instead. coverage combine tolerates the empty files, so the downstream
upload-coverage job keeps the real nodes' data intact.
* fix(ci): pre-create test-results in litellm_router_testing for empty-bucket reruns
litellm_router_testing also runs with parallelism 4. On a rerun of only the
failed tests, a node can receive no tests, so the test command never creates
test-results and the final store_test_results step can fail on the missing
path. Pre-create the directory up front, matching what local_testing_part1
and part2 already do and CircleCI's own guidance for parallel reruns.
* test(openai): retry wildcard chat completion on transient OpenAI 500
build_and_test reddened on test_openai_wildcard_chat_completion when the
real gpt-3.5-turbo-0125 call returned an OpenAI 500 ("The server had an
error while processing your request"). The base branch passed the same
call concurrently, so the 500 is an intermittent OpenAI server error, not
a regression. Add the same pytest-retry marker the sibling real-call tests
in this file already use so a transient upstream 500 no longer fails CI.
|
||
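The coverage guard described above is a shell step in the CI config; the same logic can be sketched in Python for clarity. This is an illustration under assumptions: `rename_or_placeholder` is a hypothetical helper and the target file names used with it are made up.

```python
from pathlib import Path


def rename_or_placeholder(workdir: Path, source: str, target: str) -> bool:
    """Rename the coverage file if the node produced one; otherwise
    write an empty placeholder so later steps find the expected path.

    Returns True when a real file was renamed, False for a placeholder.
    """
    src = workdir / source
    dst = workdir / target
    if src.exists():
        # replace() overwrites an existing target on every platform.
        src.replace(dst)
        return True
    # Node received an empty test bucket: emit an empty file instead of failing.
    dst.touch()
    return False
```

The point of the placeholder is that the step after the rename expects the path to exist, so an unguarded move fails the job even though no test failed.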
|
|
2bbdbfa5c3
|
fix: passthrough endpoints duplicate logs (#29598)
* fix duplicate cost callbacks for anthropic streaming pass-through
Two bugs caused _PROXY_track_cost_callback to see stream=True + complete_streaming_response=None on every streaming pass-through request, making the dedup guard in dispatch_success_handlers permanently inactive:
1. pass_through_endpoints.py created the Logging object with stream=False for all requests. _is_assembled_stream_success short-circuits on self.stream is not True, so has_dispatched_final_stream_success was never set and any second dispatch went through unchecked. Fix: set logging_obj.stream = True after stream detection.
2. _create_anthropic_response_logging_payload set complete_streaming_response inside the try block after litellm.completion_cost(), so a pricing error caused an early return without setting it on model_call_details. Fix: set complete_streaming_response before the try block.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix stream
* add stream to logging obj
* test(pass_through): give mock logging object a real model_call_details dict
The anthropic passthrough logging payload now records the assembled response on model_call_details before cost calculation, which requires model_call_details to support item assignment. In production it is always a dict; the existing unit test stubbed the logging object with a bare Mock whose attribute is not subscriptable, so the new assignment raised TypeError. Use a real dict to match the production logging object.
* test(pass_through): cover streaming logging-obj stream flag
The streaming branch of pass_through_request that marks the logging object as streaming (logging_obj.stream and model_call_details["stream"]) had no unit coverage, so the patch coverage gate flagged it. Add a regression test that drives a streaming pass-through request through pass_through_request and asserts the logging object is flagged as a stream before dispatch.
* test(pass_through): cover SSE-response stream flag fallback branch
The auto-detected streaming branch of pass_through_request (when a request that was not flagged as streaming returns a text/event-stream response) sets logging_obj.stream and model_call_details["stream"] but had no unit coverage, so the codecov patch gate failed at 60%. Drive a non-streaming pass-through request whose upstream response is SSE through pass_through_request and assert the logging object is flagged as a stream before dispatch.
* fix(pass_through): gate complete_streaming_response on stream flag
perform_redaction only scrubs complete_streaming_response when model_call_details["stream"] is True. Setting it unconditionally for non-streaming Anthropic pass-through responses left the assembled response unredacted in model_call_details, which is handed to logging callbacks as kwargs when message logging is disabled. Only record it for actual streaming responses so redaction always applies.
---------
Co-authored-by: mubashir1osmani <mubashir.osmani777@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
5119b9462f
|
feat(arize/phoenix): OpenInference rendering parity — tool_calls, cost, passthrough I/O, session/user, multimodal, cache tokens (#28800)
* feat(arize): enrich OpenInference attributes for better span rendering
Pure rendering enhancements to the Arize / Arize Phoenix integration. No
existing attribute keys or values are removed or overwritten; every new
emit is independently try/except-wrapped and fires only when its source
data is present so existing behavior is preserved.
What this adds
- Coerce non-dict response objects (e.g. httpx.Response from passthrough
routes) via JSON decode so id/model/usage extraction stops crashing
with "'Response' object has no attribute 'get'". Dicts and Pydantic
objects with .get pass through unchanged.
- Set OPENINFERENCE_SPAN_KIND defensively early so a downstream failure
can't blank the kind; the original late write (incl. TOOL upgrade) is
preserved.
- Add "passthrough" keyword to _infer_open_inference_span_kind so
allm_passthrough_route / llm_passthrough_route resolve to LLM instead
of UNKNOWN.
- Emit cache token breakdown: LLM_TOKEN_COUNT_PROMPT_DETAILS_CACHE_READ /
_CACHE_WRITE / _AUDIO. Sources covered: OpenAI prompt_tokens_details
and Anthropic / Bedrock cache_{read,creation}_input_tokens.
- Render assistant tool_calls on both input and output messages via
MESSAGE_TOOL_CALLS.* (Pydantic-aware, handles ModelResponse choices).
Tool-result input messages also get MESSAGE_TOOL_CALL_ID and
MESSAGE_NAME.
- Render multimodal list-shaped content via MESSAGE_CONTENTS.* (OpenAI
image_url, Anthropic source.{media_type,data} as data: URI). Legacy
MESSAGE_CONTENT write is unchanged.
- Emit SESSION_ID (end_user_id / trace_id), USER_ID (only when not
already set by optional_params.user or model_params.user), and
litellm.{team_id,team_alias,key_alias} from StandardLoggingPayload
metadata.
- Emit llm.response.cost as float from StandardLoggingPayload.response_cost.
- Bedrock / Anthropic passthrough normalization: extract input from
additional_args.complete_input_dict and output from the coerced
provider response so INPUT_VALUE / OUTPUT_VALUE / LLM_INPUT_MESSAGES /
LLM_OUTPUT_MESSAGES are populated. Only runs when call_type contains
"passthrough" / "pass_through".
Tests
- 15 new unit tests covering each addition plus explicit regression
guards (USER_ID overwrite protection, passthrough normalizer scope,
coerce identity for dicts/.get-bearing objects, no spurious cache
emits).
- Existing test_arize_set_attributes count bumped from 26 to 27 to
account for the additional defensive span.kind write (same value,
written twice).
- tests/test_litellm/integrations/arize/: 70 passed (55 baseline + 15
new). tests/test_litellm/integrations/test_opentelemetry.py: 221
passed.
Co-authored-by: Cursor <cursoragent@cursor.com>
* refactor(arize): collapse additive try/except blocks into _safe_emit helper
The additive attribute emitters all share the same shape: run a callable,
swallow any exception to debug log so it cannot blank the span. Hoisting
that pattern into a single _safe_emit(label, fn, *args, **kwargs) helper
removes 5 repeated try/except blocks. Behavior unchanged; arize test
suite still passes (70/70).
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(arize): emit cost under canonical llm.cost.total key
Arize's "Total Cost" column reads the OpenInference-standard
`llm.cost.total` attribute. The previous custom `llm.response.cost`
key never surfaced in the trace list. Now emits both keys (canonical +
legacy) so renderers + any existing consumers both work.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(arize): keep span.kind=LLM for tool-using completions + render tool_calls in Output
A chat completion that passes `tools=[...]` or returns `tool_calls` is still
an LLM call per the OpenInference spec — TOOL is reserved for actual tool
execution. The previous override demoted these to TOOL, breaking Arize's
LLM-scoped dashboards/evals and skewing token/cost analytics for any
tool-using traffic.
Additionally, when an assistant response had no text content but did
request tool calls, `output.value` was set to the empty string so Arize's
"Output" pane rendered blank. Now serializes the tool_calls into a compact
JSON summary in `output.value` (the structured `MESSAGE_TOOL_CALLS.*`
attributes are still emitted unchanged).
Cleanups:
- extract `_get_tool_calls` and `_normalize_tool_call` helpers,
deduplicating the dict-vs-Pydantic + function-dict logic across
`_set_choice_outputs`, `_emit_message_tool_calls`, and the new
`_summarize_tool_calls_for_output`.
- drop redundant late `OPENINFERENCE_SPAN_KIND` write — the defensive
early write is now the single source of truth.
- remove a dead local re-import of `MessageAttributes`/`SpanAttributes`.
Tests: 73 pass (added regression guard asserting span.kind stays LLM for
completions that pass tools AND return tool_calls; existing call_count
assertion restored to 26).
Co-authored-by: Cursor <cursoragent@cursor.com>
* chore(arize): tighten cleanup — fold _get_tool_calls into _safe_get
Two tiny cleanups, no behavior change:
- collapse `_get_tool_calls` to use `_safe_get`, removing a 7-line
hand-rolled dict-vs-attribute fallback that duplicated existing logic.
- trim the `_set_choice_outputs` tool-call summary comment from 4 lines
to 2 (was over-explaining).
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(arize): address Greptile review — drop session_id=trace_id fallback, remove dead code, fix Black
Three Greptile-flagged issues + the Black formatting CI failure.
1. SESSION_ID no longer falls back to trace_id. Previously every span
without an explicit `user_api_key_end_user_id` would have its
session.id set to the per-request trace_id, which creates one
distinct "session" per request and breaks Arize's Session-grouping
analytics. Now SESSION_ID is emitted only when an explicit end-user
identifier exists, and the trace_id is emitted under its own
`litellm.trace_id` key so spans remain filterable by trace.
2. Removed dead `ArizeOTELAttributes.set_response_output_messages`
override. Confirmed zero callers in the entire repo (the live path
is `_set_choice_outputs` via `_set_response_attributes`). The
override was preexisting dead code, but the expansion of
`_set_choice_outputs` in this PR made the divergence misleading.
3. Removed permanently-dead first branch in cache_write detection.
`_safe_get(prompt_token_details, "cache_creation_tokens")` looks
for a key that neither OpenAI's `prompt_tokens_details` nor
Anthropic's payload ever exposes. Now reads straight off `usage`
for `cache_creation_input_tokens`.
4. Reformatted both files under Black 26.3.1 (the version CI uses
via `uv sync --frozen`). Local previously used 24.10.0.
Tests: 74/74 pass in the arize suite (added
`test_arize_does_not_use_trace_id_as_session_id_fallback`).
Combined arize + opentelemetry suite: 295/295 pass.
End-to-end verified live: tool-call still emits `span.kind=LLM` and
JSON tool_calls in `output.value`; `session.id` is now correctly
unset when no end_user_id is provided; `litellm.trace_id` is
populated; Bedrock passthrough input/output unchanged.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(arize): gate passthrough prompt export on message redaction
- Skip the complete_input_dict bridge in _maybe_normalize_passthrough when
should_redact_message_logging() is true, so enabling redaction no longer
leaks raw passthrough prompts into Arize (Veria security finding).
- Split passthrough input/output rendering into helpers to satisfy PLR0915.
- Remove dead call_type assignment (F841).
Validated live against a Bedrock passthrough proxy exporting to Arize:
non-redacted renders the real prompt on litellm_request; global
turn_off_message_logging yields input.value=redacted-by-litellm with the
raw_gen_ai_request child span suppressed and no SSN/marker leakage.
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
|
||
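The `_safe_emit(label, fn, *args, **kwargs)` helper mentioned in the refactor commit follows a common pattern: run a callable and downgrade any exception to a debug log. Below is a minimal sketch of that pattern under the signature quoted in the commit; the body, the logger name, and the public name `safe_emit` are assumptions, not the integration's actual code.

```python
import logging

logger = logging.getLogger("arize_sketch")  # hypothetical logger name


def safe_emit(label, fn, *args, **kwargs):
    """Run an attribute emitter; never let its failure escape."""
    try:
        fn(*args, **kwargs)
    except Exception as exc:
        # A failed optional attribute must not blank the rest of the span.
        logger.debug("skipping %s attributes: %s", label, exc)
```

For example, `safe_emit("cost", attrs.__setitem__, "llm.cost.total", 0.5)` sets the attribute on a dict named `attrs`, while `safe_emit("broken", lambda: 1 / 0)` logs at debug level and returns normally.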
|
|
2453936a82
|
Litellm websocket improvements (#29563)
* Add support for websocket via codex
* Add model alias and creds support
* fix: skip cost tracking for WS session wrapper call types
The @client decorator on _aresponses_websocket fires async_success_handler with result=None after the session ends. This triggered cost tracking errors because standard_logging_object is never built for None results. Per-turn costs are correctly tracked by individual litellm.aresponses calls inside the session. The outer session-level logging obj should not attempt cost tracking. Fix: skip _aresponses_websocket and _arealtime call types in deployment_callback_on_success, RouterBudgetLimiting.async_log_success_event, and _PROXY_track_cost_callback.
* fix: address Greptile review comments
Fix JSON injection: use json.dumps instead of f-string interpolation for model name in WS body. Add 30s timeout for first WS frame to prevent unbounded connection resource tie-up. Restore per-event model override in streaming_iterator; fall back to connection-level model when event omits it. Strengthen regression test: inject alias into kwargs via _update_kwargs_with_deployment mock so the test would fail on un-fixed code.
* fix: handle nested response.create format in first-frame model extraction
When ?model= is omitted, the first WS frame can carry the model in either flat format (first_event["model"]) or nested format (first_event["response"]["model"]). The flat-only check would silently reject clients using the nested wire format. Mirrors the same two-format logic in _build_base_call_kwargs.
* fix: don't force connection-level custom_llm_provider on per-event model overrides
If a client sends a different model per response.create turn, litellm needs to re-resolve the provider from that model string. Forcing the connection-level custom_llm_provider would silently route the request to the wrong backend. Only inject custom_llm_provider when the per-event model matches the connection-level model.
* refactor: extract WS model extraction into testable function
Pull the flat/nested model extraction into _extract_model_from_first_ws_event so tests import and exercise the real function rather than a copy.
* fix: compare providers not full model strings in _inject_credentials
The model == self.model guard was too strict: same-provider model variants (e.g., vertex_ai/gemini-2.0 -> vertex_ai/gemini-1.5 on one connection) would lose custom_llm_provider, breaking routing when a custom api_base is in use. Compare the provider extracted by get_llm_provider instead, so same-provider variants still inherit the connection-level provider while cross-provider overrides let litellm re-resolve.
* style: black formatting
* refactor: extract first-frame model resolution to fix PLR0915 (too many statements)
* Fix responses WebSocket first-frame validation
* fix: classify WS first-frame read errors and clarify cost-skip log
Distinguish client disconnects from server errors when reading the responses WebSocket first frame, make the cost-tracking skip log message accurate for session wrappers (which do carry a model), and resolve the connection-level provider once per session instead of on every response.create event.
* test: cover WS first-frame read errors and same-provider credential injection
Adds regression tests for the still-uncovered responses WebSocket paths: the timeout, invalid-JSON and missing-model branches of _read_ws_model_from_first_frame, plus the provider comparison in ManagedResponsesWebSocketHandler._same_provider and _inject_credentials (same-provider model variants keep the connection provider; cross-provider models re-resolve).
* fix(responses-ws): fall back to explicit custom_llm_provider when connection model is unresolvable
When a WebSocket session is opened with a custom deployment alias that litellm cannot resolve to a provider, _connection_provider was None, so _same_provider returned False for every resolvable per-event model and the connection-level custom_llm_provider was dropped. Use the explicitly-set custom_llm_provider as the connection provider in that case so same-provider per-event models still inherit it while genuinely cross-provider models continue to re-resolve.
---------
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> |
||
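Two of the fixes above are small enough to sketch: reading the model from either the flat or the nested first-frame shape, and building a frame with json.dumps so a model name cannot break out of its string. Both functions below are hypothetical illustrations; the frame layout in `build_first_frame` is assumed, and the real extraction lives in `_extract_model_from_first_ws_event`.

```python
import json
from typing import Optional


def extract_model_from_first_event(first_event: dict) -> Optional[str]:
    """Accept both wire formats: {"model": ...} and {"response": {"model": ...}}."""
    model = first_event.get("model")
    if isinstance(model, str) and model:
        return model
    response = first_event.get("response")
    if isinstance(response, dict):
        nested = response.get("model")
        if isinstance(nested, str) and nested:
            return nested
    return None


def build_first_frame(model: str) -> str:
    """Serialize with json.dumps so quotes in the model name are escaped.

    The frame shape is an assumption made for this example.
    """
    return json.dumps({"type": "response.create", "model": model})
```

With f-string interpolation, a model name containing `"` could close the string early and inject extra keys; json.dumps escapes it, so the decoded frame always has exactly the keys the server wrote.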
|
|
cc55662e5f
|
fix(vertex): strip output_config.effort for Vertex Claude models that reject it (Haiku 4.5) (#29585)
* fix(vertex): strip output_config.effort for models that reject it
Haiku 4.5 on Vertex AI does not support output_config.effort and 400s with "output_config.effort: Extra inputs are not permitted". PR #27074 emptied VERTEX_UNSUPPORTED_OUTPUT_CONFIG_KEYS so effort would forward for Opus/Sonnet 4.6+, but that made forwarding unconditional across every Vertex Anthropic model, including ones that don't support it. Claude Code injects effort into its default Messages payload, so `claude --model claude-haiku-4.5` started failing. Make the sanitizer model-aware: drop output_config.effort for models that don't advertise output_config support (or any reasoning effort level) while forwarding it for those that do. The fix covers both the chat-completion and Messages pass-through transformation paths since they share the helper.
* chore(vertex): log at debug when dropping unsupported output_config.effort
Operators who point an unregistered Vertex Claude alias at a model that does support effort would otherwise see it stripped with no signal. Debug level keeps it out of normal logs since Claude Code sends effort on every request. |
||
|
|
34293fa80a
|
ci: reproduce default-Windows wheel install to guard MAX_PATH (#29597)
* ci: reproduce default-Windows wheel install to guard MAX_PATH
The existing using_litellm_on_windows job installs the project with `uv sync`, an editable source install that never copies package files into a deep site-packages path, so it cannot see the 260-char MAX_PATH overflow that breaks `pip install litellm` on default Windows. The content-filter benchmark fixtures have hit that limit three times (#21941, #22039, #29536), each caught only after release. This adds a guard to the same job that builds the wheel and installs it the way an end user would: into a venv whose site-packages prefix is padded to a realistic worst-case Windows length (~100 chars), then asserts the install completes and litellm imports. Any packaged path long enough to bust MAX_PATH at that prefix is reported up front, so the check is deterministic regardless of the runner's long-path setting, while the real install also covers failure modes a length heuristic cannot (half-unpacked packages, reserved names, case collisions). This commit is the guard only; on the current tree it correctly fails because nine fixtures still exceed the limit. The rename that brings them back under it follows on this branch.
* fix(packaging): shorten content-filter benchmark fixtures under MAX_PATH
The 10 content-filter benchmark result fixtures used the legacy block_{topic}_-_contentfilter_({yaml}).json naming, up to 176 chars inside the wheel, which busts the Windows 260-char MAX_PATH limit once extracted under a realistic site-packages prefix and aborts `pip install litellm` on default Windows. Rename them to the short {topic}_cf.json scheme that _save_confusion_results already emits today (it splits the label on the em-dash and writes f"{topic}_cf"), matching the insults_cf.json and investment_cf.json files fixed earlier. Re-running the eval suite now regenerates these same short names rather than recreating the long ones. This drops the longest packaged path from 176 to 128, so the guard added in the previous commit goes from red to green with a 32-char margin.
* test(windows): tidy MAX_PATH guard per review
Close the wheel zip via a context manager rather than leaning on refcount collection, and select the wheel under dist/ by newest mtime so a stale artifact from an earlier build cannot be tested instead of the one just produced. Also pin down the venv-depth formula with a short note: the +2 is the separator joining the venv root to "Lib" plus the trailing separator before the entry, which lands the simulated site-packages prefix at exactly 100 chars. |
||
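The length check described in #29597 can be illustrated with a small sketch. This is not the CI script from the commit: the function names `path_margin`, `overflowing_entries`, and `newest_wheel` are hypothetical, and treating a path as overflowing only when prefix plus entry length exceeds 260 is an assumption about the boundary. The 260-char limit, the 100-char simulated prefix, the context-managed zip, and the newest-mtime wheel selection come from the commit message.

```python
import zipfile
from pathlib import Path

MAX_PATH = 260              # default Windows limit cited in the commit message
SIMULATED_PREFIX_LEN = 100  # worst-case site-packages prefix length from the commit


def path_margin(entry, prefix_len=SIMULATED_PREFIX_LEN, limit=MAX_PATH):
    """Characters left before the extracted path would reach the limit."""
    return limit - (prefix_len + len(entry))


def overflowing_entries(wheel_path, prefix_len=SIMULATED_PREFIX_LEN, limit=MAX_PATH):
    """Return (entry, extracted_length) pairs that would exceed the limit."""
    # Context manager closes the archive deterministically.
    with zipfile.ZipFile(wheel_path) as wheel:
        names = wheel.namelist()
    return [
        (name, prefix_len + len(name))
        for name in names
        if path_margin(name, prefix_len, limit) < 0  # assumed boundary: strictly over
    ]


def newest_wheel(dist_dir):
    """Pick the most recently modified wheel so a stale artifact is not tested."""
    wheels = list(Path(dist_dir).glob("*.whl"))
    return max(wheels, key=lambda p: p.stat().st_mtime)
```

Under these assumptions a 176-char entry lands at 276 chars and is reported, while a 128-char entry leaves a margin of 32, which matches the figures quoted in the commit.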
|
|
53a206a179
|
fix(anthropic/adapter): emit thinking block for reasoning_content-only streaming chunks (#29600)
* fix(anthropic/adapter): open thinking block for reasoning_content-only streaming chunks The /v1/messages streaming content-block classifier (_translate_streaming_openai_chunk_to_anthropic_content_block) only recognized thinking_blocks. OpenAI-compatible reasoning backends (vLLM/SGLang reasoning parsers: DeepSeek-R1, Qwen3, gpt-oss, ...) populate reasoning_content with thinking_blocks=None, so the classifier fell through to a text block. The delta translator already emits thinking_delta for reasoning_content, so those deltas landed inside a text block and Anthropic streaming clients (Claude Code, SDK .stream()) silently dropped the chain-of-thought. Mirror the reasoning_content fallback already present in the non-stream translator and the streaming delta translator so the classifier opens a thinking block. Adds a focused regression test.
* fix(anthropic/adapter): reach reasoning_content branch when thinking_blocks attr is absent Delta deletes the thinking_blocks attribute when unset, so the prior nested check was unreachable for reasoning-only chunks (vLLM/SGLang). Make it a sibling elif so the content block is classified as thinking.
* test(proxy): stop component-allowlist test leaking DATABASE_URL into xdist peers The component-allowlist test pins throwaway DATABASE_URL/LITELLM_MASTER_KEY values at import time via os.environ so importing proxy_server doesn't need a live database. Those values persisted for the whole pytest-xdist worker, so a sibling test sharing the worker (test_key_rotation_e2e's DB-backed E2E case) saw the leaked sqlite DATABASE_URL, treated it as an available database instead of skipping, and the Prisma engine rejected the non-postgres URL (P1012 -> httpx.ConnectError). Restore the prior environment after the import so the throwaway values never escape the module.
---------
Co-authored-by: Tai An <antai12232931@outlook.com> |
||
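The classifier change in #29600 can be pictured with a simplified sketch. It is not the adapter's code: `classify_content_block` is a hypothetical stand-in for the real classifier, it ignores tool-use and other block types, and `SimpleNamespace` objects stand in for streaming deltas. What it shows is the sibling `elif` on `reasoning_content`, read with `getattr` so the branch is reachable even when the `thinking_blocks` attribute is absent.

```python
from types import SimpleNamespace


def classify_content_block(delta):
    """Return the content-block type to open for a streaming delta (simplified)."""
    # The attribute may be missing entirely, so read it defensively.
    thinking_blocks = getattr(delta, "thinking_blocks", None)
    reasoning_content = getattr(delta, "reasoning_content", None)

    if thinking_blocks:
        return "thinking"
    elif reasoning_content:
        # Sibling branch: reasoning-only chunks (no thinking_blocks attribute)
        # still open a thinking block instead of falling through to text.
        return "thinking"
    return "text"


# Hypothetical deltas: one reasoning-only chunk, one plain text chunk.
reasoning_only = SimpleNamespace(reasoning_content="Let me think...", content=None)
plain_text = SimpleNamespace(content="Hello")
```

With a nested check (testing `reasoning_content` only inside a branch that first requires the `thinking_blocks` attribute to exist), `reasoning_only` would be classified as text, which is the behavior the commit describes fixing.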
|
|
48c9fabb26
|
Fix : a2a bugs 030626 (#29566)
* Fix error code and context id injection bug
* Add support for all A2A methods
* Add logging
* address greptile review: relay upstream JSON-RPC errors, move _PASCAL_TO_WIRE to module level, add error path tests
* fix(a2a): run pre_call_hook for tasks/resubscribe SSE path to enforce guardrails tasks/resubscribe was returning the raw SSE stream without calling proxy_logging_obj.pre_call_hook, silently bypassing any guardrails configured on the agent. This patch calls pre_call_hook before streaming begins and wires post_call_failure_hook into the SSE generator so errors are logged. Adds a regression test verifying the hook is called.
* fix(a2a): use get_async_httpx_client instead of creating httpx clients per request Creating httpx.AsyncClient instances per-request adds ~500ms latency. Switch _forward_jsonrpc and _forward_jsonrpc_sse to use the shared client from get_async_httpx_client(httpxSpecialProvider.A2A).
* fix(a2a): forward caller identity headers on task ops; validate push notification URL Two security fixes for task management methods: 1. All task operations (tasks/get, tasks/list, tasks/cancel, tasks/resubscribe, push notification config methods) now forward X-LiteLLM-User-Id and X-LiteLLM-Team-Id headers to the upstream agent, so the agent can scope task access to the authenticated caller. 2. tasks/pushNotificationConfig/set validates the callback URL before forwarding: requires HTTPS and rejects private/loopback/reserved IP ranges and localhost hostnames to prevent SSRF.
* Fix A2A task hook and push URL handling
* fix(a2a): fix mypy type errors for request_id and header_name dict key types
* Fix A2A request id and params forwarding
* Forward trace IDs for A2A task calls
* fix(a2a): strip client-forwarded X-LiteLLM-* headers before applying authenticated identity A client could send x-a2a-<agent>-x-litellm-user-id in their request and have it forwarded to the upstream agent as an authenticated identity header. Fix: sanitize any X-LiteLLM-* headers from agent_extra_headers before merging, then apply the authenticated identity headers last so they always override client-supplied values.
* Fix A2A SSE fallback JSON-RPC error code
* Fix A2A SSE error id backfill
* fix(a2a): validate both push notification url fields to close SSRF bypass
* fix(a2a): widen request_id annotation to match JSON-RPC id call sites
* fix(a2a): run post-call streaming hook for tasks/resubscribe so agent guardrails apply tasks/resubscribe returned the raw upstream SSE stream without routing events through the post-call streaming hook, so output guardrails configured on the agent were silently skipped for streaming task subscriptions while every other task method and message/stream applied them. Parse upstream JSON-RPC SSE events and feed them through async_streaming_data_generator, matching message/stream, so guardrails inspect the streamed task content. Adds a regression test that fails when the streamed events bypass the guardrail hook.
---------
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> |
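Two of the security fixes in #29566 can be sketched as follows. This is an illustration under stated assumptions, not the proxy's implementation: `is_safe_push_url` and `build_upstream_headers` are hypothetical names, the URL check only inspects literal IP addresses and the `localhost` hostname (it does not resolve DNS), and the header names are the ones quoted in the commit message.

```python
import ipaddress
from urllib.parse import urlparse


def is_safe_push_url(url):
    """Reject non-HTTPS URLs, localhost, and private/loopback/reserved IP literals."""
    parsed = urlparse(url)
    host = parsed.hostname
    if parsed.scheme != "https" or not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a regular hostname; this sketch does not resolve it
    return not (ip.is_private or ip.is_loopback or ip.is_reserved or ip.is_link_local)


def build_upstream_headers(agent_extra_headers, user_id, team_id):
    """Drop client-supplied X-LiteLLM-* headers, then apply authenticated identity last."""
    headers = {
        name: value
        for name, value in agent_extra_headers.items()
        if not name.lower().startswith("x-litellm-")
    }
    # Identity headers are written after sanitizing, so they always win.
    if user_id is not None:
        headers["X-LiteLLM-User-Id"] = user_id
    if team_id is not None:
        headers["X-LiteLLM-Team-Id"] = team_id
    return headers
```

Writing the identity headers after the sanitizing step means a spoofed `x-litellm-user-id` from the client is removed even when the authenticated caller has no user or team id to overwrite it with.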