248176112e
39629 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
248176112e
|
feat(ui): add admin flag to disable in-product UI nudges for everyone (#29796)
* feat(ui): add admin flag to disable in-product UI nudges for everyone Admins can now suppress the survey and Claude Code feedback popups for all users via a single disable_ui_nudges UI setting, instead of relying on each user dismissing them individually. * fix(ui): suppress nudges while ui settings are loading Gate nudgesDisabled on the ui-settings loading state so an admin with disable_ui_nudges on doesn't see the survey prompt flash, and the getInProductNudgesCall fetch doesn't fire, on a cold page load before the flag resolves. Falls back to showing nudges if the fetch errors. * test(ui): wrap CreateKeyPage test in QueryClientProvider page.tsx now calls useUISettings (react-query), which needs a QueryClient that layout.tsx supplies in production but the test did not. Add the provider and mock getUiSettings so the query resolves. |
||
|
|
50522157dc
|
docs(security): require a reproduction video for vulnerability reports (#30048) (#30063)
With AI models capable of automated vulnerability discovery now publicly available, we expect a large increase in report volume, much of it unverified. Requiring a video of the exploit running against a live instance raises the bar for submissions and keeps triage focused on reproducible issues. Reports without a video will be closed and reopened if one is added later. Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com> |
||
|
|
5b7063d194
|
fix(mcp): allow team access-group grants in OAuth authorize/token access check (#30041)
* fix(mcp): honor team access-group grants in OAuth authorize/token access check * test(mcp): mock build_effective_auth_contexts in non-admin authorize tests for isolation |
||
|
|
d8fe091938
|
fix(ui/mcp): reset OAuth state on create-server modal close so a prior server's token no longer leaks into the next add-server session (#30000)
* fix(ui/mcp): reset OAuth hook state on modal close so a prior server's token no longer leaks into the next add-server session * fix(ui/mcp): clear in-flight OAuth guard on reset and reset form/tools on modal close so nothing leaks on a parent-driven dismiss |
||
|
|
38edf241a4
|
chore(ui): remove dead App Router route stubs under (dashboard) (#30045)
models-and-endpoints, organizations, and virtual-keys each had a page.tsx route under (dashboard)/ that is not in MIGRATED_PAGES, so the sidebar and deep links never resolve to it and the route is unreachable. Each was a thin wrapper that handed the shared view empty or no-op props (empty modelData with a no-op setModelData, hardcoded empty organizations, no-op setUserRole/setUserEmail), so reaching one would render a degraded page in any case. The real wrapper belongs in the PR that flips each page into MIGRATED_PAGES, written with eyes on it and a test This continues the dead-scaffolding cleanup from #28891. The shared components these wrappers rendered (ModelsAndEndpointsView, OrganizationFilters) stay, since the legacy ?page= switch in app/page.tsx and src/components still import them |
||
|
|
fe60f9d0f1
|
fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through (#24232)
* fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through * test: mock post_call_response_headers_hook in audio speech route tests |
||
|
|
6ae8a509f0
|
test(ui): data-driven App Router migration E2E smoke (default + server-root-path) (#29974)
* test(ui): add a data-driven App Router migration E2E smoke
Add a growing Playwright smoke for migrated pages: for each segment it deep-links
to the path route, asserts the URL and that the dashboard shell rendered, then
clicks off to a legacy page and asserts navigation still works. Driven by
e2e_tests/fixtures/migratedPages.ts, so adding a page is one line.
Runs in two situations against the same proxy: the default mount (npm run
e2e:migration) and a non-root SERVER_ROOT_PATH mount (npm run e2e:migration:root).
globalSetup now logs in at `${SERVER_ROOT_PATH}/ui/login` so the admin storage
state is valid under a prefix. Seeded with api-reference; append the rest as their
migrations merge.
* test(ui): support headed slow-motion + watch pauses in the migration smoke
Honor SLOWMO in the server-root-path config (the default config already did),
and add an env-gated E2E_WATCH_MS pause so a headed run lingers on each state.
Both are no-ops by default, so CI behavior is unchanged.
* test(ui): make the migration smoke a sidebar-click user journey
Rework the smoke from deep-linking to a real navigation journey: start at the
landing page, click the migrated page in the sidebar (expanding submenus for
nested items), assert the path route rendered, reload it (the check a wrong
server_root_path breaks), bounce to a legacy page and back, and — once two pages
are migrated — navigate directly between two migrated pages. Verifies via URL +
shell render, driven by the same fixture list.
* test(ui): address review on the migration smoke
Escape ROOT and segment before interpolating them into RegExp URL matchers so a
future segment containing regex metacharacters can't silently widen the match.
Make the server-root-path config fail fast when SERVER_ROOT_PATH is unset instead
of silently re-running the default mount and passing without exercising the prefix.
* test(ui): drop unused watch helper and fix stale smoke README
* test(ui): run the migration smoke under a server root path in CI
* test(ui): harden + instrument the server-root-path proxy reboot in CI
* test(ui): run the server-root-path migration smoke as its own CI job
Replace the in-place proxy reboot in e2e_ui_testing with a dedicated
e2e_ui_testing_server_root_path job that boots the proxy once with
SERVER_ROOT_PATH=/litellm, matching how every other proxy variant in the
config gets its own job rather than killing and relaunching the live proxy.
The reboot was failing deterministically: after pkill -9 and relaunch the
prefixed proxy never came back up on :4000 (connection refused), so the smoke
never ran. The readiness step that was supposed to surface the cause could
never reach its boot-log tail because CircleCI runs steps under bash -eo
pipefail and the preceding `curl -sv ... | tail` aborted the step with curl's
exit 7. Booting the proxy as the job's own background step lets any boot crash
land in that step's log instead of being swallowed.
The default e2e_ui_testing job is unchanged aside from dropping the reboot,
prefixed-readiness, and prefixed-smoke steps; the migration smoke still runs at
the root mount there via the default Playwright config.
|
||
|
|
d84499e0f2
|
fix(team): reserve team budget raises for proxy admins on /team/update (#30030)
The caller's PERSONAL max_budget was the wrong yardstick for /team/update: a team's spend ceiling has nothing to do with the admin's own key budget. That comparison was an unintended side effect of reusing _check_user_team_limits() (which exists for the /team/new path) and broke the UI, which re-sends the unchanged budget on every save. New behavior on /team/update for standalone teams: - A team admin (already authorized via _verify_team_access) may freely KEEP or LOWER the team budget, and change models/tpm/rpm, without being gated by their personal limits. - GROWING a team's spend ceiling is a budget-authority action reserved for proxy admins -> 403 for team admins. "Growing" covers both raising max_budget above the team's current finite value and removing the cap entirely (max_budget=null, detected via model_fields_set so an explicit null is distinguished from an omitted field). For a team that currently has no cap, setting a finite value is a restriction and is allowed. - Org-scoped teams remain governed by _check_org_team_limits() (capped by the org budget). Also reverts the #29525 existing_team_max_budget workaround in _check_user_team_limits() back to the create-only form; /team/new still enforces the creator's personal caps. docs(access_control): resolve the contradiction in the team-admin section — team admins can keep/lower the budget and manage rate limits/models, but cannot raise the team budget (proxy-admin only). tests: unit + behavior coverage for raise-blocked, cap-removal-blocked (team admin), raise/removal allowed (proxy admin), uncapped-team restriction allowed, keep/lower/resend allowed, and unchanged create-path guards. Co-authored-by: Cursor <cursoragent@cursor.com> |
||
|
|
51ba6e39cd
|
fix(mcp): load MCP tool configuration tools via the OBO/passthrough-aware GET path (#29960)
* fix(ui): load MCP tool configuration tools via the OBO/passthrough-aware GET path * fix(mcp): admin-only include_disabled_tools so the settings UI shows toggled-off tools * fix(ui): repopulate MCP server edit form when server data loads after mount (OAuth return) * fix(ui): persist MCP OAuth token on save and return to the Settings tab after authorize * fix(ui): scope MCP OAuth callback to the initiating form so create and edit flows don't cross-talk * fix(ui): derive OAuth-return Settings tab via lazy state init instead of setState-in-effect * Fix MCP OAuth edit token handling --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> |
||
|
|
424db6a980
|
feat(azure_ai): add MAI-Image-2.5 image generation support (#29688)
* feat(azure_ai): add MAI-Image-2.5 image generation support Route azure_ai MAI models to /mai/v1/images/generations and map OpenAI size to width/height for the serverless API. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(azure_ai): address MAI image generation review feedback Validate unsupported size values, default width/height independently, add MAI-Image-2.5 pricing, and expand test coverage. @greptileai Co-authored-by: Cursor <cursoragent@cursor.com> * feat(azure_ai): add MAI image edit and expand model cost map Add MAI image edit support with usage normalization for Azure response format, and register MAI-Image-2.5-Flash and MAI-Image-2e pricing in the model map. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(azure_ai): validate MAI edit size by consuming map iterator Greptile: lazy map() never evaluated int() so values like 1024xabc passed through. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(azure_ai): normalize MAI usage in generation response handler Apply normalize_mai_image_usage before building ImageResponse so token-based cost calculation works when Azure returns num_output_tokens fields. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(azure_ai): narrow MAI edit size param type for mypy Co-authored-by: Cursor <cursoragent@cursor.com> * Fix Azure MAI image response handling * Fix MAI image generation base model routing * fix(azure_ai): preserve zero num_output_tokens in MAI usage normalization * fix(azure_ai): wrap MAI generation response JSON parsing in error handling * fix(azure_ai): build MAI image edit URL correctly for /mai/ root bases * fix(azure_ai): build MAI image generation URL correctly for /mai/ root bases --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> |
||
|
|
92817cb65b
|
changing expires_in default to use actual slack return details (#29951) | ||
|
|
1bbaf1c39d
|
fix(guardrails): read CrowdStrike AIDR identity from both metadata bags (#29991)
Capture user_id and extra_info from metadata or litellm_metadata. The single-bag read dropped identity whenever a request carried a present litellm_metadata field (null or a user-supplied dict), since /chat/completions routes the authenticated identity into metadata while the guardrail read litellm_metadata first |
||
|
|
411bd3da5b
|
feat(vantage): include organization metadata in FOCUS Tags export (#28184)
* feat(vantage): include organization metadata in FOCUS Tags export Join LiteLLM_OrganizationTable when building Vantage/FOCUS export rows so organization_id and organization_alias appear in Tags for org-level filtering. Co-authored-by: Cursor <cursoragent@cursor.com> * test(focus): include api_requests in organization Tags tests FocusTransformer now requires api_requests after staging merge; add the column to test fixtures so integrations CI can run the Tags assertions. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> |
||
|
|
c24a3603d9
|
fix(team-management): delete a team's BYOK models when the team is deleted (#29977)
A team's BYOK models (rows in LiteLLM_ProxyModelTable with model_info.team_id set) were left orphaned when the team was deleted; they lingered in the database and kept showing on the Models + Endpoints page. delete_team now removes them via a new delete_team_models helper that deletes the rows in one transaction and syncs the in-memory router only after that transaction commits, run before the team rows are deleted so a mid-flight failure never leaves the team gone with its models orphaned |
||
|
|
bac2590b39
|
build(deps): bump pyjwt to 2.13.0 and ws override to 8.20.1 (#29982)
Raise the PyJWT floor in pyproject (>=2.13.0,<3.0) and re-resolve uv.lock so the proxy installs 2.13.0 instead of 2.12.0. Bump the ws transitive-version override in the dashboard from 8.19.0 to 8.20.1 and regenerate package-lock; jsdom and openai both dedupe onto the single 8.20.1 copy. Both are routine dependency maintenance bumps to keep pinned versions current. |
||
|
|
f59e4ebc9e
|
fix(ui): show team projects to internal users (#28855)
Allow internal users to fetch their backend-scoped project list so the key creation project dropdown can populate for selected teams. |
||
|
|
dfd6cbc514
|
fix(vertex): propagate Vertex AI metadata in streaming success callbacks (#29899)
* fix(vertex): propagate Vertex AI metadata in streaming success callbacks Streaming calls assembled via stream_chunk_builder were missing vertex_ai_grounding_metadata and vertex_ai_url_context_metadata in standard_logging_object.response. Merge metadata from chunks into the assembled response and mirror non-streaming hidden_params on Gemini chunks. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(vertex): move streaming metadata merge into provider config hook Address review feedback by delegating assembled-stream metadata propagation to VertexGeminiConfig via BaseConfig.apply_assembled_streaming_response_metadata, and only write chunk hidden_params when metadata is non-empty. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(redaction): scrub Vertex provider metadata when message logging is off Clear vertex_ai_grounding_metadata and related fields from standard logging responses and assembled streaming ModelResponse objects so turn_off_message_logging cannot leak prompt-derived web search queries. Co-authored-by: Cursor <cursoragent@cursor.com> * Use assembled model for streaming metadata hook * Fix Vertex metadata redaction bypass in logging callbacks. Scrub Vertex provider fields from litellm_params.metadata.hidden_params during perform_redaction so streaming success_handler merges do not leak prompt-derived metadata when message logging is disabled. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix Vertex streaming metadata from hidden params * fix(vertex): mirror vertex_ai_safety_results on assembled streaming responses The non-streaming transform_response stores safety data under vertex_ai_safety_results, but the streaming path only wrote vertex_ai_safety_ratings. Assembled streaming responses therefore never carried vertex_ai_safety_results, so any consumer reading that field saw a silent difference between streaming and non-streaming calls. Set vertex_ai_safety_results alongside vertex_ai_safety_ratings in the shared stream metadata setter and add it to the assembled metadata field list so it propagates through stream_chunk_builder. * fix(streaming): log provider streaming metadata hook failures instead of swallowing them * refactor(vertex): share single Vertex metadata field tuple across redaction and streaming * refactor(vertex): move Vertex metadata redaction helpers into llms/vertex_ai --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> |
||
|
|
1c881eee5d
|
fix(fireworks): enable tool calling for glm-5p1 in model cost map (#29697)
glm-5p1 supports native tools on Fireworks; explicit false flags caused drop_params to strip tools and tool_choice before the provider request. Co-authored-by: Cursor <cursoragent@cursor.com> |
||
|
|
9ccda11919
|
fix(team_endpoints): don't block /team/update on unchanged team budget (#29525)
On /team/update for a standalone (no-org) team, _check_user_team_limits() compared the request max_budget against the caller's personal max_budget whenever max_budget was present in the payload. A team admin whose personal budget is lower than the team's budget could not edit any field (tpm_limit, team name, etc.) because the UI re-sends the unchanged max_budget on every update, tripping the personal-budget check. Pass the team's current max_budget into _check_user_team_limits() and skip the personal-budget comparison when the incoming value is unchanged or lower than the team's current budget. Only genuine increases above the team's current budget are still validated against the caller's personal limit, so no over-relaxation. Proxy admins and the org-scoped path are unaffected. Adds two regression tests for the standalone update path (unchanged budget + tpm_limit change, and lowering the budget), both for a caller whose personal budget is below the team budget. Co-authored-by: Cursor <cursoragent@cursor.com> |
||
|
|
a7ecf6b5b1
|
feat(jwt-auth): opt-in fallback to DB team on unresolved JWT claim (#28913)
* fix(jwt-auth): defer to single-team DB fallback on claim mismatch Extends the single-team DB fallback introduced in #26418 to two more cases where it previously could not run: * `find_and_validate_specific_team_id`: when `team_id_jwt_field` is configured and a claim value is present in the token but the team does not exist in the LiteLLM DB (HTTPException 404 from `get_team_object`), return `(None, None)` instead of raising — the auth_builder fallback then attributes the request to the user's single DB team. Only HTTPException is caught; other errors (e.g. "No DB Connected") still propagate. * `find_team_with_model_access`: when none of the `team_ids_jwt_field` groups resolve to a real LiteLLM team, return `(None, None)` instead of raising 403 so the same fallback path runs. If at least one group DID resolve to a team but none granted the requested model, the original 403 is preserved (legitimate access denial — not a claim mismatch). Tracked via the new `any_claim_team_resolved` flag. The strict `is_required_team_id` raise and `enforce_team_based_model_access` raise remain unchanged. Unit tests cover both new soft-fail paths and guard each preserved path (strict required, enforce_team_based, the preserved 403, and the non-HTTPException propagation). Co-authored-by: Cursor <cursoragent@cursor.com> * fix(jwt-auth): narrow HTTPException catch to 404 (greptile review) Address Greptile review comments on #28913: * `find_and_validate_specific_team_id`: re-raise HTTPException when `status_code != 404`, pinning the catch to the "team doesn't exist in db" path documented for `get_team_object`. A future change that introduces a different status code (e.g. 403 for a blocked team) will now propagate instead of silently falling through to the single-team DB fallback. * Add `test_find_and_validate_specific_team_id_non_404_http_exception_propagates` parametrised over 400 / 403 / 500 to lock in the contract. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(jwt-auth): gate claim-mismatch fallback behind opt-in flag The unresolved-team-claim fallback added in the previous commit weakened the strict claim-based authorization contract by default — an authenticated user whose JWT carries a stale or invalid team claim could still consume their single DB team's models/quota via the fallback. Gate both soft-fail paths in `find_and_validate_specific_team_id` and `find_team_with_model_access` behind a new opt-in flag `team_claim_fallback` on `LiteLLM_JWTAuth` (default False). Default-off preserves the pre-existing strict behavior. Operators who intentionally treat IdP team claims as advisory (e.g. machine tokens whose group claims live in a separate namespace from LiteLLM team_ids) opt in via config. Adds two regression tests guarding the default-off behavior. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> |
||
|
|
69a7bdb247
|
fix(model-management): allow deleting a BYOK model after its team is deleted (#29875)
* fix(model-management): allow deleting a BYOK model after its team is deleted A team BYOK model (model_info.team_id set) became undeletable once its team was deleted: POST /model/delete ran can_user_make_model_call, which looked the team up and raised 400 "Team id=... does not exist in db" before the delete could run, so the model lingered on the Models + Endpoints page with no way to remove it. Drop the team-existence prerequisite from the delete path. When the model's team still exists the normal auth check runs unchanged; when it is gone a proxy admin may delete the orphan and any other caller gets a 403. The check is fail-closed, so a missing or errored team lookup can only block the delete or require an admin, never grant a non-admin access. Add/update/health keep their team-existence validation. * refactor(model-management): drop redundant team lookup on model delete Move the orphaned-team handling into can_user_make_model_call behind an allow_missing_team flag instead of pre-checking team existence in delete_model. The endpoint no longer issues its own litellm_teamtable lookup, so deleting a model whose team still exists hits the team table once instead of twice. The auth behavior is unchanged: a proxy admin can delete a model whose team was deleted, any other caller gets a 403, and add/update/health keep the strict "team must exist" validation. |
||
|
|
dfb68a23de
|
feat(galileo): add health check support for UI callback test (#29908)
* feat(galileo): add health check support for UI callback test Register galileo in /health/services so the proxy UI callback connection test works. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(galileo): verify API key via /current_user health check Call Galileo's current_user endpoint so the UI callback test validates credentials against the provider. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(ui): regenerate schema.d.ts for galileo health service Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): return IntegrationHealthCheckStatus from async_health_check Fixes mypy assignment error in health_services_endpoint where response was narrowed to IntegrationHealthCheckStatus from earlier branches. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix Galileo logging to match Langfuse across all endpoint types. Stop skipping ingest when output is empty and log embeddings with a placeholder so embedding, speech, and other non-text responses are recorded like Langfuse. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(galileo): remove unreachable health-check guard and None output sentinel The use_v2_api flag is derived from bool(api_key), so the inner GALILEO_API_KEY check inside the v2 branch could never run; collapse the credential validation into the username/password path with a combined message. _serialize_galileo_output now returns an empty string for None, so _get_galileo_input_output_content always yields a str and the post-call None coalescing guard is no longer needed. * test(galileo): cover async_health_check failure paths and empty model response Add regression tests for the Galileo health check unhealthy branches (missing project id, missing base url, missing credentials, auth failure, and request exception) and for logging a model response with no choices, which now queues an empty output instead of being skipped. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> |
||
|
|
32c88ca74f
|
Litellm oss staging 080626 (#29932)
* feat(bedrock_mantle): add SigV4/IAM auth to Responses API route (fixes #29665) (#29788) * feat(responses): add default no-op sign_request to BaseResponsesAPIConfig * feat(responses): call sign_request after body is final, send signed bytes when signed * feat(bedrock_mantle): add SigV4 sign_request via composed BaseAWSLLM (bearer path) * test(bedrock_mantle): cover SigV4 access-key, AssumeRole, body bytes, region/auth consistency * feat(bedrock_mantle): defer auth to sign_request; validate_environment no longer requires bearer * docs(bedrock_mantle): document SigV4 + Bearer auth on Responses route * test(responses): cover fake-stream signing order and mantle bearer arg/env precedence * fix(bedrock_mantle): wrap all botocore credential errors with both-paths guidance * fix(bedrock_mantle): catch specific credential errors, not all BotoCoreError, so STS transport failures are not masked * fix(bedrock_mantle): sign the compact Responses route too, not just create * fix(github-copilot): route per-model on /v1/responses based on model info (#29747) * feat(focus): add GCS destination for FOCUS export (#29751) * test: add failing tests for FocusGCSDestination * feat: add FocusGCSDestination reusing GCSBucketBase auth * feat: register FocusGCSDestination in factory; export from __init__ * fix(focus): preserve GCS_PATH_SERVICE_ACCOUNT when service_account_json not in config * style: apply Black formatting to gcs_destination and tests * style: apply Black formatting to factory.py * fix(bedrock): omit empty additionalModelRequestFields and system from Converse API payload (#29565) Amazon Nova Pro (and other strict Bedrock models) return 400 Malformed input request when additionalModelRequestFields: {} or system: [] are present in the payload. Both fields are optional in CommonRequestObject (total=False) and must be omitted rather than sent as empty structures. Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(proxy): recognize *.cognitiveservices.azure.com as OpenAI-compatible in pass-through cost tracking (#29730) * fix(proxy): recognize *.cognitiveservices.azure.com as OpenAI-compatible Azure OpenAI resources created via the newer "Azure AI Foundry" / Cognitive Services pathway live on `*.cognitiveservices.azure.com` subdomains, not the older `openai.azure.com`. Both are valid Azure OpenAI surfaces in production today. The OpenAI pass-through cost-tracking handler hard-codes only the older hostname in five places (four `is_openai_*_route` methods on OpenAIPassthroughLoggingHandler, plus is_openai_route on PassThroughEndpointLogging). As a result, calls from newer Azure deployments are silently classified as "not an OpenAI route", the dispatch into the cost-tracking handler is skipped, and tokens/cost never get extracted into LiteLLM_SpendLogs — the row gets written with prompt_tokens=0, completion_tokens=0, spend=0, model='unknown'. Reproduced 2026-06-04 against a real Azure OpenAI deployment on `*.cognitiveservices.azure.com` proxied through LiteLLM v1.88.0. Fix: factor the hostname check into a single helper `_is_openai_compatible_host` listing all three recognized surfaces (api.openai.com, openai.azure.com, cognitiveservices.azure.com), and have all five call sites delegate to it. Purely additive — never weakens recognition for the originally-supported hostnames. Adds a test `test_is_openai_route_recognizes_cognitiveservices_azure_com` that exercises all four `is_openai_*_route` static methods against `*.cognitiveservices.azure.com` URLs (positive cases per route + a small cross-route negative to confirm route-specific path matching still works on the new hostname). Out of scope for this PR (separate followup): - `openai_passthrough_handler` calls chat/completions `transform_response` on Responses API payloads (`output:` not `choices:`), which throws inside the dispatch and drops the SpendLogs row entirely. Recognized + tracked separately. * ci: trigger fresh run Empty commit to re-run checks. The previous auth-and-jwt failure was a transient HuggingFace Hub 429 rate-limit hitting tokenizer downloads in tests/proxy_unit_tests/test_custom_tokenizer_bug.py — unrelated to this PR's scope (hostname recognition in pass-through cost tracking). No code change. --------- Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * fix(responses): preserve forced-function tool_choice name in Responses to Chat transform (#29812) The Responses API forces a specific function with a top-level name ({"type": "function", "name": "X"}), but _transform_tool_choice only handled the nested Chat Completions shape and fell through to returning "required" for the flat form, silently dropping the function name and degrading a forced function call to force-any-tool. Map the flat Responses shape to the nested Chat shape, keeping the "required" fallback when no name is present. * Preserve x-anthropic-billing-header system blocks for first-party Anthropic (#29584) * Preserve x-anthropic-billing-header system blocks for first-party Anthropic PR #20951 strips system blocks beginning with "x-anthropic-billing-header:" for every Anthropic target. That block is how the first-party Anthropic API recognizes Claude Code subscription (OAuth) traffic, so dropping it makes requests that carry only that block, such as the auto-mode tool-safety classifier, fail with a misleading 429 rate_limit_error; normal turns still work because they also carry the "You are Claude Code" identity block. Gate the strip behind should_strip_billing_metadata(), defaulting to False on the first-party AnthropicConfig and AnthropicMessagesConfig so the block is kept, and overridden to True on the providers that reach these transforms and reject the block (Bedrock platform, Vertex, Azure for the chat path; Minimax, Azure, DeepSeek for the messages path). Behavior for those providers is unchanged. * Strip billing header on Bedrock invoke and Vertex messages pass-through Two more subclasses reach the gated strip but inherited keep-by-default. AmazonAnthropicClaudeConfig (Bedrock invoke) calls AnthropicConfig.transform_request, which calls translate_system_message, and VertexAIPartnerModelsAnthropicMessagesConfig (Vertex messages pass-through) calls super().transform_anthropic_messages_request. Override should_strip_billing_metadata() to True on both. Add a parametrized test asserting the flag for every first-party base (False) and provider subclass (True), covering all overrides, plus a translate_system_message regression test for the Bedrock invoke path. * fix(cache): log hashed cache keys (#29890) * fix(ui): save routing groups as list (#29889) * Revert "fix(ui): save routing groups as list (#29889)" (#29928) This reverts commit 9b1f78ffa7a309cabe5e9a7ab5f94d1224d192c9. * feat(parasail): add Parasail as a JSON-configured OpenAI-compatible provider (#29842) * feat(parasail): add Parasail as a JSON-configured OpenAI-compatible provider Registers parasail in the openai_like JSON provider loader with both /v1/chat/completions and /v1/responses support. Parasail's Responses API rejects store:true and any request that omits store, so the loader gains a force_store_false special_handling flag; the parasail entry sets it and the generated Responses config overrides store=false on every call. This keeps callers from hitting "State storage not supported" and matches what Parasail's docs require. Adds the PARASAIL enum value, listing under openai_compatible_providers, provider documentation at docs/my-website/docs/providers/parasail.md, and a focused unit test file under tests/test_litellm/llms/parasail/ that covers JSON registration, chat URL construction, Responses URL construction with PARASAIL_API_BASE override, and the force_store_false regression in both the caller-sent-store=true and caller-omitted cases. * fix(parasail): register in provider_endpoints_support, drop in-repo docs Greptile review feedback. The provider doc belongs in the litellm-docs repo, not this one's docs/my-website tree; removing it here. Adds the parasail entry to provider_endpoints_support.json so the check_provider_folders_documented.py CI check passes (chat_completions and responses true; others false). * fix: normalize Anthropic passthrough server tool usage (#29827) * test(anthropic): cover server_tool_use dict cost tracking * fix: normalize Anthropic server tool usage (cherry picked from commit 982f726bed7d3ec05e463c5dd3d090bebae91d19) * fix: keep server tool usage subscriptable (cherry picked from commit 70280b9b272455b2f974d08bc697f67f929755bf) --------- Co-authored-by: Genmin <joey@joeyroth.com> * fix(proxy): fix typo generic_role_mappoings -> generic_role_mappings in ui_sso.py (#29753) Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> * feat(proxy): add disable_budget_reservation general setting (#27639) (#29493) * feat(proxy): add disable_budget_reservation general setting (#27639) * feat(proxy): register disable_budget_reservation in ConfigGeneralSettings (#27639) * docs(proxy): document disable_budget_reservation concurrency tradeoff (#27639) * ci: re-trigger flaky docker build (prisma generate ECONNRESET) * fix(proxy): warn and document budget enforcement tradeoff when disable_budget_reservation is set (#27639) * feat(gemini_tts): adding support to Gemini TTS languageCode parameters (#29623) * Adding support to Gemini TTS Language Code parameters * Mapping Gemini TTS languageCode param in Docstring * Use snake_case for language_code input keyMapping Gemini TTS languageCode param in Docstring * Restoring files modified under enterprise/litellm_enterprise due to lint/formatting checks --------- Co-authored-by: João Garrido <joaogarrido@google.com> * feat(guardrails): capture user and model metadata in CrowdStrike AIDR (#29517) * fix(proxy): require OpenAI path segment for shared Azure Cognitive Services domains Address Greptile review: the `*.cognitiveservices.azure.com` / `*.openai.azure.com` domains are shared by every Azure Cognitive Service (Speech, Vision, Language, ...), so a hostname-only substring match misclassified non-OpenAI Azure traffic as OpenAI routes. - Replace the substring host test with suffix matching (rejects look-alike domains like cognitiveservices.azure.com.attacker.example). - Add `_is_openai_compatible_url` that requires an OpenAI-style path marker (`/openai/` or `/v1/`) on the shared Azure domains, and use it in PassThroughEndpointLogging.is_openai_route (previously hostname-only). - Add negative tests for Azure Speech/Vision paths and look-alike domains. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix: support Responses input in Redis semantic cache (#29581) * fix: support responses input in redis semantic cache * test: cover redis semantic prompt extraction * test: handle blank redis semantic text fallbacks * chore: remove async cache dead statement * test: cover redis semantic cache miss paths * fix: filter sensitive cache lookup kwargs * chore: rerun ci after huggingface rate limit * chore(ui): regenerate dashboard API types (npm run gen:api) Sync src/lib/http/schema.d.ts with the proxy OpenAPI spec: adds the disable_budget_reservation general-settings field and picks up the RateLimitError docstring reindent. Fixes the gen:api CI drift check. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test(bedrock): assert empty additionalModelRequestFields is omitted The Converse transformer now drops an empty additionalModelRequestFields block instead of sending it as `{}`. Update test_bedrock_top_k_param so models without top_k support (llama3) assert the key is absent rather than equal to an empty dict. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Kent <72616338+kingdoooo@users.noreply.github.com> Co-authored-by: codgician <15964984+codgician@users.noreply.github.com> Co-authored-by: Praveen Ghuge <95286176+pghuge-cloudwiz@users.noreply.github.com> Co-authored-by: Roi <roytev@gmail.com> Co-authored-by: shin-berri <shin-laptop@berri.ai> Co-authored-by: yuneng-jiang <yuneng@berri.ai> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Liam Scott <liam@uilliam.com> Co-authored-by: abhay23-AI <abhaytrivedi22@gmail.com> Co-authored-by: Ceder Dens <cederdens@gmail.com> Co-authored-by: 冯基魁 <56265583+fengjikui@users.noreply.github.com> Co-authored-by: Kai Huang <kaihuang724@gmail.com> Co-authored-by: rinto <54238243+ririnto@users.noreply.github.com> Co-authored-by: Genmin <joey@joeyroth.com> Co-authored-by: Arnav Bhilwariya <arnavbhilwariya0408@gmail.com> Co-authored-by: Armaan Sandhu <74664101+Ar-maan05@users.noreply.github.com> Co-authored-by: João Garrido <48538534+johngarrido@users.noreply.github.com> Co-authored-by: João Garrido <joaogarrido@google.com> Co-authored-by: Kenan Yildirim <kenan@kenany.me> Co-authored-by: Dávid Balatoni <balcsida@gmail.com> |
||
|
|
1528f43d4c
|
fix(mcp): let non-creator users OAuth into OBO-mode MCP servers from the Tools page (#29867)
* fix(ui): let non-creator users OAuth into OBO-mode MCP servers from the Tools page * fix(ui): clear OBO Tools-tab one-shot on navigate-back and gate on credential-status errors |
||
|
|
1afc41cb29
|
fix(ui): unify migrated-route URLs and migrate the API Reference page (#29953)
* fix(ui): unify migrated-route URLs and cut the API Reference page over to path routing Route all migrated-page navigation through one /ui-prefixed, serverRootPath-aware builder in migratedPages.ts (migratedHref/legacyPageHref/legacyKeyForPathname), replacing the three divergent base-URL constructions that lived in the dashboard layout's withBase, leftnav, and the page.tsx redirect. The previous migratedHref read NEXT_PUBLIC_BASE_URL, which no build sets, so it produced URLs without the /ui prefix the app is served under; every other internal link hardcodes /ui and this now matches that convention. Remove the sidebar's own pushState navigation so the parent (legacy root page or dashboard layout) is the single owner of navigation, fixing the double-navigate that fired when moving between a path route and a legacy ?page= route. Cut API Reference over to its path route: add api_ref -> api-reference to MIGRATED_PAGES and delete its arm from the legacy switch. Visiting /ui/?page=api_ref redirects to /ui/api-reference, the sidebar links to and highlights it, and navigating away returns to the legacy switch. * fix(ui): address review on migrated-page routing Keep the legacy hyphenated ?page=api-reference form working by mapping it to the api-reference route alongside api_ref; the old switch matched both, so a bookmark using the hyphen would otherwise fall through to the Usage default. Add legacyKeyForPathname coverage: a migrated path (with and without trailing slash) resolves to the api_ref sidebar key rather than the alias, a non-migrated path returns null, and a non-root serverRootPath prefix is stripped before matching. * fix(ui): populate serverRootPath from getUiConfig so migrated nav links keep the root path getUiConfig updated proxyBaseUrl but never called updateServerRootPath, so the module-level serverRootPath stayed at its "/" default. Under a custom server_root_path the unified migratedHref/legacyPageHref builders then dropped the prefix and the sidebar produced /ui/api-reference (404) instead of /<root>/ui/api-reference. Adds the missing updateServerRootPath call plus a regression test asserting getUiConfig sets serverRootPath and that migratedHref carries the prefix |
||
|
|
728f057c5e
|
fix(ui): label default key type as "Full Access" on key edit page (#29870)
The key edit page showed the default key type (no allowed_routes restriction) as "Default", while the key creation form already labels the same value "Full Access". Align the edit page to the create form so the two surfaces agree on both the label and its description. |
||
|
|
47b383dbbf
|
fix(ui): keep create guardrail modal open on outside click (#29871)
The create guardrail modal used antd's default maskClosable, so clicking
outside it dismissed the modal and reset every field the user had entered.
Setting maskClosable={false} keeps the modal open; it now closes only via
the explicit close button or Cancel, matching the other form modals in the
dashboard
|
||
|
|
26fe26a5c0
|
fix(ui/model-hub): render provider icons on the public model hub (#29958)
The provider logo base path was relative ("../ui/assets/logos/"). With
trailingSlash enabled, the public model hub is served at /ui/model_hub_table/,
so the browser resolved the base to /ui/ui/assets/logos/ (a doubled /ui/), which
404s every icon. The authenticated hub renders inside the single-level /ui/ SPA
route where the relative path resolves correctly, so only the public hub broke.
Make the base root-absolute so it resolves at any route depth.
|
||
|
|
ff6cea4833
|
refactor(ui): single source of truth for migrated-page routing (#29949)
Consolidate the three hand-synced copies of the migrated-pages map (LEGACY_REDIRECTS in app/page.tsx, MIGRATED_PAGES in the dashboard layout, and MIGRATED_PAGES in leftnav) into one shared module, src/utils/migratedPages.ts, which also owns the migratedHref helper. Delete the unused, incomplete Sidebar2 prototype. No runtime behavior change: the map is still empty and Sidebar2 had no importers, so this is pure deduplication ahead of the per-page App Router migration. Follow-up work will unify the remaining base-URL builders (layout's withBase and page.tsx's redirect) onto migratedHref. |
||
|
|
f5b11b72a6
|
feat(proxy): publish /v2/model/info in Swagger OpenAPI spec (#29900)
* feat(proxy): publish /v2/model/info in Swagger OpenAPI spec Expose the v2 model info endpoint in /docs by removing include_in_schema=False and documenting query parameters used by the admin UI and proxy CLI consumers. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(ui): regenerate schema.d.ts for /v2/model/info OpenAPI docs Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> |
||
|
|
aaf1e2444b
|
feat(ui): include internal routes in the dashboard's generated OpenAPI types (#29885)
The dashboard calls UI-internal proxy routes that the public /openapi.json hides with include_in_schema=False, so they never reached schema.d.ts and could not be typed. The type generator now force-includes those routes when it dumps the spec for openapi-typescript; this mutates a throwaway interpreter only, so the spec the proxy actually serves is unchanged. Regenerates schema.d.ts so 86 internal route families (for example /v2/model/info, /global/spend/*, /config/*, /v2/login, /sso/*) are now typed, with no public route removed. This unblocks migrating the dashboard's data fetching onto the typed $api client. Branch CI note: schema.d.ts is generated; CI regenerates and diffs it via the same gen:api script. |
||
|
|
5e2db7eee4
|
feat(litellm): add models and repository layers (#29686) | ||
|
|
118176f21a
|
refactor(bedrock): build Converse toolSpec via a BedrockToolSpec dict subclass (#29869) | ||
|
|
3448bf79f8
|
fix(ui): default guardrails page to first tab for admins, not submitted (#29872)
The Guardrails page hardcoded defaultActiveKey="submitted", so admins landed on the "Submitted Guardrails" tab (the last of their four tabs) instead of the primary view. The original intent was for non-admins, whose only tab is Submitted Guardrails, to default there; admins should open on their first tab. Make the default role-aware: admins default to the first tab (Guardrail Garden), non-admins keep Submitted Guardrails. |
||
|
|
13924fa1d6
|
feat: standardize rate limit errors with category, rate_limit_type, model, and llm_provider fields (#27687)
* feat(exceptions): add RateLimitErrorCategory + headers/detail fields on RateLimitError
LiteLLM previously surfaced rate-limit conditions through several unrelated
error classes (RateLimitError, FastAPI HTTPException(429), BaseLLMException).
This commit adds the data model needed to consolidate them under a single
class:
* RateLimitErrorCategory enum exposing four categorical values
(vendor_rate_limit, vendor_batch_rate_limit, litellm_rate_limit,
litellm_batch_rate_limit) so callers can switch on the rate-limit source.
* New optional fields on RateLimitError:
- category (defaults to vendor_rate_limit, preserving today's behavior for
every existing call site in exception_mapping_utils);
- headers (preserves retry-after / rate_limit_type / reset_at across the
proxy boundary instead of dropping them on the floor);
- detail (mirrors FastAPI HTTPException.detail so the same instance can be
serialized through both paths).
litellm.RateLimitErrorCategory is re-exported at the package root to match
the existing exception-export pattern.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* feat(proxy): add ProxyRateLimitError unifying RateLimitError + HTTPException
Adds a single proxy-side error class that subclasses BOTH
litellm.exceptions.RateLimitError AND fastapi.HTTPException via cooperative
multiple inheritance.
Why both bases:
* Subclassing RateLimitError lets user code catch every rate-limit source
with one 'except RateLimitError' and switch on the new .category field.
* Subclassing HTTPException keeps every existing FastAPI plumbing path (the
isinstance(e, HTTPException) branches in proxy_server.py route handlers,
FastAPI's own dispatcher, and tests asserting pytest.raises(HTTPException))
working without modification, and preserves retry-after / rate_limit_type /
reset_at headers on the wire.
The class declaration order is (HTTPException, RateLimitError) so the MRO
puts HTTPException's no-super-call __init__ ahead of openai's cooperative
__init__ chain — preventing openai.APIError.super().__init__(message) from
landing in HTTPException.__init__(status_code=message).
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* refactor(proxy/hooks): raise ProxyRateLimitError from budget + iteration limiters
Replaces three bare HTTPException(status_code=429, ...) call sites with
ProxyRateLimitError, which is both a RateLimitError (catchable by category)
and an HTTPException (preserves existing FastAPI serialization). Drops the
now-unused HTTPException import in the iteration / per-session limiters.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* refactor(proxy/hooks): raise ProxyRateLimitError from parallel-request limiters
Replaces HTTPException(status_code=429, ...) call sites in the v1 and v3
parallel-request limiters (key/team/user/model/customer rate limits) with
ProxyRateLimitError. Updates the raise_rate_limit_error helper's return type
annotation accordingly.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* refactor(proxy/hooks): raise ProxyRateLimitError from dynamic rate limiters
Replaces HTTPException(status_code=429, ...) call sites in the v1 and v3
dynamic rate limiters (project-level TPM/RPM allocation, model-saturation
checks, priority-based limits, fail-closed guards) with ProxyRateLimitError.
The v3 limiter still imports HTTPException for an unrelated bare 'except
HTTPException:' branch.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* refactor(proxy/hooks): raise ProxyRateLimitError from batch rate limiter
Replaces HTTPException(status_code=429, ...) in batch_rate_limiter._raise_rate_limit_error
with ProxyRateLimitError tagged as RateLimitErrorCategory.LITELLM_BATCH_RATE_LIMIT
so users can distinguish batch-level throttling (which counts requests/tokens
across an uploaded batch input file before submission) from the generic
key/team/user RPM/TPM limiter.
The HTTPException import is retained because the same module raises
HTTPException for unrelated 403/IO error paths.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* test(rate-limit): pin down unified rate-limit error contract
Adds a dedicated test module covering the new RateLimitErrorCategory enum,
RateLimitError.category default + override behavior, ProxyRateLimitError's
dual nature (RateLimitError + HTTPException), and a parametrized regression
guard that asserts every proxy hook module imports the unified class.
The regression guard catches the failure mode the refactor is designed to
prevent: someone re-introducing a bare HTTPException(status_code=429, ...)
in one of the hook modules instead of going through ProxyRateLimitError.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* feat(logging): expose rate-limit category via StandardLoggingPayload
Adds an optional 'error_rate_limit_category' field to
StandardLoggingPayloadErrorInformation, populated from the unified
RateLimitError.category attribute (introduced in the previous commits on
this branch).
Why: the .category attribute is reachable off the raw exception today via
getattr(e, 'category', None), but the structured contract that downstream
custom callbacks / loggers / spend log writers consume is the
StandardLoggingPayload. Without this field, a user building custom
rate-limit metrics on top of callback data has to special-case the raw
exception object — which defeats the purpose of the StandardLoggingPayload
abstraction.
The field is None for non-rate-limit exceptions (so consumers can read it
unconditionally without isinstance checks) and is one of the
RateLimitErrorCategory string values otherwise.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* test(rate-limit): assert StandardLoggingPayload carries the category
Five tests covering: vendor default, explicit litellm_rate_limit and
litellm_batch_rate_limit values, None for non-rate-limit exceptions, and
None when no exception is provided. Pins down the contract that custom
callbacks can read 'error_information.error_rate_limit_category' off the
StandardLoggingPayload to drive custom rate-limit metrics without ever
reaching for the raw exception.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(types): silence mypy [misc] on intentional dual-base attr overlap
mypy emits two [misc] errors on the ProxyRateLimitError class line because
its two bases declare overlapping attributes with related-but-not-identical
annotations:
* status_code: int on starlette HTTPException vs. Literal[429] on openai's
RateLimitError (every openai status-error subclass narrows it the same
way and silences pyright with the same convention).
* headers: Mapping[str, str] | None on HTTPException vs. our Optional[
Dict[str, str]] (the proxy hooks always carry a stringified dict).
Both narrowings are intentional and enforced at construction time. Add a
type: ignore[misc] with an inline explanation rather than relax the
annotations on the parent or change the wire-format guarantees.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* test(rate-limit): add direct hook-invocation tests to lift patch coverage
Adds six end-to-end tests that drive each refactored hook past its
limit and assert the unified ProxyRateLimitError is raised with the
correct category and dual-base shape. Complements the
import-shape-only parametrized guard above by actually executing the
new 'raise ProxyRateLimitError(...)' lines so codecov's patch coverage
sees them as hit.
Hooks covered (one test each):
* parallel_request_limiter v1 — direct call to raise_rate_limit_error()
* parallel_request_limiter v3 — direct call to _handle_rate_limit_error
with a fabricated OVER_LIMIT response
* max_iterations_limiter — full async_pre_call_hook with mocked agent
registry, second call exceeds budget=1
* max_budget_limiter — async_pre_call_hook with mocked get_current_spend
* dynamic_rate_limiter v1 — async_pre_call_hook with mocked
check_available_usage forcing available_tpm == 0
* batch_rate_limiter — direct _raise_rate_limit_error call, asserts
category is the batch-specific LITELLM_BATCH_RATE_LIMIT (not the
generic LITELLM_RATE_LIMIT)
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix: guard rate_limit_category extraction with isinstance check
* test(rate-limit): cover remaining hook raise sites for codecov
Adds five more direct hook-invocation tests so every PR-touched line
in the proxy hooks is exercised by tests in tests/test_litellm/, which
codecov measures:
* parallel_request_limiter v1 — check_key_in_limits inline raise
(the second raise site, separate from the raise_rate_limit_error
helper covered earlier)
* dynamic_rate_limiter v1 — RPM raise branch (TPM branch was already
covered)
* dynamic_rate_limiter v3 — parametrized over all three raise sites:
model_saturation_check, priority_model, and the fail-closed
fallback for an unrecognized descriptor_key
* max_budget_per_session_limiter — full async_pre_call_hook with a
mocked agent registry and over-budget cached spend
All 42 tests in test_rate_limit_error_unification.py now pass and
together exercise every changed import + raise line across the eight
refactored proxy hooks.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix: use computed error_message in ProxyRateLimitError detail
* fix(parallel-request-limiter): drop None from detail; annotate raise_rate_limit_error as NoReturn
The v1 ' raise_rate_limit_error' helper built an unused 'error_message'
variable and then assembled the actual ' detail' via an f-string that
interpolated 'additional_details' verbatim — producing
'Max parallel request limit reached None' when invoked without
arguments (flagged by code review).
Fix the helper to:
- use the constructed 'error_message' as the detail
- annotate the helper as NoReturn since it always raises
- drop the redundant 'raise'/'return' at the two call sites
Add two regression tests covering both the with- and without-
additional_details paths.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(proxy/hooks): drop literal 'None' from raise_rate_limit_error detail
The v1 parallel_request_limiter's raise_rate_limit_error helper has a
long-standing bug: it computes a None-guarded 'error_message' string but
then ignores it and emits an f-string that interpolates the raw
'additional_details' arg. Callers that pass no argument get
'Max parallel request limit reached None' as the user-facing detail.
This commit:
* wires error_message into the detail kwarg so the None-guard actually
applies and operators see a clean message;
* changes the return-type annotation from ProxyRateLimitError to NoReturn
(the function always raises) so type-checkers know callers after this
invocation are unreachable.
Greptile P1 + P2 review feedback on PR #27687.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(types): demote TypedDict floating string to a # comment
A string literal placed after a field declaration in a TypedDict body is
not a per-field docstring — it's an orphaned string expression Python
discards. Tools like mypy / pyright that inspect TypedDict fields won't
surface that text either.
Move the documentation for error_rate_limit_category to a real comment
so the intent is visible to readers and type-checker tooling without
the misleading docstring framing.
Greptile P2 review feedback on PR #27687.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* security(exceptions): do not auto-copy vendor response headers to e.headers
A vendor 429 response can set arbitrary headers (Set-Cookie, CORS
overrides, …). Previously, when RateLimitError was constructed with only
a 'response=' (no explicit 'headers=' kwarg), self.headers fell back to
a copy of response.headers. If a downstream proxy serializer ever
forwarded e.headers to the client, a malicious upstream could inject
browser-interpreted headers for the proxy origin.
Drop the fallback. Only headers passed explicitly via the headers= kwarg
make it onto self.headers (proxy hooks pass retry-after etc. — they
control what's surfaced). Vendor response headers stay reachable on
e.response.headers for callers that explicitly want them.
Today's proxy_server.py route handlers don't actually forward e.headers
on the wire (they construct ProxyException without passing headers), so
no current behavior changes — this is a defensive narrowing so the
fallback can never be turned into a vector when someone wires
e.headers through later.
Veria-AI security review feedback on PR #27687.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* test(rate-limit): regression guards for review-pass fixes
Pins down the three review-pass fixes:
* test_parallel_request_limiter_v1_helper_no_additional_details — calls
raise_rate_limit_error() with no args and asserts the detail does NOT
contain the literal string 'None'. Pre-fix, callers got 'Max parallel
request limit reached None'.
* test_rate_limit_error_does_not_auto_copy_response_headers — passes a
vendor httpx.Response with a Set-Cookie header to RateLimitError
WITHOUT an explicit headers= kwarg, asserts self.headers stays None
(no leak), then re-checks that an explicit headers= kwarg DOES
populate self.headers. Vendor headers remain reachable on
e.response.headers for callers that explicitly want them.
* The existing v1-helper test now also asserts the additional_details
string makes it through to the detail.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* feat(rate-limit): add orthogonal RateLimitType (requests/tokens/concurrent_requests/budget/max_iterations)
trho's last ask in the LIT-2968 thread: distinguish rate-limit failures by
the dimension that was exceeded, not just by who rate-limited (vendor vs.
litellm). Adds:
- RateLimitType str-enum exposed at `litellm.RateLimitType` with values
requests / tokens / concurrent_requests / budget / max_iterations.
- `rate_limit_type` kwarg on litellm.RateLimitError + ProxyRateLimitError;
None default so existing callers (vendor-429 path in exception_mapping_utils)
remain a no-op.
- StandardLoggingPayloadErrorInformation.error_rate_limit_type so custom
callbacks can split rate-limit failures by cause without parsing free-text
error messages. Mirror to error_rate_limit_category extraction in
get_error_information(); single isinstance(RateLimitError) check covers both.
- map_v3_rate_limit_type() helper to collapse the v3 limiter's internal labels
("requests", "tokens", "max_parallel_requests") onto the public enum so
the v3 limiter and dynamic_rate_limiter_v3 share one mapping. Defensive
None on unknown values rather than silently picking a wrong dimension.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* feat(proxy/hooks): wire rate_limit_type onto every limiter raise site
Each refactored proxy hook now populates rate_limit_type with the dimension
that actually tripped the limit, so downstream consumers (custom callbacks,
prometheus exporters via the StandardLoggingPayload) can split key/team/user
rate-limit failures by cause:
- parallel_request_limiter (v1): detect dimension from current vs. limit in
the post-cache branch (concurrent_requests > tokens > requests, matches the
boolean condition order). Base case (current is None, one limit set to 0)
picks the most-specific zero. raise_rate_limit_error() helper accepts an
explicit rate_limit_type kwarg with CONCURRENT_REQUESTS default (matches
every existing internal call site, including the global-limit branch).
- parallel_request_limiter (v3): forward status["rate_limit_type"] through
map_v3_rate_limit_type() so "max_parallel_requests" → CONCURRENT_REQUESTS
for the public field while the raw v3 jargon stays on the HTTP header for
wire-format backward compat.
- dynamic_rate_limiter (v1): TPM-zero → TOKENS, RPM-zero → REQUESTS. Pass
data["model"] through so callbacks see the model that hit the limit
(addresses the secondary "provider missing" complaint in the original
Slack thread, partially — the model is what dashboards typically split on).
- dynamic_rate_limiter (v3): forward status["rate_limit_type"] via
map_v3_rate_limit_type() at every raise site (model_saturation_check,
priority_model, fail-closed unknown-descriptor guard). Also pass model.
- batch_rate_limiter: limit_type is hard-typed "requests"|"tokens" — map
directly without going through the helper's None branch.
- max_budget_limiter, max_budget_per_session_limiter: BUDGET.
- max_iterations_limiter: MAX_ITERATIONS.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* test(rate-limit): cover RateLimitType enum, hook wiring, and StandardLoggingPayload propagation
27 new tests across five new test classes:
- TestRateLimitType: enum exposed at litellm.RateLimitType, all five values
defined, RateLimitError default is None (vendor 429 path makes no claim
about which dimension), accepts both string and enum forms with
str-coercion guarantee for downstream JSON serializers.
- TestProxyRateLimitErrorType: ProxyRateLimitError default is None, accepts
string or enum, doesn't break existing callers that pass nothing.
- TestMapV3RateLimitType: pins each v3-internal → public-enum mapping
(tokens, requests, max_parallel_requests → concurrent_requests, unknown
→ None) so a future v3 refactor can't silently swap dimensions.
- TestStandardLoggingPayloadCarriesType: the new error_rate_limit_type
field reaches the structured payload for both ProxyRateLimitError and
plain RateLimitError, is None when unspecified, and is None for
non-rate-limit exceptions (symmetric with error_rate_limit_category).
- TestProxyHooksWireTypeCorrectly: drives the actual raise sites in the
v1 parallel_request_limiter helper, the v3 _handle_rate_limit_error
(both "tokens" and "max_parallel_requests" paths), and the batch
limiter (both tokens and requests paths) — coverage tools see the new
rate_limit_type= kwargs as exercised, not just the import shape.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* test(rate-limit): cover _coerce_message branches and v1 dimension detection
Drives the patch coverage on the new orthogonal RateLimitType wiring up
to (or close to) 100% on the touched files.
ProxyRateLimitError._coerce_message — was 22% covered, now 100%:
* nested {error: {message}} dict
* nested {message: {message}} dict (alt key)
* dict without 'error'/'message' keys → JSON dump fallback
* non-JSON-serializable dict value → str() fallback
* non-string non-mapping detail (int) → str() coercion
v1 parallel_request_limiter dimension detection — was 0% covered, now
exercised across 6 parametrized cases:
* check_key_in_limits else-branch: current at concurrent / TPM / RPM cap
→ asserts rate_limit_type is concurrent_requests / tokens / requests.
* check_key_in_limits base case (current is None): max_parallel_requests
/ tpm_limit / rpm_limit set to 0 → asserts the most-specific zero
attribution wins per the helper's order.
LIT-2968
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* feat(proxy/hooks): add ProxyHTTPRateLimitError + provider resolver
Introduces a small helper layer used by every proxy-side rate-limit
hook so that the 429 they raise carries a populated llm_provider /
model — instead of an empty exception.llm_provider that downstream
loggers (Prometheus failure metric, observability callbacks) read as
'no provider attribution'.
ProxyHTTPRateLimitError inherits from both fastapi.HTTPException
(so the proxy server still renders it as a 429) and
litellm.exceptions.RateLimitError (so isinstance checks and
PrometheusLogger._get_exception_class_name pick up llm_provider).
We deliberately don't call RateLimitError.__init__ — it constructs
an httpx.Response we don't need and would just add failure surface;
attribute parity is what downstream consumers care about.
resolve_llm_provider_for_rate_limit() wraps litellm.get_llm_provider
defensively. Internal limiter hooks fire from async_pre_call_hook —
well before get_llm_provider runs anywhere else in the request
lifecycle — so we have to call it ourselves at raise time. If the
model is missing or unparseable (alias, router-only model) we fall
back to llm_provider='litellm_proxy' rather than letting a second
exception leak out and break the request path.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(proxy/hooks): populate llm_provider on parallel-request 429s
Both v1 and v3 parallel-request limiters fired bare HTTPException(429)
from inside async_pre_call_hook. The downstream Prometheus failure
metric reads exception.llm_provider via _get_exception_class_name —
the empty value showed up as exception_class='HTTPException' and
left model_id='None' on the time series.
Threads requested_model through every raise site in:
* parallel_request_limiter.py:
- check_key_in_limits (the per-key/per-model/per-user/per-team/
per-customer over-limit path)
- raise_rate_limit_error (zero-limit + global_max_parallel_requests
paths) — now takes an optional requested_model kwarg
* parallel_request_limiter_v3.py:
- _handle_rate_limit_error (the OVER_LIMIT translator), called
from both the should_rate_limit pre-check and the TPM
reservation path
Resolved via resolve_llm_provider_for_rate_limit so unknown / missing
models silently fall back to llm_provider='litellm_proxy' instead of
breaking the request path with a second exception.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(proxy/hooks): populate llm_provider on dynamic-rate-limit 429s
Same plumbing change as the parallel limiters, applied to both
dynamic_rate_limiter (v1) and dynamic_rate_limiter_v3:
* v1: TPM-zero and RPM-zero paths in async_pre_call_hook now resolve
data['model'] -> (model, llm_provider) once and pass it into both
raises.
* v3: All three raise sites in _check_rate_limits — the
model_saturation_check enforced raise, the priority_model
enforced raise, and the fail-closed unknown-descriptor branch —
now attribute the 429 to the actual provider.
Falls back to llm_provider='litellm_proxy' when the model can't be
resolved.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(proxy/hooks): populate llm_provider on batch-rate-limit 429s
batch_rate_limiter._raise_rate_limit_error now takes a
requested_model kwarg threaded from data['model'] in
_check_and_increment_batch_counters. The batch-creation 429 is what
gets raised when the input file's tokens/requests count would push
the per-key TPM/RPM window over its limit.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(proxy/hooks): populate llm_provider on budget/iterations 429s
Final batch of internal raise sites — the user/session-budget and
max-iterations hooks. Same pattern: resolve data['model'] once at
raise time, attach to ProxyHTTPRateLimitError so Prometheus and
observability callbacks can attribute the 429.
Hooks updated:
* max_budget_limiter (per-user max_budget exceeded)
* max_iterations_limiter (per-session agent iteration cap)
* max_budget_per_session_limiter (per-session dollar cap)
All three fall back to llm_provider='litellm_proxy' when data['model']
is missing or unparseable. Drops the now-unused HTTPException import
from each module.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* test(proxy/hooks): pin provider field on internal rate-limit 429s
Regression coverage for the 'provider field missing' bug across every
proxy-side rate-limit hook + the helper layer:
* ProxyHTTPRateLimitError class shape (HTTPException + RateLimitError,
dict-detail stringification, None-provider normalization).
* resolve_llm_provider_for_rate_limit happy paths
(gpt-4o-mini, anthropic/..., bedrock/...) plus all three fallback
branches (None, '', unknown name) plus a 'get_llm_provider raises'
case that asserts we swallow the secondary exception.
* For each limiter (parallel v1/v3, dynamic v1/v3, batch,
max_budget, max_iterations, max_budget_per_session): assert the
raised exception is a RateLimitError carrying the resolved
model + llm_provider, and a sibling test that asserts the
fallback path returns 'litellm_proxy' without leaking a second
exception.
* Two PrometheusLogger._get_exception_class_name pins so the
Prometheus failure metric label flips from 'HTTPException' to
'Openai.ProxyHTTPRateLimitError' (or 'Litellm_proxy.*' on
fallback) — that's what dashboards consume.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* perf(proxy/hooks): defer provider resolution to over-limit branches
* fix: use error_message in raise_rate_limit_error to avoid literal 'None' in detail
* Consolidate rate_limiter_utils imports in dynamic_rate_limiter
* fix(proxy): set num_retries/max_retries on ProxyHTTPRateLimitError
ProxyHTTPRateLimitError inherits from RateLimitError but did not call
RateLimitError.__init__, so num_retries/max_retries were never set.
When Starlette's HTTPException lacks __str__, MRO falls through to
RateLimitError.__str__, which unconditionally reads these attributes
and raises AttributeError during logging/traceback formatting.
Initialize them to None defensively.
* fix(mypy): silence base-class status_code conflict on ProxyHTTPRateLimitError
HTTPException declares 'status_code: int' while openai.RateLimitError
(via APIStatusError) declares 'status_code: Literal[429] = 429'. Mypy
flags the multi-base override as [misc] in CI lint. The runtime semantics
are fine (we set self.status_code in __init__), so silence the
class-level annotation conflict with a targeted ignore.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix: annotate batch limiter _raise_rate_limit_error as NoReturn
* feat(prometheus): rate-limit category/type labels + exception_class back-compat (follow-up to #27687) (#27706)
* feat(prometheus): add rate_limit_category and rate_limit_type labels
Adds two new labels to litellm_proxy_failed_requests_metric so dashboards
can split 429s by rate-limit source (vendor vs. litellm-internal) and by
the dimension that was exceeded (requests/tokens/concurrent_requests/
budget/max_iterations) without parsing free-text error messages.
Closes the Prometheus side of LIT-2718. The unified RateLimitError.category
and .rate_limit_type fields landed in PR #27687 but were only surfaced on
StandardLoggingPayload (custom-callback channel); this exposes them on
the metric label set as well.
Both labels are populated only when the underlying exception is a
litellm.RateLimitError; non-rate-limit failures keep them empty.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* feat(prometheus): populate rate-limit labels + preserve exception_class back-compat
Two coupled changes in the Prometheus integration:
1. async_post_call_failure_hook now extracts the new RateLimitError
.category / .rate_limit_type fields (added in PR #27687) via a
_extract_rate_limit_labels helper and forwards them through
UserAPIKeyLabelValues onto litellm_proxy_failed_requests_metric.
Empty for non-rate-limit failures.
2. _get_exception_class_name special-cases ProxyRateLimitError and
keeps emitting 'HTTPException' for the exception_class label.
Without this shim, ProxyRateLimitError (which multi-inherits from
HTTPException + RateLimitError) would silently flip the label
from 'HTTPException' (the historical value for proxy-side 429s)
to 'ProxyRateLimitError', breaking existing dashboards / alerts
that key off exception_class='HTTPException'. Distinguishing
vendor vs. litellm 429s is now the job of the new
rate_limit_category label.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* test(prometheus): cover rate-limit labels and exception_class back-compat
Adds 19 tests across:
- enum / label-list registration
- _extract_rate_limit_labels for vendor RateLimitError, ProxyRateLimitError,
non-rate-limit and None inputs (incl. parametrized over every
RateLimitErrorCategory x RateLimitType combo)
- _get_exception_class_name back-compat: ProxyRateLimitError keeps the
legacy 'HTTPException' string while vendor RateLimitError keeps the
historical 'Provider.ClassName' format
- end-to-end through async_post_call_failure_hook with both
ProxyRateLimitError and vendor RateLimitError, asserting both new
labels populate and exception_class stays back-compat
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(prometheus): tolerate missing fastapi in lazy ProxyRateLimitError import
Address greptile feedback:
- async_post_call_failure_hook docstring: drop the stale labelnames listing
and reference PrometheusMetricLabels.litellm_proxy_failed_requests_metric
as the source of truth so the doc cannot drift from the actual labelset.
- _get_exception_class_name: guard the lazy ProxyRateLimitError import with
ImportError so router-side fallback callsites don't blow up in non-proxy
installs that don't have fastapi (a transitive dep of
proxy.common_utils.proxy_rate_limit_error). Behavior is unchanged when
fastapi is available.
Also fix the existing enterprise callback test that asserted the old
labelset on litellm_proxy_failed_requests_metric — it now expects the new
rate_limit_category / rate_limit_type labels populated for vendor 429s.
---------
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(bugbot): simplify rate-limit label coercion + guard None detail
- prometheus.py _extract_rate_limit_labels: RateLimitError.__init__ already
normalizes category/rate_limit_type to plain str, so the getattr(.value)
+ isinstance dance was dead code. Reduce to str(value) if not None.
- proxy_rate_limit_error.py _coerce_message: short-circuit None to ''
instead of falling through to str(None) = 'None', which produced the
literal message 'litellm.RateLimitError: None'.
* fix(rate-limit): surface unified category/type fields on BudgetExceededError
The most common budget cap (virtual-key max_budget enforcement in
auth_checks.py) raises litellm.BudgetExceededError, a bare Exception
subclass that bypassed the unified rate-limit error class introduced
by PR #27687. Custom callbacks reading
StandardLoggingPayload.error_information saw category=None and
rate_limit_type=None for these 429s, missing the most common budget
case (team / org / end-user budgets all hit the same code path).
Surface the fields off BudgetExceededError as plain attributes:
- category = RateLimitErrorCategory.LITELLM_RATE_LIMIT
- rate_limit_type = RateLimitType.BUDGET
- llm_provider = "" (or caller-supplied)
Switch get_error_information and _extract_rate_limit_labels from
isinstance(RateLimitError) gating to duck-typed attribute reads,
guarded by membership in the rate-limit enums so unrelated third-party
exceptions exposing a .category attribute can't leak garbage values
into the payload.
This is strictly additive: BudgetExceededError keeps its bare-Exception
base class, so `except BudgetExceededError:` handlers keep firing and
`except RateLimitError:` does not start catching budget errors.
* fix(rate-limit): validate enum membership at duck-typed read sites + enrich BudgetExceededError llm_provider
Two follow-ups uncovered during the second QA pass on PR #27687:
1. Guard third-party `.category` / `.rate_limit_type` attribute leakage.
The duck-typed read in `get_error_information` and
`_extract_rate_limit_labels` would forward any string attribute named
`category` / `rate_limit_type` on an unrelated third-party exception
into the StandardLoggingPayload and Prometheus labels — silently
mislabeling custom-callback payloads and blowing out Prometheus label
cardinality. Add `validate_rate_limit_category` /
`validate_rate_limit_type` helpers that gate on the documented enum
value sets; non-matching values are dropped to None.
2. Enrich BudgetExceededError.llm_provider from request_data.
Budget checks live in tenant-scoped helpers (key / team / org / tag /
end-user / project) that don't see the request model, so the
BudgetExceededError they raise carried llm_provider="" — leaving
custom-metrics consumers without provider attribution for the most
common 429 case. Resolve it once at the central
UserAPIKeyAuthExceptionHandler seam, before post_call_failure_hook
fires, so the StandardLoggingPayload the callback sees has the same
provider attribution as RPM/TPM 429s.
Regression tests pin both: 4 leakage tests + 4 enrichment tests. The
leakage tests would fail under the pre-validation version of either read
site; the enrichment tests would fail if the handler skipped the
resolver call.
* fix(rate-limit): resolve router model_name aliases to real provider (#27914)
* fix(rate-limit): resolve router model_name aliases to real provider
For nearly every real LiteLLM proxy deployment the request model is a
router model_name alias (e.g. 'tpm-locked' -> litellm_params.model:
openai/gpt-4o-mini), and 'litellm.get_llm_provider' doesn't know about
router aliases — it raises 'LLMProviderNotProvidedError'. The resolver
then fell through to the defensive 'litellm_proxy' fallback, so the
'llm_provider' field this PR adds was effectively always
'litellm_proxy' in the field, defeating its purpose for the most common
proxy configuration.
Add a router-alias fallback step: when 'get_llm_provider' raises, scan
the active 'llm_router.model_list' for a deployment whose 'model_name'
matches the request model and resolve from its 'litellm_params.model'
instead. If multiple deployments share the same alias (load-balancing
case) the first one wins — every deployment under one alias should
agree on provider in any sensible config, and 'first' is deterministic
so the Prometheus label stays stable.
Defensive throughout: an uninitialized router, a malformed deployment,
a 'litellm_params.model' that itself fails 'get_llm_provider' — every
branch falls through to the existing 'litellm_proxy' fallback rather
than letting a secondary exception escape and mask the rate-limit
error we're trying to surface.
Tests:
- test_router_alias_resolves_to_underlying_provider: alias
'tpm-locked' -> 'openai/gpt-4o-mini' produces provider='openai',
model='gpt-4o-mini'.
- test_router_alias_with_multiple_deployments_uses_first.
- test_router_alias_unknown_falls_back.
- test_router_alias_with_malformed_deployment_falls_back.
- Existing fallback test updated to also stub
'litellm.proxy.proxy_server.llm_router' so it exercises the
full 'no resolution anywhere' path.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(rate-limit): harden router alias resolver + test isolation
- Wrap _resolve_provider_from_router_alias loop in top-level try/except so
a non-iterable model_list / unexpected deployment shape can't escape and
mask the 429 with a 500.
- Type-check litellm_params before .get() to handle non-dict truthy values.
- Patch llm_router=None in the parametrized fallback test so a router left
by another test in the session can't redirect the unknown-model path.
---------
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix(bugbot): preserve "BudgetExceededError" Prometheus label
Adding llm_provider to BudgetExceededError (so callbacks get provider
attribution from StandardLoggingPayload) made the provider-prefix step in
_get_exception_class_name silently flip the label from "BudgetExceededError"
to e.g. "Openai.BudgetExceededError", breaking dashboards keyed on the
historical value.
Short-circuit BudgetExceededError in _get_exception_class_name the same way
ProxyRateLimitError already is. Provider/category attribution still lands on
the new rate_limit_category / rate_limit_type labels.
* test: fix invalid 'rpm' rate_limit_type in v3 limiter test mocks
The v3 rate limiter only emits 'requests', 'tokens', or
'max_parallel_requests'. Using 'rpm' caused map_v3_rate_limit_type to
return None, leaving the expected RateLimitType.REQUESTS untested.
Co-authored-by: Yassin Kortam <yassin@berri.ai>
* fix(bugbot): hoist provider resolver + opt-in prom rate-limit labels
- dynamic_rate_limiter.py: hoist resolve_llm_provider_for_rate_limit
above the TPM/RPM if/elif so the lookup runs once per request, matching
the pattern in dynamic_rate_limiter_v3.py.
- prometheus.py: gate the new rate_limit_category / rate_limit_type
labels on litellm_proxy_failed_requests_metric behind
litellm.prometheus_emit_rate_limit_labels (default False). Mirrors the
existing prometheus_emit_stream_label opt-in. Preserves the metric's
pre-unification label set so existing dashboards / recording rules
keep matching after upgrade; operators can enable the new labels once
downstream consumers include them.
- Tests updated: default-off back-compat case, opt-in path enables the
flag before asserting label presence.
* fix: stabilize prometheus label sets and drop redundant model normalization
- Cache PrometheusLogger.get_labels_for_metric per metric_name so that
the label set used to construct counters at __init__ time stays in
sync with the label set used at increment time, even if module-level
toggles like prometheus_emit_rate_limit_labels or
prometheus_emit_stream_label are flipped at runtime. Without this,
toggling these flags after the logger was created would cause
ValueError from prometheus_client because the runtime labels would
not match the counter's declared labelnames.
- Drop redundant 'model or ""' guard in ProxyRateLimitError.__init__
where model is already normalized one step earlier.
Co-authored-by: Yassin Kortam <yassin@berri.ai>
* perf(dynamic_rate_limiter): only resolve provider when rate limit hit
Co-authored-by: Yassin Kortam <yassin@berri.ai>
* test(prometheus): clear cached metric labels after toggling rate-limit flag
The PrometheusLogger caches each metric's label set at construction
time so that labels used at counter.labels(...) time stay consistent
with the labels the metric was registered with. The enterprise
async_post_call_failure_hook test toggles
litellm.prometheus_emit_rate_limit_labels = True AFTER the fixture
has already built the logger, so without invalidating the cache the
rate_limit_category / rate_limit_type labels never reach the mocked
counter and the assert_called_once_with check fails.
Co-authored-by: Yassin Kortam <yassin@berri.ai>
* test: fix CI failures from prom label cache + flaky time-window assertion
PrometheusLogger.get_labels_for_metric now caches the per-metric label
set at first read so the labels passed to counter.labels(...) stay in
lock step with the labels the counter was registered with. This broke
two existing test patterns:
- test_prometheus_labels.py: tests bind the real method onto a
MagicMock, but MagicMock auto-creates a Mock for _cached_metric_labels
whose .get(...) returns a truthy Mock — treated as a populated cache
and returned as the label set, producing empty filtered labels and
KeyError on labels["requested_model"] / ["route"]. Seed real {}
containers for _cached_metric_labels and label_filters before binding.
- test_prometheus_logging_callbacks.py::test_set_team_budget_metrics_with_custom_labels:
the fixture builds the logger before the test monkeypatches
litellm.custom_prometheus_metadata_labels, so the cached label set
never picks up the new metadata labels. Clear the cache after the
monkeypatch (same pattern already used for the rate-limit toggle in
test_async_post_call_failure_hook).
UI: view_logs/index.test.tsx "Last Minute" window assertion is off by
one at the minute boundary. start_date is floored to the minute, so the
dropped sub-minute fraction can push the truncated-seconds diff up to
(minMinutes+1)*60 exactly when the click lands near a minute rollover.
Switch the upper bound to toBeLessThanOrEqual.
* feat(otel-v2): surface rate_limit_category + rate_limit_type on failed LLM-call spans
PR #28909 introduced the typed v2 OTel engine that builds spans from
StandardLoggingPayload, with SpanError carrying error_type + message and
the genai mapper stamping error.type onto every failed LLM-call span.
This PR's earlier commits added error_rate_limit_category and
error_rate_limit_type to the same StandardLoggingPayload.error_information
the v2 engine reads — but neither field reached a span attribute, so v2
OTel traces stayed opaque about *why* a 429 fired (vendor vs litellm,
RPM vs TPM vs concurrent vs budget vs max_iterations) even after the
custom-callback and prometheus surfaces gained that decomposition.
Three coupled changes:
1. semconv.py: add LiteLLM.ERROR_RATE_LIMIT_CATEGORY /
LiteLLM.ERROR_RATE_LIMIT_TYPE under the litellm.* vendor namespace
(no GenAI semconv equivalent exists for who-rate-limited /
which-dimension).
2. payloads.py: extend SpanError with rate_limit_category +
rate_limit_type, populated by _parse_error() from the same
error_information.error_rate_limit_* fields the custom-callback
channel and prometheus rate_limit_category / rate_limit_type labels
read. Single source of truth across all three observability surfaces.
3. mappers/genai.py: stamp the two attributes on the LLM-call span when
present. drop_none guarantees they stay absent (not 'None') for
non-rate-limit failures so trace consumers can read them
unconditionally.
Three regression tests in test_otel_v2_emitter.py pin: a vendor /
litellm-internal RateLimitError lands category=litellm_rate_limit +
rate_limit_type=requests on the span; a BudgetExceededError lands
rate_limit_type=budget; a non-rate-limit failure (BadRequestError)
keeps the rate_limit_* attributes absent. Mutation-tested against
reverting either the SpanError extension or the _parse_error read site
— both new tests fail under either mutation.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* test: align prometheus user-budget + logs quick-select tests with merged code
The merge into this branch left two test patterns out of step with the code
they exercise.
test_set_user_budget_metrics_includes_user_email_and_alias_labels_when_opted_in
flipped litellm.prometheus_user_budget_label_include_email_alias after the
fixture had already built the PrometheusLogger. get_labels_for_metric now
snapshots each metric's label set at construction time, so the runtime flip
no longer reached the cached labels. Enable the flag before constructing the
logger, matching how the proxy applies config at startup.
view_logs/index.test.tsx referenced uiSpendLogsCall and moment without
importing them, and the merged index.tsx now fetches through
useLogFilterLogic (the hook the file stubs out) rather than calling
uiSpendLogsCall directly. Add the imports and restore the real hook for the
Quick Select window assertions so the call is actually observed.
* refactor(otel/v2): drop rate-limit decomposition from the LLM-call span
Proxy-side rate limits (litellm_rate_limit, budget, max_iterations) are
rejected at the gate before any upstream call, so async_post_call_failure_hook
tags the synthetic failure log with LITELLM_LOGGING_NO_UPSTREAM_LLM_CALL and the
v2 OTel logger never opens an LLM-call span for them; the
litellm.error.rate_limit_category / litellm.error.rate_limit_type attributes
were dead for exactly the cases they were meant to surface. The only failure
that does open an LLM-call span carrying a RateLimitError is a vendor 429, where
rate_limit_type is always None and the category just restates
error.type=RateLimitError.
The decomposition still reaches downstream consumers through
StandardLoggingPayload.error_information.error_rate_limit_* and the prometheus
rate_limit_category / rate_limit_type labels, both unchanged.
Removes the SpanError fields, the _parse_error reads, the genai mapper
attributes, the semconv keys, and the three span tests that asserted a scenario
that never reaches the mapper in production.
* fix(batch_rate_limiter): map max_parallel_requests to concurrent_requests
* refactor(prometheus): drop transitive fastapi import from _get_exception_class_name
Read the legacy exception_class label from a prometheus_exception_class_name
marker on ProxyRateLimitError instead of importing the proxy module, keeping
the integrations layer free of a transitive fastapi dependency.
* chore(ui): sync schema.d.ts with unified rate-limit error spec
The ProxyRateLimitError docstring flows into the proxy OpenAPI spec's 429
response description, so the generated dashboard types were out of sync.
Regenerated via npm run gen:api (Check UI API Types Sync).
---------
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
|
||
|
|
7bfce053a9
|
fix(ui): make workflow runs page fill full width (#29868)
The Workflow Runs page rendered its table at roughly a quarter of the available width. Its root container is a flex child of the dashboard content row but set only padding, min-height and background, so with no width it shrank to the table's natural content size. Sibling pages (logs, memory) fill the area with a full-width root; mirror that by setting width 100% on the container. Fixes LIT-3636 |
||
|
|
f31d059aa3
|
feat(ui): add budget duration to edit team member form (#29717)
* feat(ui): add budget duration to edit team member form Editing a team member created a member budget with no duration, so the budget never reset. This threads a budget reset period through the edit flow end to end and reuses the shared duration dropdown so the options stay in sync with the rest of the UI. Resolves LIT-2651 * fix(proxy): validate member budget_duration and persist clears Reject budget_duration values that can't be parsed, are non-positive, or overflow date math before any write, so a bad value can't be persisted and later crash the budget reset job. Clearing the budget duration in the edit-member form now sends null and clears the column end to end, so the dropdown's clear control reflects a real change instead of being a no-op * chore(ui): regenerate schema.d.ts for member budget_duration Adds budget_duration to TeamMemberUpdateRequest/Response in the generated dashboard types so the Check UI API Types Sync gate passes |
||
|
|
aeb55e7a11
|
fix(mcp): highlight MCP cards red when the logged-in user is missing per-user env vars (#29856)
* fix(mcp): flag missing per-user env vars on the card for every accessible server The dashboard MCP card grid lists servers via the registry-backed manager (get_all_mcp_servers_unfiltered for admins in view_all mode, the allowed-context aggregation otherwise), but the per-user env-var status endpoint that drives the red "user fields missing" highlight resolved servers through the much narrower get_all_mcp_servers_for_user, which only returns servers explicitly granted on the calling key. An admin's dashboard session key carries no per-server MCP grant, so the status feed came back empty and the card never turned red even when the logged-in user had not filled in their required variables. Both surfaces now share a single _resolve_accessible_mcp_servers helper, so the status feed is computed over exactly the cards the user sees. The helper returns servers unredacted; the status endpoint needs the raw env_vars and still only ever reports is_set booleans, never the stored secret values. * test(mcp): drop dead get_all_mcp_servers_for_user patch from view_all regression test The bulk status endpoint resolves servers through _resolve_accessible_mcp_servers now, so the old get_all_mcp_servers_for_user patch in the admin view_all regression test is never hit. Removing it keeps the test honest about which code path it exercises. |
||
|
|
d61f7747c0
|
feat(bedrock): forward strict and additionalProperties to Converse toolSpec (#29814)
* feat(bedrock): forward strict and additionalProperties to Converse toolSpec Bedrock Converse supports strict in toolSpec since 2026-02, but _bedrock_tools_pt only whitelisted type/properties/required/name/description, so strict: true was silently dropped and Claude-on-Bedrock ignored enum constraints that GPT and direct-Anthropic honored. Forward strict from the OpenAI function and additionalProperties from the schema (Bedrock requires the latter alongside strict), passing each only when present. https://claude.ai/code/session_01WQjWd8NfUB3vxERwudbHkv * fix(bedrock): only forward strict tool schemas to Claude on Converse Nova, Llama and GPT-OSS on Bedrock reject the strict field (BedrockException 'This model doesn't support the strict field'), and the GPT-OSS request-body test asserts strict/additionalProperties are stripped. Forwarding them to every model broke the llm_translation suite, so gate the forwarding on the anthropic base model since only Claude honours strict tool schemas on Bedrock. |
||
|
|
273855b4e2
|
fix(responses-bridge): map system-only chat request to system input item (#29817)
System-only chat requests mapped the system message to instructions and left input=[], which OpenAI's Responses API rejects (it also rejects input=""). When no other messages are present, carry the system message as a role:"system" input item (single copy, correct role) instead of leaving input empty. Mirrors the existing handling of non-string system content. Fixes Open WebUI new-conversation failures on mode:responses Codex models. Co-authored-by: Cursor <cursoragent@cursor.com> |
||
|
|
68d67212cd
|
fix: 400 on Anthropic context overflow; seed identity on failed auth (#29848) | ||
|
|
f1667b9137
|
chore(deps): bump deps (#29860)
* bump: version 0.4.73 → 0.4.74 * bump: version 1.88.0 → 1.89.0 * uv lock |
||
|
|
33c363d4d4
|
Extend the record/replay proxy to chat, embeddings, moderations, rerank, and Anthropic (#29847)
* test(ci): extend record/replay proxy to chat, embeddings, moderations, rerank, anthropic The record/replay proxy that took the gpt-image-1 spend E2E off the live OpenAI path now fronts every provider, so the other real-provider E2Es stop paying for and depending on live calls each commit. It keys per upstream and selects a non-OpenAI provider by a /__recorder_upstream/<host>/ path prefix carried on the model's api_base, since some litellm handlers (cohere rerank) drop custom request headers. Wired into build_and_test (chat, embeddings, moderations, image), the otel job (cohere rerank), and the anthropic-messages job via a reusable start_openai_record_replay_proxy command. Dropped the time.time()/uuid prompt cache-busters in the build_and_test chat tests, whose config has the response cache off, so identical requests are recordable. The image spend test now asserts a repeat call still bills spend, failing loudly if the proxy response cache is ever turned on. Responses, the anthropic passthrough, bedrock, and fake-endpoint tests are left live: their lifecycles, api_base assertions, providers, or fake targets make a stateless body-keyed cache either break them or add nothing. * docs(ci): note the recorder command's OpenAI default upstream and prefix override Addresses a review note: the shared start_openai_record_replay_proxy command defaults the upstream to OpenAI, so a non-OpenAI model must carry the /__recorder_upstream/<host>/ prefix on its api_base. Document that in the command description so a future caller does not assume the default follows the provider. |
||
|
|
38b28b96ff
|
fix(terraform/gcp): abandon SQL user on destroy (#29855)
google_sql_user.app issues DROP ROLE on destroy, which Postgres refuses because the role owns every table the migrations job created (75 objects). The previous deletion_policy=ABANDON on google_sql_database keeps the DB intact through destroy, so the role still owns its objects. Set the same policy on the user; the instance deletion takes both the database and the role with it anyway. |
||
|
|
43c10370ee
|
fix(terraform/gcp): prompt for image_registry in DeployStack one-click (#29852)
* fix(terraform/gcp): prompt for image_registry in DeployStack one-click The four litellm-* images live on GHCR and Cloud Run rejects ghcr.io URIs at apply time, so every deploy has to point image_registry at an Artifact Registry remote repo. The DeployStack installer didn't surface image_registry as a prompt, so a click-through user landed on the ghcr.io/berriai default and the apply failed ~20 min in, after Cloud SQL had already provisioned. Add image_registry to custom_settings with a PROJECT_ID-placeholder default and a description that flags the ghcr.io rejection so the failure happens at the prompt, not after billing the slow path. TUTORIAL.md is reworded to tell the user what to enter at the new prompt instead of "edit terraform.tfvars before applying". * fix(terraform/gcp): generalize image_registry default to any region Per Greptile feedback on #29852, the prior default hardcoded us-central1 and would silently produce a Cloud Run-incompatible image path for any deployment in another region. The user would substitute PROJECT_ID, miss the region segment, and reproduce the original late-apply failure. Use REGION as a second placeholder and tighten the prompt copy so both substitutions are mandatory. * fix(terraform/gcp): make destroy work without manual intervention Three Cloud Run v2 services and the migrations Cloud Run v2 job all default to deletion_protection=true at the provider level, which has no data-safety value on stateless resources and blocks terraform destroy with an error that can only be unstuck with a tfvars edit + apply roundtrip. Wire deletion_protection=false directly on all four; the operator-facing tripwire that matters is cloudsql_deletion_protection, which guards the only resource that actually holds data. The litellm Cloud SQL database also drops cleanly only if every connection is closed first. Cloud Run services and the migrations job hold connections open until they're torn down, so destroy races and fails with "database is being accessed by other users". Setting deletion_policy=ABANDON on the database resource lets terraform skip the explicit drop; the Cloud SQL instance deletion takes the database with it anyway. Together these turn destroy into a single command, matching the AWS stack's behavior. |
||
|
|
1975b9691a
|
chore: update Next.js build artifacts (2026-06-06 20:08 UTC, node v20.20.2) (#29853) | ||
|
|
1cff02f50e
|
refactor: convert AWS and GCP Terraform stacks into reusable modules … (#28103)
* refactor: convert AWS and GCP Terraform stacks into reusable modules with examples/default entry point
- Remove `provider` blocks from both AWS and GCP stack roots so the modules
can be consumed with `count`, `for_each`, `depends_on`, assumed-role or
aliased providers — patterns that are forbidden when a module owns its own
provider configuration
- Add `examples/default/` thin-root wrappers for both stacks that wire the
provider (AWS) / providers (google + google-beta) and call the module with
a curated variable surface, preserving the one-command deploy experience
- Move `terraform.tfvars.example` files into `examples/default/` alongside
the new roots; update example comments to reflect the curated variable surface
- Thread `local.tags` (containing `litellm:stack`, `managed-by`, and
`var.tags`) explicitly onto every taggable AWS resource since the module no
longer controls the provider's `default_tags`; GCP resource labels already
flow through the module's `labels` input
- Add `examples/default/variables.tf` and `outputs.tf` for both stacks,
exposing the most-used knobs and re-exporting all module outputs
- Commit provider lock files for both examples so `terraform init` is
reproducible without a network fetch
- Update top-level and per-stack READMEs to document the module-first design,
the `for_each` multi-tenant pattern, and the `examples/default/` quick-start path
* docs(terraform): address review — state-migration guide, tag dedupe, for_each note
- Add 'Migrating an existing deployment' section to AWS & GCP READMEs
documenting the required terraform state mv step (resource addresses now
gain a module.litellm. prefix under the examples/default root)
- Remove redundant managed-by tag from the AWS example providers.tf;
reserve default_tags there for org-wide tags only
- Document the for_each single-provider limitation for GCP (no
configuration_aliases) in the README and example main.tf
Resolves LIT-3504
* docs(terraform/gcp): note expected SSL cert replacement in state-migration guide
The managed SSL cert is named with a hash of lb_domains, so TLS-enabled
stacks that migrated from the old un-hashed name will see one
create_before_destroy cert replacement after terraform state mv — not a
clean 'No changes'. Document that this single replacement is expected and
safe.
* docs(terraform): drop state-migration guides
The AWS/GCP stacks have never been published, so there are no existing
deployments to migrate from the old root-module layout. Remove the
'Migrating an existing deployment' sections from both READMEs.
* docs(terraform): call out image-registry override required for GCP 1-click
The GCP stack's default image_registry points at ghcr.io, which Cloud
Run won't authenticate against, so any real deploy (HCP Terraform
no-code or otherwise) must override it. Document that as a hard
requirement on the GCP README rather than a side note, and add a
top-level HCP Terraform 1-click section enumerating the required
inputs per stack and the migration-task caveat for HCP-hosted runners.
* feat(terraform/aws): mount proxy_config from S3 and wire OpenTelemetry v2
proxy_config
Drop the inline LITELLM_PROXY_CONFIG_B64 env var. Upload the YAML to S3
at config/litellm-config.yaml; gateway and backend container entrypoints
download it to /tmp/litellm-config.yaml via boto3 before exec'ing
uvicorn. The S3 object etag is wired into the task definition so a
config edit produces a new task-def revision and a rolling redeploy. The
existing s3_access policy already grants the task role s3:GetObject on
this bucket, so no IAM changes were needed for the mount itself.
OpenTelemetry v2
New variables otel_endpoint, otel_exporter, otel_service_name, and
otel_headers_secret_arn. Setting otel_endpoint to a non-empty value adds
LITELLM_OTEL_V2=true plus OTEL_EXPORTER / OTEL_ENDPOINT /
OTEL_SERVICE_NAME / OTEL_ENVIRONMENT_NAME to the shared env block; an
optional Secrets Manager ARN backs OTEL_HEADERS for collectors that need
an auth header. Execution role auto-gains GetSecretValue on that ARN.
Empty endpoint = nothing added, so existing deployments are unchanged.
* feat(terraform/gcp): add DeployStack one-click installer
Wires up a Cloud Shell "Open in Cloud Shell" badge backed by the
GoogleCloudPlatform DeployStack flow so examples/default can be
installed from a click in the README without a local terraform setup.
- examples/default/deploystack.json drives project/region collection
plus prompts for tenant, env, image_tag, and allow_plaintext_lb.
Complex inputs (proxy_config, *_extra_secrets, lb_domains) and
sensitive vars (litellm_master_key, litellm_license, ui_password)
stay tfvars / env only so they never land in a committed file.
- examples/default/TUTORIAL.md is a Cloud Shell walkthrough that
enables required APIs, creates the GHCR-passthrough Artifact
Registry repo, optionally exports the TF_VAR_* secrets, runs
`deploystack install`, and shows how to fetch the master key plus
migrate from plaintext LB to TLS.
- Renames var.project to var.project_id across the module and the
examples/default wrapper to match the variable DeployStack injects
from `collect_project: true`. Breaking rename for anyone with a
`project = ...` line in terraform.tfvars; the fix is one line.
* feat(terraform/gcp): mount proxy_config from GCS and wire OpenTelemetry v2
proxy_config
Drop the inline LITELLM_PROXY_CONFIG_B64 env var and the python-decode
startup fragment. Upload the YAML to a dedicated GCS bucket as
config.yaml, then mount it read-only into the gateway and backend at
/etc/litellm via Cloud Run v2's gcsfuse volume. CONFIG_FILE_PATH points
at the mount; an md5 of the YAML rides along as PROXY_CONFIG_HASH so a
config-only edit forces a new Cloud Run revision (gcsfuse only surfaces
new objects on container restart, so without the hash an updated
proxy_config would sit in the bucket unread).
The config bucket is separate from the data-plane bucket so the runtime
SA can hold objectViewer here (read-only at runtime) while keeping
objectAdmin on the data-plane bucket. Both bucket and IAM binding are
gated on proxy_config != {}; an empty config skips bucket creation and
mounts nothing.
OpenTelemetry v2
LITELLM_OTEL_V2=true is now wired into shared_env_kv unconditionally so
both the gateway and backend boot with the integration enabled. It's
dormant until otel_endpoint is non-empty; setting it injects
OTEL_EXPORTER / OTEL_ENDPOINT / OTEL_ENVIRONMENT_NAME plus a
per-component OTEL_SERVICE_NAME (\${tenant}-litellm-\${env}-{gateway,backend})
so spans land tagged with the right hop. otel_headers_secret takes a
Secret Manager resource ID for OTEL_HEADERS (collector auth); the
runtime SA auto-gains roles/secretmanager.secretAccessor on it.
otel_capture_message_content defaults to no_content matching the litellm
default. Any OTEL_* key set in *_extra_env wins over the defaults so
Cloud Run doesn't reject the apply on the duplicate-env-name check.
* refactor(terraform): make AWS and GCP stacks behave identically
Bring both modules to the same surface and the same runtime behavior so
swapping clouds (or reading either README) is symmetric.
Labels and tags. GCP previously stamped var.labels onto only the two GCS
buckets, leaving Cloud Run, Cloud SQL, Memorystore, Secret Manager, and
the LB resources unlabeled; the variable description claimed full
coverage. Now the module computes local.labels (litellm-stack +
managed-by + var.labels, mirroring AWS's local.tags) and threads it onto
every label-supporting resource: Cloud Run services and the migrations
job, Cloud SQL writer and reader (via user_labels), Memorystore, Secret
Manager entries (master_key, license, ui_password, db_password), both
GCS buckets, the global LB address, and the http/https forwarding rules.
GCP keys use 'litellm-stack' instead of AWS's 'litellm:stack' because
GCP label keys forbid colons; var.labels now defaults to {}.
OpenTelemetry v2 is opt-in on both stacks. AWS already gated everything
on otel_endpoint; GCP previously stamped LITELLM_OTEL_V2=true into
shared_env unconditionally and only ungated the OTEL_* block. Both
stacks now do the same thing: leave otel_endpoint empty and nothing
OTel-related lands in the container env; set it and gateway and backend
get LITELLM_OTEL_V2=true plus OTEL_EXPORTER, OTEL_ENDPOINT,
OTEL_ENVIRONMENT_NAME, OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT,
and a per-component OTEL_SERVICE_NAME (${tenant}-litellm-${env}-gateway
or -backend) so spans land tagged with the right hop. AWS picks up the
richer GCP surface: otel_environment_name (defaults to var.env),
otel_capture_message_content (defaults to no_content), and *_extra_env
override filtering so a caller-set OTEL_* key wins over the default for
that service (ECS allows duplicates, but the filter gives the same
predictable last-wins shape Cloud Run enforces). var.otel_service_name
on AWS is gone, replaced by the per-component naming.
uvicorn workers. GCP gains gateway_num_workers, matching AWS; threads
into the gateway args as --workers ${var.gateway_num_workers}.
Docs reflect the parity: each README's OTel section, the GCP 'Using as
a module' Labels paragraph, and a new feature-parity table in the
top-level README that lays out the AWS/GCP input mapping side by side.
* fix(terraform/aws): expose skip_final_snapshot through the default example
The example wrapper already exposed `s3_force_destroy` so ephemeral / CI
stacks could destroy the S3 bucket without manual cleanup, but the matching
Aurora knob (`skip_final_snapshot`) was hidden behind the module surface.
That meant a `terraform destroy` on a trial stack still produced a
`<cluster>-final-<short-sha>` snapshot, with no opt-out short of editing the
module call.
Adds `var.skip_final_snapshot` to the example (default `false`, preserving
the data-loss tripwire) and threads it through to the module input,
mirroring the existing `s3_force_destroy` pattern. Documented alongside it
in the tfvars example.
Verified by deploying the example end-to-end against a clean AWS account
(VPC + Aurora w/ IAM auth + Redis + ALB + 3 ECS services), confirming all
services reach steady state and the data plane serves traffic, then running
`terraform destroy` with `skip_final_snapshot = true` to a clean teardown
(93 destroyed, no Aurora snapshot left behind, no leftover billable
resources).
---------
Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
Co-authored-by: yassin-berriai <yassin.kortam@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
|
||
|
|
fdade8a84e
|
Title: fix(proxy): resolve vector store file list credentials from team deployments (#29739)
* fix(proxy): resolve vector store file list credentials from team deployments
GET /v1/vector_stores/{id}/files now uses the same router credential routing as POST, including JWT team model hints and wildcard model selectors, so list requests no longer call OpenAI with Bearer None.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(proxy): authorize model hints and fix credential routing for vector store file list
Resolves three review findings on the vector store file list path.
Authorize user-controlled model hints (?model= query param and the
x-litellm-model header) against the key's and team's allowed models via
can_key_call_model / _can_object_call_model before any deployment
credentials are resolved, closing a model access bypass where a normal
key could file-list using a restricted deployment's provider credentials.
Run the managed vector store registry resolution before the model routing
hint so the managed store sets the routing model first; the hint resolver
then selects credentials matching that model instead of a team fallback
deployment, avoiding a credential/model mismatch across deployments.
Skip team-fallback deployments whose provider cannot be determined instead
of treating them as OpenAI, so a deployment without an explicit
custom_llm_provider or "openai/" prefix no longer has its credentials
injected.
* fix(proxy): enforce vector store file model auth
Ensure vector store file listing routes authorize explicit and inferred model routing before resolving deployment credentials.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(proxy): type guard vector store model hints
Keep vector store model hint authorization typed to string-only values so static checks pass.
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
|
||
|
|
1fbb78d2a4
|
Title: Fix managed batch cancel credential resolution (#29734)
* Fix managed batch cancel credential resolution Decode unified batch IDs before cancel routing and resolve litellm_credential_name to api_key in Router._acancel_batch so JWT team-scoped deployments cancel with the same credentials used at create time Co-authored-by: Cursor <cursoragent@cursor.com> * fix batch cancellation credential cleanup Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> |
||
|
|
51769a8ede
|
feat(fal_ai): add Nano Banana / Gemini 2.5 Flash Image generation support (#29798)
* feat(fal_ai): add Nano Banana / Gemini 2.5 Flash Image generation support Adds a FalAINanoBananaConfig for fal.ai's Nano Banana models, exposed under both fal-ai/nano-banana and fal-ai/gemini-25-flash-image (identical schema). This is the migration path for fal-ai/imagen4, which fal deprecates on 2026-06-30. The config derives the request endpoint from the model name so both aliases route correctly, maps OpenAI image params to the fal schema (n -> num_images, size -> nearest supported aspect_ratio, response_format ignored since the model returns URLs), and reuses the base fal response parser. Pricing is registered at 0.039 per image in the cost map and backup. * fix(fal_ai): tighten nano-banana routing and guard mapped params Match the specific gemini-25-flash-image / gemini-2.5-flash-image aliases instead of any model containing gemini so future fal.ai Gemini-branded models aren't silently misrouted to the nano-banana config. Guard the param mapping on the fal-side keys (num_images, aspect_ratio) so a pre-set mapped value is respected and an OpenAI key is never forwarded unmapped. * fix(fal_ai): drop non-existent gemini-2.5-flash-image routing alias fal.ai only serves the dotted-free fal-ai/gemini-25-flash-image and fal-ai/nano-banana endpoints. Routing the dotted gemini-2.5-flash-image alias built a https://fal.run/fal-ai/gemini-2.5-flash-image URL that fal.ai 404s and had no pricing entry, so spend tracking silently fell to zero. Match only the two real endpoint slugs. |