litellm/tests/test_litellm/proxy/test_common_request_processing.py
Sameer Kankute 3b40ac987f
Litellm oss 090626 (#30021)
* fix(mcp): report scoped server name during initialize (#29865)

* fix mcp scoped server name

* Update litellm/proxy/_experimental/mcp_server/mcp_context.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* test(mcp): cover scoped server name in the SSE initialize handler

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix(ui): show all session logs in the drawer, not just the first 50 (#29795)

* fix(ui): show newest session logs first

* test(ui): keep session log pagination coverage

* fix(ui): show all session logs in the drawer, not just the first page

The session detail drawer fetched session logs via sessionSpendLogsCall
without page/page_size, so it only ever received the backend default of one
page (50 rows). Sessions with more than 50 calls had the rest unreachable in
the UI (#29153).

sessionSpendLogsCall now takes page/page_size, and the drawer fetches the
first page, reads total_pages, then fetches the remaining pages and
accumulates them before the existing client-side sort. This keeps the single
continuous list (and the selected-log lookup and keyboard navigation, which
all assume the full session) correct. Fetching is bounded by a page cap, and
the sidebar shows a "showing most recent N" note if a session exceeds it.

The rows are lightweight metadata (the endpoint excludes messages/response),
so the full set is small; request/response bodies are still loaded per log on
demand.

* fix(ui): default session drawer to most recent log, newest first

Open a session with its most recent log selected, and order the sidebar
newest-first to match the all-sessions logs overview. MCP calls stay
grouped last. The latest log by time is computed explicitly, since the
MCP grouping means it is not always the first row.

* Apply fetching pages in batches suggestion from @greptile-apps[bot]

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix(ui): derive session total from accumulated rows when backend omits it

Compute the session total after all pages are fetched, falling back to the
accumulated row count rather than the first page's. Guards the truncation
note against a backend response that omits total but spans multiple pages.

---------

Co-authored-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix(proxy): handle Mistral multipart passthrough (#29927)

* fix(proxy): handle Mistral multipart passthrough

* chore: satisfy passthrough ci formatting

* test(proxy): cover Mistral passthrough in CI shard

* fix(vertex_ai): use REP host for context caching on eu/us multi-region endpoints (#29573)

Context caching built the cachedContents URL as
https://{location}-aiplatform.googleapis.com, which is an invalid host for the
eu/us multi-region endpoints and returns 404. The inference path already
resolves these to the REP host (https://aiplatform.{geo}.rep.googleapis.com)
via get_vertex_base_url(); reuse that helper in
_get_token_and_url_context_caching so caching uses the same host as inference.

Adds tests covering the eu/us multi-region cachedContents URLs (v1 and
v1beta1).

Fixes #29571

* Support per-model encrypted content affinity config (#29760)

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>

* fix: propagate upstream status code in proxy API exception handler (#29402)

* fix: propagate upstream status code in proxy API exception handler

When Google GenAI / Vertex returns a 404 for deprecated or missing
models via streamGenerateContent, the exception was falling through to
a generic handler that defaulted to 500. Now provider exceptions
carrying a valid HTTP status_code correctly propagate it through to
the ProxyException.

* fix: apply black formatting to common_request_processing.py

* fix: tighten status code range to 400-599 and deduplicate ProxyException raise

* fix(tests): use valid vertex_location in context caching tests

Replace "test_location" (contains underscore) with "us-central1" so tests
pass the regex validation added in get_vertex_base_url().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(sdk): add xAI OAuth provider (#29866)

* Add xAI OAuth provider

* Update oauth.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Fix xAI OAuth CI failures

* Add xAI OAuth coverage tests

* Move xAI OAuth coverage tests to core utils

* Address xAI OAuth review comments

* Prevent xAI OAuth api_base token exfiltration

* Treat blank xAI OAuth api keys as absent

* Wrap invalid xAI OAuth JSON responses

* Use xAI OAuth behind explicit flag

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix(proxy) #27734 allow clearing budget_duration and team_member fields by sending null on /key/update and /team/update (#27751)

* fix(proxy): allow clearing budget_duration and team_member fields by sending null on /key/update and /team/update

Fixes #27734

Sending null for budget_duration, team_member_budget,
team_member_budget_duration, team_member_rpm_limit, or
team_member_tpm_limit via /key/update or /team/update returned 200 OK
but silently ignored the null value. The fields remained unchanged in
the database.

Root causes:
- /key/update: prepare_key_update_data() popped budget_duration from the
  update dict but never re-added it (or budget_reset_at) when the value
  was None.
- /team/update: _set_budget_reset_at() only acted when budget_duration
  was non-None, leaving a stale budget_reset_at in the DB.
- /team/update: team_member_* null values bypassed the budget table
  update entirely because should_create_budget() requires at least one
  non-None field.

* test(proxy): cover no-budget-row path in clear_team_member_budget_fields

* fix(presidio): unmask PII tokens in Anthropic native SSE streaming bytes (#30028)

* fix(presidio): unmask PII tokens in Anthropic native SSE streaming bytes

When output_parse_pii=true on the Anthropic native path (anthropic/claude-*),
response chunks arrive as raw bytes in SSE format. _stream_pii_unmasking was
yielding those bytes unchanged, so <PERSON_1> tokens were never replaced with
the original values before reaching the caller.

Add _unmask_sse_bytes_chunk to parse each data: line, find content_block_delta
/ text_delta events, and apply _unmask_pii_text before re-encoding. Wire it
into _stream_pii_unmasking so bytes chunks are unmasked when pii_tokens exist.

* fix(presidio): handle CRLF line endings and non-ASCII PII in SSE unmask

Strip trailing \r before the [DONE] guard so CRLF-terminated SSE chunks
don't bypass it and silently swallow a JSONDecodeError. Add
ensure_ascii=False to json.dumps so non-ASCII replacement values like
accented names are preserved as UTF-8 on the wire rather than being
\uXXXX-escaped. Add regression tests for both cases.

* feat(bedrock_mantle): path-aware Responses routing (/v1/responses vs /openai/v1/responses) (#29925)

* feat(bedrock_mantle): path-aware Responses routing (/v1/responses vs /openai/v1/responses)

Bedrock Mantle serves the Responses API on two upstream paths:
  - gpt frontier models (gpt-5.5 / gpt-5.4) on /openai/v1/responses
  - every other Responses-capable model (e.g. gpt-oss) on the standard /v1/responses

BedrockMantleResponsesAPIConfig gains a `use_openai_path` flag; the provider gate in
utils.py picks the path per model: openai.gpt-* (non gpt-oss) -> /openai/v1/responses;
any model declared mode=responses (price-map entry or user model_info) -> /v1/responses;
everything else returns None and keeps the existing chat-completions emulation.

Adds gpt-5.5 / gpt-5.4 price-map entries, registry wiring, and the routing-matrix tests.

* feat(bedrock_mantle): data-driven frontier routing via use_openai_responses_path

Addresses the Greptile review point that frontier detection should be a
price-map field rather than a hardcoded name match. The gate now routes a
model to /openai/v1/responses when its price-map entry declares
use_openai_responses_path, so a frontier model whose name does not follow the
openai.gpt- convention can be onboarded by JSON alone. The name-convention
check is kept as a fallback that needs no price-map entry, which preserves
zero-change routing for a future gpt-6 before its entry loads. gpt-5.5 / gpt-5.4
get the flag in both price maps. Adds tests for the data-driven flag path and
for the flag presence on the gpt-5.x entries; both branches are mutation-tested.

* test(model_prices): allow use_openai_responses_path in price-map schema

The model_prices_and_context_window.json schema validator
(test_aaamodel_prices_and_context_window_json_is_valid) enforces
additionalProperties: false, so the new use_openai_responses_path flag on the
gpt-5.5 / gpt-5.4 entries failed validation. Add it to the schema as a boolean,
alongside the other supports_* / capability flags.

* Add Tensormesh serverless models to the model cost map (#30037)

* Add Tensormesh serverless models to the model cost map

* Flag reasoning support on the Tensormesh models that expose thinking mode

* fix(proxy): invalidate stale key spend counter after budget reset or manual spend update (#30001)

* fix(proxy): reconcile stale key spend counter after budget reset

* fix(proxy): invalidate stale key spend counter after budget reset or manual spend update

* fix(proxy): remove read-time stale counter reconciliation to prevent budget bypass

* revert: undo unrelated formatting changes in enterprise directory

* test(proxy): add unit test for key spend update invalidating counter

* test(proxy): fix mocked update_data and hash token expectations in unit test

* fix(proxy): use Responses-API transformer in pass-through cost tracking (#29728)

The `elif is_responses:` branch of `openai_passthrough_handler` was
calling the chat-completions `transform_response` on a Responses API
payload. The chat-completions transformer expects `choices: [...]`
in the raw response; the Responses API uses `output: [...]` and
`usage.input_tokens` / `usage.output_tokens` (not
`prompt_tokens` / `completion_tokens`). The result was a
KeyError 'choices' deep inside `convert_to_model_response_object`,
swallowed by the surrounding `except Exception` in the handler, and
the SpendLogs row was written by the fallback path with zeroed-out
tokens, spend, and model.

This bug silently undercounts cost for every successful pass-through
call to either OpenAI's `/v1/responses` or Azure's
`/openai/v1/responses` (deployments configured for the Responses
API). Reproduced 2026-06-04 against a real Azure OpenAI Responses
API deployment proxied through LiteLLM v1.88.0.

Fix: use the dedicated
`OpenAIResponsesAPIConfig.transform_response_api_response` for the
Responses branch. This transformer already exists in LiteLLM
(`litellm/llms/openai/responses/transformation.py`) and knows the
Responses-API on-the-wire shape. `litellm.completion_cost` already
handles `ResponsesAPIResponse` natively with `call_type="responses"`,
so no downstream changes are needed.

Tests:

  test_responses_api_uses_responses_transformer_not_chat_completions
    NEW. Real regression test — exercises the openai_passthrough_handler
    with a real-shaped Responses payload (no `choices`, has `output`
    and Responses-API `usage` keys) and NO mocked `get_provider_config`.
    Pre-fix: raises KeyError 'choices' inside the chat-completions
    transformer (the bug). Post-fix: returns a ResponsesAPIResponse,
    completion_cost is called with call_type="responses" and a
    ResponsesAPIResponse instance (asserted).
    Verified to fail on un-fixed handler + pass on fixed handler
    before commit.

  test_responses_api_cost_tracking
    UPDATED. Old test mocked `get_provider_config` (no longer called
    in the responses branch post-fix). Now mocks the Responses
    transformer directly (`OpenAIResponsesAPIConfig.transform_response_api_response`)
    to test the downstream cost-calc contract.

Out of scope for this PR (separate followup):
  - Recognizing *.cognitiveservices.azure.com (the newer Azure
    OpenAI hostname) in the is_openai_*_route checks. Separate PR.

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>

* fix(skills): execute DB skills by matching the litellm_skill_ tool name prefix (#30116)

Skill IDs are generated as litellm_skill_<uuid> and the model-facing
tool name is the sanitized skill ID, but the post-call execution gates
in SkillsInjectionHook only ran tools whose name starts with "skill_",
so DB skills were silently returned to the client as raw tool calls.

Fixes #28122.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(anthropic): synthesize content_block_start when Responses stream omits output_item.added (#30115)

* fix(team): reserve team budget raises for proxy admins on /team/update (#30030)

The caller's PERSONAL max_budget was the wrong yardstick for /team/update: a
team's spend ceiling has nothing to do with the admin's own key budget. That
comparison was an unintended side effect of reusing _check_user_team_limits()
(which exists for the /team/new path) and broke the UI, which re-sends the
unchanged budget on every save.

New behavior on /team/update for standalone teams:
- A team admin (already authorized via _verify_team_access) may freely KEEP or
  LOWER the team budget, and change models/tpm/rpm, without being gated by their
  personal limits.
- GROWING a team's spend ceiling is a budget-authority action reserved for proxy
  admins -> 403 for team admins. "Growing" covers both raising max_budget above
  the team's current finite value and removing the cap entirely (max_budget=null,
  detected via model_fields_set so an explicit null is distinguished from an
  omitted field). For a team that currently has no cap, setting a finite value is
  a restriction and is allowed.
- Org-scoped teams remain governed by _check_org_team_limits() (capped by the
  org budget).

Also reverts the #29525 existing_team_max_budget workaround in
_check_user_team_limits() back to the create-only form; /team/new still enforces
the creator's personal caps.

docs(access_control): resolve the contradiction in the team-admin section —
team admins can keep/lower the budget and manage rate limits/models, but cannot
raise the team budget (proxy-admin only).

tests: unit + behavior coverage for raise-blocked, cap-removal-blocked (team
admin), raise/removal allowed (proxy admin), uncapped-team restriction allowed,
keep/lower/resend allowed, and unchanged create-path guards.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(ui): data-driven App Router migration E2E smoke (default + server-root-path) (#29974)

* test(ui): add a data-driven App Router migration E2E smoke

Add a growing Playwright smoke for migrated pages: for each segment it deep-links
to the path route, asserts the URL and that the dashboard shell rendered, then
clicks off to a legacy page and asserts navigation still works. Driven by
e2e_tests/fixtures/migratedPages.ts, so adding a page is one line.

Runs in two situations against the same proxy: the default mount (npm run
e2e:migration) and a non-root SERVER_ROOT_PATH mount (npm run e2e:migration:root).
globalSetup now logs in at `${SERVER_ROOT_PATH}/ui/login` so the admin storage
state is valid under a prefix. Seeded with api-reference; append the rest as their
migrations merge.

* test(ui): support headed slow-motion + watch pauses in the migration smoke

Honor SLOWMO in the server-root-path config (the default config already did),
and add an env-gated E2E_WATCH_MS pause so a headed run lingers on each state.
Both are no-ops by default, so CI behavior is unchanged.

* test(ui): make the migration smoke a sidebar-click user journey

Rework the smoke from deep-linking to a real navigation journey: start at the
landing page, click the migrated page in the sidebar (expanding submenus for
nested items), assert the path route rendered, reload it (the check a wrong
server_root_path breaks), bounce to a legacy page and back, and — once two pages
are migrated — navigate directly between two migrated pages. Verifies via URL +
shell render, driven by the same fixture list.

* test(ui): address review on the migration smoke

Escape ROOT and segment before interpolating them into RegExp URL matchers so a
future segment containing regex metacharacters can't silently widen the match.
Make the server-root-path config fail fast when SERVER_ROOT_PATH is unset instead
of silently re-running the default mount and passing without exercising the prefix.

* test(ui): drop unused watch helper and fix stale smoke README

* test(ui): run the migration smoke under a server root path in CI

* test(ui): harden + instrument the server-root-path proxy reboot in CI

* test(ui): run the server-root-path migration smoke as its own CI job

Replace the in-place proxy reboot in e2e_ui_testing with a dedicated
e2e_ui_testing_server_root_path job that boots the proxy once with
SERVER_ROOT_PATH=/litellm, matching how every other proxy variant in the
config gets its own job rather than killing and relaunching the live proxy.

The reboot was failing deterministically: after pkill -9 and relaunch the
prefixed proxy never came back up on :4000 (connection refused), so the smoke
never ran. The readiness step that was supposed to surface the cause could
never reach its boot-log tail because CircleCI runs steps under bash -eo
pipefail and the preceding `curl -sv ... | tail` aborted the step with curl's
exit 7. Booting the proxy as the job's own background step lets any boot crash
land in that step's log instead of being swallowed.

The default e2e_ui_testing job is unchanged aside from dropping the reboot,
prefixed-readiness, and prefixed-smoke steps; the migration smoke still runs at
the root mount there via the default Playwright config.

* fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through (#24232)

* fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through

* test: mock post_call_response_headers_hook in audio speech route tests

* chore(ui): remove dead App Router route stubs under (dashboard) (#30045)

models-and-endpoints, organizations, and virtual-keys each had a page.tsx
route under (dashboard)/ that is not in MIGRATED_PAGES, so the sidebar and
deep links never resolve to it and the route is unreachable. Each was a thin
wrapper that handed the shared view empty or no-op props (empty modelData with
a no-op setModelData, hardcoded empty organizations, no-op
setUserRole/setUserEmail), so reaching one would render a degraded page in any
case. The real wrapper belongs in the PR that flips each page into
MIGRATED_PAGES, written with eyes on it and a test

This continues the dead-scaffolding cleanup from #28891. The shared components
these wrappers rendered (ModelsAndEndpointsView, OrganizationFilters) stay,
since the legacy ?page= switch in app/page.tsx and src/components still import
them

* fix(ui/mcp): reset OAuth state on create-server modal close so a prior server's token no longer leaks into the next add-server session (#30000)

* fix(ui/mcp): reset OAuth hook state on modal close so a prior server's token no longer leaks into the next add-server session

* fix(ui/mcp): clear in-flight OAuth guard on reset and reset form/tools on modal close so nothing leaks on a parent-driven dismiss

* fix(mcp): allow team access-group grants in OAuth authorize/token access check (#30041)

* fix(mcp): honor team access-group grants in OAuth authorize/token access check

* test(mcp): mock build_effective_auth_contexts in non-admin authorize tests for isolation

* docs(security): require a reproduction video for vulnerability reports (#30048) (#30063)

With AI models capable of automated vulnerability discovery now publicly
available, we expect a large increase in report volume, much of it
unverified. Requiring a video of the exploit running against a live
instance raises the bar for submissions and keeps triage focused on
reproducible issues. Reports without a video will be closed and reopened
if one is added later.

Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>

* feat(ui): add admin flag to disable in-product UI nudges for everyone (#29796)

* feat(ui): add admin flag to disable in-product UI nudges for everyone

Admins can now suppress the survey and Claude Code feedback popups for
all users via a single disable_ui_nudges UI setting, instead of relying
on each user dismissing them individually.

* fix(ui): suppress nudges while ui settings are loading

Gate nudgesDisabled on the ui-settings loading state so an admin with
disable_ui_nudges on doesn't see the survey prompt flash, and the
getInProductNudgesCall fetch doesn't fire, on a cold page load before
the flag resolves. Falls back to showing nudges if the fetch errors.

* test(ui): wrap CreateKeyPage test in QueryClientProvider

page.tsx now calls useUISettings (react-query), which needs a
QueryClient that layout.tsx supplies in production but the test did
not. Add the provider and mock getUiSettings so the query resolves.

* chore(ui): remove dead dashboard files and unused dependencies (#30047)

* chore(ui): remove dead dashboard files and unused dependencies

knip flagged seven orphaned source/config files with no importers and
five declared dependencies that nothing in the tree uses. Removing them
shrinks the dashboard bundle's source surface and keeps the manifest
honest; vite stays installed transitively via vitest, so test tooling is
unaffected.

* fix(ci): restore serverRootPath.config.ts referenced by SERVER_ROOT_PATH workflow

The dead-code sweep removed e2e_tests/serverRootPath.config.ts, but its spec
(tests/login/serverRootPathRedirect.spec.ts) and the test_server_root_path.yml
workflow step still depend on it, so the redirect e2e job failed to load a
config that no longer existed.

* fix(proxy): authorize batch files using upload target_model_names (LIT-3593) (#30009)

* fix(proxy): authorize batch files using upload target_model_names (LIT-3593)

After replace_model_in_jsonl, body.model is a stripped provider id. Reverse-mapping it via resolve_model_name_from_model_id is first-match on model_list and caused false 403s when multiple deployments share the same stripped name. Use target_model_names from the unified file id instead.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593)

Restores the reverse-lookup for the JSONL body.model fallback path so that
legacy/pre-target_model_names managed files still map stripped provider IDs
back to proxy aliases before auth. Also cleans up redundant `or None`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593)"

This reverts commit 30d2e96f77ef521ccaaf2193fe554980380eb669.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add Claude Fable 5 across Anthropic, Bedrock, Vertex AI, and Azure AI (#30064)

* Add Claude Fable 5 across Anthropic, Bedrock, Vertex AI, and Azure AI

Adds cost map entries for claude-fable-5 ($10/$50 per MTok, 1M context,
128K output, adaptive thinking only) on the Anthropic API, Bedrock
converse (base, global, and us/eu geo inference profiles at the 10%
regional premium), Vertex AI, and Azure AI (Microsoft Foundry, which
serves Fable 5 with the full 1M context window unlike Opus 4.8).

Registers anthropic.claude-fable-5 in BEDROCK_CONVERSE_MODELS, lists the
model in the setup wizard, and extends the reasoning effort e2e grid.
The Bedrock, Vertex, and Azure grid cells carry fail_reason markers
until the CI accounts are provisioned: Bedrock needs the provider data
sharing opt-in Fable 5 requires, and the Foundry resource needs a
claude-fable-5 deployment.

The first-party entry carries provider_specific_entry {us: 1.1} for the
inference_geo premium and deliberately no fast multiplier since Fable 5
has no fast mode.

https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm

* Drop removed sampling params for Claude 4.7+ when drop_params is set

Fable 5, Opus 4.7, and Opus 4.8 removed sampling params: the API rejects
top_p, top_k, and any temperature other than 1 with a 400. LiteLLM was
forwarding them even with drop_params enabled because the Anthropic and
Bedrock converse transformations passed temperature/top_p through
unconditionally.

Mirror the GPT-5/o-series handling: temperature=1 still passes through,
other values and any top_p are dropped when drop_params is set, and
without drop_params a clean client-side UnsupportedParamsError tells the
caller how to opt in, instead of surfacing the raw provider error.

https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm

* Drive sampling param gating from the cost map and cover top_k

Greptile review follow-ups on the sampling param fix: the restriction for
Fable 5 / Opus 4.7 / 4.8 is now declared as supports_sampling_params: false
on every affected cost map entry (perplexity excluded; that route is
OpenAI-compatible and maps sampling params upstream) and read back through
a tri-state map lookup, keeping the name check only as a fallback for
provider-routed ids whose hosted map entries predate the flag, the same
layering supports_adaptive_thinking uses. top_k bypasses map_openai_params
as a provider-specific kwarg, so it is gated at the shared
AnthropicConfig.transform_request boundary (direct, Bedrock invoke, Vertex,
Azure) and in the Bedrock converse _handle_top_k_value path, with
drop_params threaded through the converse transform helpers.

Also updates the reasoning effort grid cell count assertion for the four
Fable 5 rows added on this branch (29 x 11 cells).

https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm

* Declare supports_sampling_params in the cost map schema

The model map validation schema uses additionalProperties: false, so the
new flag must be declared for the 28 entries that carry it; this was the
one failing job (misc / Run tests) on the previous commit.

https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm

* fix(bedrock): gate top_k=0 on converse to match Anthropic boundary

Truthiness check let top_k=0 silently disappear on models that removed
sampling params, while AnthropicConfig.transform_request treats 0 as
present and raises UnsupportedParamsError (or drops when drop_params is
set). Switch to 'is not None' so converse, direct Anthropic, invoke,
Vertex, and Azure all behave the same for top_k=0.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>

* fix(anthropic): avoid index -1 content_block_delta in messages stream

When a /v1/messages request is routed through the Responses API
adapter, AnthropicResponsesStreamWrapper only emits content_block_start
on response.output_item.added. Some upstreams (LMStudio for example)
never send that event, so the text delta handler fell back to
_current_block_index, which starts at -1, and clients received
content_block_delta events with index -1 and no preceding
content_block_start. Anthropic SDKs then fail with "text part -1 not
found"

The text delta handler now synthesizes a content_block_start with a
fresh block index whenever the delta references an unregistered item_id
or no block is open yet, and registers the item_id so follow-up deltas
reuse the same index

Addresses the /v1/messages defect in #27442

* Make test sys.path shim resolve relative to the file, not the CWD

os.path.abspath("../../../../../../..") depends on where pytest is
invoked from; anchoring on os.path.dirname(__file__) makes the import
work from any working directory. Also corrects the depth: the repo root
is six levels above this file, not seven.

---------

Co-authored-by: milan-berri <milan@berri.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: michelligabriele <gabriele.michelli@icloud.com>
Co-authored-by: tin-berri <tin@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>
Co-authored-by: Sameer Kankute <sameer@berri.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>

* fix: enable compact-2026-01-12 beta header for vertex_ai provider (#30114)

* fix(team): reserve team budget raises for proxy admins on /team/update (#30030)

The caller's PERSONAL max_budget was the wrong yardstick for /team/update: a
team's spend ceiling has nothing to do with the admin's own key budget. That
comparison was an unintended side effect of reusing _check_user_team_limits()
(which exists for the /team/new path) and broke the UI, which re-sends the
unchanged budget on every save.

New behavior on /team/update for standalone teams:
- A team admin (already authorized via _verify_team_access) may freely KEEP or
  LOWER the team budget, and change models/tpm/rpm, without being gated by their
  personal limits.
- GROWING a team's spend ceiling is a budget-authority action reserved for proxy
  admins -> 403 for team admins. "Growing" covers both raising max_budget above
  the team's current finite value and removing the cap entirely (max_budget=null,
  detected via model_fields_set so an explicit null is distinguished from an
  omitted field). For a team that currently has no cap, setting a finite value is
  a restriction and is allowed.
- Org-scoped teams remain governed by _check_org_team_limits() (capped by the
  org budget).

Also reverts the #29525 existing_team_max_budget workaround in
_check_user_team_limits() back to the create-only form; /team/new still enforces
the creator's personal caps.

docs(access_control): resolve the contradiction in the team-admin section —
team admins can keep/lower the budget and manage rate limits/models, but cannot
raise the team budget (proxy-admin only).

tests: unit + behavior coverage for raise-blocked, cap-removal-blocked (team
admin), raise/removal allowed (proxy admin), uncapped-team restriction allowed,
keep/lower/resend allowed, and unchanged create-path guards.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(ui): data-driven App Router migration E2E smoke (default + server-root-path) (#29974)

* test(ui): add a data-driven App Router migration E2E smoke

Add a growing Playwright smoke for migrated pages: for each segment it deep-links
to the path route, asserts the URL and that the dashboard shell rendered, then
clicks off to a legacy page and asserts navigation still works. Driven by
e2e_tests/fixtures/migratedPages.ts, so adding a page is one line.

Runs in two situations against the same proxy: the default mount (npm run
e2e:migration) and a non-root SERVER_ROOT_PATH mount (npm run e2e:migration:root).
globalSetup now logs in at `${SERVER_ROOT_PATH}/ui/login` so the admin storage
state is valid under a prefix. Seeded with api-reference; append the rest as their
migrations merge.

* test(ui): support headed slow-motion + watch pauses in the migration smoke

Honor SLOWMO in the server-root-path config (the default config already did),
and add an env-gated E2E_WATCH_MS pause so a headed run lingers on each state.
Both are no-ops by default, so CI behavior is unchanged.

* test(ui): make the migration smoke a sidebar-click user journey

Rework the smoke from deep-linking to a real navigation journey: start at the
landing page, click the migrated page in the sidebar (expanding submenus for
nested items), assert the path route rendered, reload it (the check a wrong
server_root_path breaks), bounce to a legacy page and back, and — once two pages
are migrated — navigate directly between two migrated pages. Verifies via URL +
shell render, driven by the same fixture list.

* test(ui): address review on the migration smoke

Escape ROOT and segment before interpolating them into RegExp URL matchers so a
future segment containing regex metacharacters can't silently widen the match.
Make the server-root-path config fail fast when SERVER_ROOT_PATH is unset instead
of silently re-running the default mount and passing without exercising the prefix.

* test(ui): drop unused watch helper and fix stale smoke README

* test(ui): run the migration smoke under a server root path in CI

* test(ui): harden + instrument the server-root-path proxy reboot in CI

* test(ui): run the server-root-path migration smoke as its own CI job

Replace the in-place proxy reboot in e2e_ui_testing with a dedicated
e2e_ui_testing_server_root_path job that boots the proxy once with
SERVER_ROOT_PATH=/litellm, matching how every other proxy variant in the
config gets its own job rather than killing and relaunching the live proxy.

The reboot was failing deterministically: after pkill -9 and relaunch the
prefixed proxy never came back up on :4000 (connection refused), so the smoke
never ran. The readiness step that was supposed to surface the cause could
never reach its boot-log tail because CircleCI runs steps under bash -eo
pipefail and the preceding `curl -sv ... | tail` aborted the step with curl's
exit 7. Booting the proxy as the job's own background step lets any boot crash
land in that step's log instead of being swallowed.

The default e2e_ui_testing job is unchanged aside from dropping the reboot,
prefixed-readiness, and prefixed-smoke steps; the migration smoke still runs at
the root mount there via the default Playwright config.

* fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through (#24232)

* fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through

* test: mock post_call_response_headers_hook in audio speech route tests

* chore(ui): remove dead App Router route stubs under (dashboard) (#30045)

models-and-endpoints, organizations, and virtual-keys each had a page.tsx
route under (dashboard)/ that is not in MIGRATED_PAGES, so the sidebar and
deep links never resolve to it and the route is unreachable. Each was a thin
wrapper that handed the shared view empty or no-op props (empty modelData with
a no-op setModelData, hardcoded empty organizations, no-op
setUserRole/setUserEmail), so reaching one would render a degraded page in any
case. The real wrapper belongs in the PR that flips each page into
MIGRATED_PAGES, written with eyes on it and a test

This continues the dead-scaffolding cleanup from #28891. The shared components
these wrappers rendered (ModelsAndEndpointsView, OrganizationFilters) stay,
since the legacy ?page= switch in app/page.tsx and src/components still import
them

* fix(ui/mcp): reset OAuth state on create-server modal close so a prior server's token no longer leaks into the next add-server session (#30000)

* fix(ui/mcp): reset OAuth hook state on modal close so a prior server's token no longer leaks into the next add-server session

* fix(ui/mcp): clear in-flight OAuth guard on reset and reset form/tools on modal close so nothing leaks on a parent-driven dismiss

* fix(mcp): allow team access-group grants in OAuth authorize/token access check (#30041)

* fix(mcp): honor team access-group grants in OAuth authorize/token access check

* test(mcp): mock build_effective_auth_contexts in non-admin authorize tests for isolation

* docs(security): require a reproduction video for vulnerability reports (#30048) (#30063)

With AI models capable of automated vulnerability discovery now publicly
available, we expect a large increase in report volume, much of it
unverified. Requiring a video of the exploit running against a live
instance raises the bar for submissions and keeps triage focused on
reproducible issues. Reports without a video will be closed and reopened
if one is added later.

Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>

* feat(ui): add admin flag to disable in-product UI nudges for everyone (#29796)

* feat(ui): add admin flag to disable in-product UI nudges for everyone

Admins can now suppress the survey and Claude Code feedback popups for
all users via a single disable_ui_nudges UI setting, instead of relying
on each user dismissing them individually.

* fix(ui): suppress nudges while ui settings are loading

Gate nudgesDisabled on the ui-settings loading state so an admin with
disable_ui_nudges on doesn't see the survey prompt flash, and the
getInProductNudgesCall fetch doesn't fire, on a cold page load before
the flag resolves. Falls back to showing nudges if the fetch errors.

* test(ui): wrap CreateKeyPage test in QueryClientProvider

page.tsx now calls useUISettings (react-query), which needs a
QueryClient that layout.tsx supplies in production but the test did
not. Add the provider and mock getUiSettings so the query resolves.

* chore(ui): remove dead dashboard files and unused dependencies (#30047)

* chore(ui): remove dead dashboard files and unused dependencies

knip flagged seven orphaned source/config files with no importers and
five declared dependencies that nothing in the tree uses. Removing them
shrinks the dashboard bundle's source surface and keeps the manifest
honest; vite stays installed transitively via vitest, so test tooling is
unaffected.

* fix(ci): restore serverRootPath.config.ts referenced by SERVER_ROOT_PATH workflow

The dead-code sweep removed e2e_tests/serverRootPath.config.ts, but its spec
(tests/login/serverRootPathRedirect.spec.ts) and the test_server_root_path.yml
workflow step still depend on it, so the redirect e2e job failed to load a
config that no longer existed.

* fix(proxy): authorize batch files using upload target_model_names (LIT-3593) (#30009)

* fix(proxy): authorize batch files using upload target_model_names (LIT-3593)

After replace_model_in_jsonl, body.model is a stripped provider id. Reverse-mapping it via resolve_model_name_from_model_id is first-match on model_list and caused false 403s when multiple deployments share the same stripped name. Use target_model_names from the unified file id instead.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593)

Restores the reverse-lookup for the JSONL body.model fallback path so that
legacy/pre-target_model_names managed files still map stripped provider IDs
back to proxy aliases before auth. Also cleans up redundant `or None`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593)"

This reverts commit 30d2e96f77ef521ccaaf2193fe554980380eb669.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add Claude Fable 5 across Anthropic, Bedrock, Vertex AI, and Azure AI (#30064)

* Add Claude Fable 5 across Anthropic, Bedrock, Vertex AI, and Azure AI

Adds cost map entries for claude-fable-5 ($10/$50 per MTok, 1M context,
128K output, adaptive thinking only) on the Anthropic API, Bedrock
converse (base, global, and us/eu geo inference profiles at the 10%
regional premium), Vertex AI, and Azure AI (Microsoft Foundry, which
serves Fable 5 with the full 1M context window unlike Opus 4.8).

Registers anthropic.claude-fable-5 in BEDROCK_CONVERSE_MODELS, lists the
model in the setup wizard, and extends the reasoning effort e2e grid.
The Bedrock, Vertex, and Azure grid cells carry fail_reason markers
until the CI accounts are provisioned: Bedrock needs the provider data
sharing opt-in Fable 5 requires, and the Foundry resource needs a
claude-fable-5 deployment.

The first-party entry carries provider_specific_entry {us: 1.1} for the
inference_geo premium and deliberately no fast multiplier since Fable 5
has no fast mode.

https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm

* Drop removed sampling params for Claude 4.7+ when drop_params is set

Fable 5, Opus 4.7, and Opus 4.8 removed sampling params: the API rejects
top_p, top_k, and any temperature other than 1 with a 400. LiteLLM was
forwarding them even with drop_params enabled because the Anthropic and
Bedrock converse transformations passed temperature/top_p through
unconditionally.

Mirror the GPT-5/o-series handling: temperature=1 still passes through,
other values and any top_p are dropped when drop_params is set, and
without drop_params a clean client-side UnsupportedParamsError tells the
caller how to opt in, instead of surfacing the raw provider error.

https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm

* Drive sampling param gating from the cost map and cover top_k

Greptile review follow-ups on the sampling param fix: the restriction for
Fable 5 / Opus 4.7 / 4.8 is now declared as supports_sampling_params: false
on every affected cost map entry (perplexity excluded; that route is
OpenAI-compatible and maps sampling params upstream) and read back through
a tri-state map lookup, keeping the name check only as a fallback for
provider-routed ids whose hosted map entries predate the flag, the same
layering supports_adaptive_thinking uses. top_k bypasses map_openai_params
as a provider-specific kwarg, so it is gated at the shared
AnthropicConfig.transform_request boundary (direct, Bedrock invoke, Vertex,
Azure) and in the Bedrock converse _handle_top_k_value path, with
drop_params threaded through the converse transform helpers.

Also updates the reasoning effort grid cell count assertion for the four
Fable 5 rows added on this branch (29 x 11 cells).

https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm

* Declare supports_sampling_params in the cost map schema

The model map validation schema uses additionalProperties: false, so the
new flag must be declared for the 28 entries that carry it; this was the
one failing job (misc / Run tests) on the previous commit.

https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm

* fix(bedrock): gate top_k=0 on converse to match Anthropic boundary

Truthiness check let top_k=0 silently disappear on models that removed
sampling params, while AnthropicConfig.transform_request treats 0 as
present and raises UnsupportedParamsError (or drops when drop_params is
set). Switch to 'is not None' so converse, direct Anthropic, invoke,
Vertex, and Azure all behave the same for top_k=0.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>

* fix: enable compact-2026-01-12 beta header for vertex_ai provider

The vertex_ai block in anthropic_beta_headers_config.json mapped
compact-2026-01-12 to null, so update_headers_with_filtered_beta
stripped the header before the request reached Vertex while the
compact_20260112 context edit stayed in the body, and Vertex rejected
the request with HTTP 400. Vertex rawPredict accepts the header, and
the bedrock and databricks blocks already forward it. Mirrors #21867,
which enabled context-1m-2025-08-07 for vertex_ai the same way.

Fixes #27290.

---------

Co-authored-by: milan-berri <milan@berri.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: michelligabriele <gabriele.michelli@icloud.com>
Co-authored-by: tin-berri <tin@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>
Co-authored-by: Sameer Kankute <sameer@berri.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>

* fix(proxy): coerce litellm_settings.max_budget env var to float (#30113)

* fix(team): reserve team budget raises for proxy admins on /team/update (#30030)

The caller's PERSONAL max_budget was the wrong yardstick for /team/update: a
team's spend ceiling has nothing to do with the admin's own key budget. That
comparison was an unintended side effect of reusing _check_user_team_limits()
(which exists for the /team/new path) and broke the UI, which re-sends the
unchanged budget on every save.

New behavior on /team/update for standalone teams:
- A team admin (already authorized via _verify_team_access) may freely KEEP or
  LOWER the team budget, and change models/tpm/rpm, without being gated by their
  personal limits.
- GROWING a team's spend ceiling is a budget-authority action reserved for proxy
  admins -> 403 for team admins. "Growing" covers both raising max_budget above
  the team's current finite value and removing the cap entirely (max_budget=null,
  detected via model_fields_set so an explicit null is distinguished from an
  omitted field). For a team that currently has no cap, setting a finite value is
  a restriction and is allowed.
- Org-scoped teams remain governed by _check_org_team_limits() (capped by the
  org budget).

Also reverts the #29525 existing_team_max_budget workaround in
_check_user_team_limits() back to the create-only form; /team/new still enforces
the creator's personal caps.

docs(access_control): resolve the contradiction in the team-admin section —
team admins can keep/lower the budget and manage rate limits/models, but cannot
raise the team budget (proxy-admin only).

tests: unit + behavior coverage for raise-blocked, cap-removal-blocked (team
admin), raise/removal allowed (proxy admin), uncapped-team restriction allowed,
keep/lower/resend allowed, and unchanged create-path guards.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(ui): data-driven App Router migration E2E smoke (default + server-root-path) (#29974)

* test(ui): add a data-driven App Router migration E2E smoke

Add a growing Playwright smoke for migrated pages: for each segment it deep-links
to the path route, asserts the URL and that the dashboard shell rendered, then
clicks off to a legacy page and asserts navigation still works. Driven by
e2e_tests/fixtures/migratedPages.ts, so adding a page is one line.

Runs in two situations against the same proxy: the default mount (npm run
e2e:migration) and a non-root SERVER_ROOT_PATH mount (npm run e2e:migration:root).
globalSetup now logs in at `${SERVER_ROOT_PATH}/ui/login` so the admin storage
state is valid under a prefix. Seeded with api-reference; append the rest as their
migrations merge.

* test(ui): support headed slow-motion + watch pauses in the migration smoke

Honor SLOWMO in the server-root-path config (the default config already did),
and add an env-gated E2E_WATCH_MS pause so a headed run lingers on each state.
Both are no-ops by default, so CI behavior is unchanged.

* test(ui): make the migration smoke a sidebar-click user journey

Rework the smoke from deep-linking to a real navigation journey: start at the
landing page, click the migrated page in the sidebar (expanding submenus for
nested items), assert the path route rendered, reload it (the check a wrong
server_root_path breaks), bounce to a legacy page and back, and — once two pages
are migrated — navigate directly between two migrated pages. Verifies via URL +
shell render, driven by the same fixture list.

* test(ui): address review on the migration smoke

Escape ROOT and segment before interpolating them into RegExp URL matchers so a
future segment containing regex metacharacters can't silently widen the match.
Make the server-root-path config fail fast when SERVER_ROOT_PATH is unset instead
of silently re-running the default mount and passing without exercising the prefix.

* test(ui): drop unused watch helper and fix stale smoke README

* test(ui): run the migration smoke under a server root path in CI

* test(ui): harden + instrument the server-root-path proxy reboot in CI

* test(ui): run the server-root-path migration smoke as its own CI job

Replace the in-place proxy reboot in e2e_ui_testing with a dedicated
e2e_ui_testing_server_root_path job that boots the proxy once with
SERVER_ROOT_PATH=/litellm, matching how every other proxy variant in the
config gets its own job rather than killing and relaunching the live proxy.

The reboot was failing deterministically: after pkill -9 and relaunch the
prefixed proxy never came back up on :4000 (connection refused), so the smoke
never ran. The readiness step that was supposed to surface the cause could
never reach its boot-log tail because CircleCI runs steps under bash -eo
pipefail and the preceding `curl -sv ... | tail` aborted the step with curl's
exit 7. Booting the proxy as the job's own background step lets any boot crash
land in that step's log instead of being swallowed.

The default e2e_ui_testing job is unchanged aside from dropping the reboot,
prefixed-readiness, and prefixed-smoke steps; the migration smoke still runs at
the root mount there via the default Playwright config.

* fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through (#24232)

* fix(proxy): extend response headers hook to streaming, TTS, image gen, and pass-through

* test: mock post_call_response_headers_hook in audio speech route tests

* chore(ui): remove dead App Router route stubs under (dashboard) (#30045)

models-and-endpoints, organizations, and virtual-keys each had a page.tsx
route under (dashboard)/ that is not in MIGRATED_PAGES, so the sidebar and
deep links never resolve to it and the route is unreachable. Each was a thin
wrapper that handed the shared view empty or no-op props (empty modelData with
a no-op setModelData, hardcoded empty organizations, no-op
setUserRole/setUserEmail), so reaching one would render a degraded page in any
case. The real wrapper belongs in the PR that flips each page into
MIGRATED_PAGES, written with eyes on it and a test

This continues the dead-scaffolding cleanup from #28891. The shared components
these wrappers rendered (ModelsAndEndpointsView, OrganizationFilters) stay,
since the legacy ?page= switch in app/page.tsx and src/components still import
them

* fix(ui/mcp): reset OAuth state on create-server modal close so a prior server's token no longer leaks into the next add-server session (#30000)

* fix(ui/mcp): reset OAuth hook state on modal close so a prior server's token no longer leaks into the next add-server session

* fix(ui/mcp): clear in-flight OAuth guard on reset and reset form/tools on modal close so nothing leaks on a parent-driven dismiss

* fix(mcp): allow team access-group grants in OAuth authorize/token access check (#30041)

* fix(mcp): honor team access-group grants in OAuth authorize/token access check

* test(mcp): mock build_effective_auth_contexts in non-admin authorize tests for isolation

* docs(security): require a reproduction video for vulnerability reports (#30048) (#30063)

With AI models capable of automated vulnerability discovery now publicly
available, we expect a large increase in report volume, much of it
unverified. Requiring a video of the exploit running against a live
instance raises the bar for submissions and keeps triage focused on
reproducible issues. Reports without a video will be closed and reopened
if one is added later.

Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>

* feat(ui): add admin flag to disable in-product UI nudges for everyone (#29796)

* feat(ui): add admin flag to disable in-product UI nudges for everyone

Admins can now suppress the survey and Claude Code feedback popups for
all users via a single disable_ui_nudges UI setting, instead of relying
on each user dismissing them individually.

* fix(ui): suppress nudges while ui settings are loading

Gate nudgesDisabled on the ui-settings loading state so an admin with
disable_ui_nudges on doesn't see the survey prompt flash, and the
getInProductNudgesCall fetch doesn't fire, on a cold page load before
the flag resolves. Falls back to showing nudges if the fetch errors.

* test(ui): wrap CreateKeyPage test in QueryClientProvider

page.tsx now calls useUISettings (react-query), which needs a
QueryClient that layout.tsx supplies in production but the test did
not. Add the provider and mock getUiSettings so the query resolves.

* chore(ui): remove dead dashboard files and unused dependencies (#30047)

* chore(ui): remove dead dashboard files and unused dependencies

knip flagged seven orphaned source/config files with no importers and
five declared dependencies that nothing in the tree uses. Removing them
shrinks the dashboard bundle's source surface and keeps the manifest
honest; vite stays installed transitively via vitest, so test tooling is
unaffected.

* fix(ci): restore serverRootPath.config.ts referenced by SERVER_ROOT_PATH workflow

The dead-code sweep removed e2e_tests/serverRootPath.config.ts, but its spec
(tests/login/serverRootPathRedirect.spec.ts) and the test_server_root_path.yml
workflow step still depend on it, so the redirect e2e job failed to load a
config that no longer existed.

* fix(proxy): authorize batch files using upload target_model_names (LIT-3593) (#30009)

* fix(proxy): authorize batch files using upload target_model_names (LIT-3593)

After replace_model_in_jsonl, body.model is a stripped provider id. Reverse-mapping it via resolve_model_name_from_model_id is first-match on model_list and caused false 403s when multiple deployments share the same stripped name. Use target_model_names from the unified file id instead.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593)

Restores the reverse-lookup for the JSONL body.model fallback path so that
legacy/pre-target_model_names managed files still map stripped provider IDs
back to proxy aliases before auth. Also cleans up redundant `or None`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(proxy): restore resolve_model_name_from_model_id for JSONL fallback path (LIT-3593)"

This reverts commit 30d2e96f77ef521ccaaf2193fe554980380eb669.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Add Claude Fable 5 across Anthropic, Bedrock, Vertex AI, and Azure AI (#30064)

* Add Claude Fable 5 across Anthropic, Bedrock, Vertex AI, and Azure AI

Adds cost map entries for claude-fable-5 ($10/$50 per MTok, 1M context,
128K output, adaptive thinking only) on the Anthropic API, Bedrock
converse (base, global, and us/eu geo inference profiles at the 10%
regional premium), Vertex AI, and Azure AI (Microsoft Foundry, which
serves Fable 5 with the full 1M context window unlike Opus 4.8).

Registers anthropic.claude-fable-5 in BEDROCK_CONVERSE_MODELS, lists the
model in the setup wizard, and extends the reasoning effort e2e grid.
The Bedrock, Vertex, and Azure grid cells carry fail_reason markers
until the CI accounts are provisioned: Bedrock needs the provider data
sharing opt-in Fable 5 requires, and the Foundry resource needs a
claude-fable-5 deployment.

The first-party entry carries provider_specific_entry {us: 1.1} for the
inference_geo premium and deliberately no fast multiplier since Fable 5
has no fast mode.

https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm

* Drop removed sampling params for Claude 4.7+ when drop_params is set

Fable 5, Opus 4.7, and Opus 4.8 removed sampling params: the API rejects
top_p, top_k, and any temperature other than 1 with a 400. LiteLLM was
forwarding them even with drop_params enabled because the Anthropic and
Bedrock converse transformations passed temperature/top_p through
unconditionally.

Mirror the GPT-5/o-series handling: temperature=1 still passes through,
other values and any top_p are dropped when drop_params is set, and
without drop_params a clean client-side UnsupportedParamsError tells the
caller how to opt in, instead of surfacing the raw provider error.

https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm

* Drive sampling param gating from the cost map and cover top_k

Greptile review follow-ups on the sampling param fix: the restriction for
Fable 5 / Opus 4.7 / 4.8 is now declared as supports_sampling_params: false
on every affected cost map entry (perplexity excluded; that route is
OpenAI-compatible and maps sampling params upstream) and read back through
a tri-state map lookup, keeping the name check only as a fallback for
provider-routed ids whose hosted map entries predate the flag, the same
layering supports_adaptive_thinking uses. top_k bypasses map_openai_params
as a provider-specific kwarg, so it is gated at the shared
AnthropicConfig.transform_request boundary (direct, Bedrock invoke, Vertex,
Azure) and in the Bedrock converse _handle_top_k_value path, with
drop_params threaded through the converse transform helpers.

Also updates the reasoning effort grid cell count assertion for the four
Fable 5 rows added on this branch (29 x 11 cells).

https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm

* Declare supports_sampling_params in the cost map schema

The model map validation schema uses additionalProperties: false, so the
new flag must be declared for the 28 entries that carry it; this was the
one failing job (misc / Run tests) on the previous commit.

https://claude.ai/code/session_01MZarYYT3aS7DxaNjoax6Gm

* fix(bedrock): gate top_k=0 on converse to match Anthropic boundary

Truthiness check let top_k=0 silently disappear on models that removed
sampling params, while AnthropicConfig.transform_request treats 0 as
present and raises UnsupportedParamsError (or drops when drop_params is
set). Switch to 'is not None' so converse, direct Anthropic, invoke,
Vertex, and Azure all behave the same for top_k=0.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>

* fix(proxy): coerce litellm_settings.max_budget env var to float

When max_budget is set in litellm_settings via os.environ/MAX_BUDGET,
the env var resolves to a string and the generic setattr branch in
ProxyConfig.load_config stored it as-is, so the startup check
litellm.max_budget > 0 raised TypeError. The earlier fix (#23855) only
covered the CLI initialize() path. Coerce the value to float in the
settings loop, matching the existing max_internal_user_budget handling.

Fixes #26696.

---------

Co-authored-by: milan-berri <milan@berri.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: michelligabriele <gabriele.michelli@icloud.com>
Co-authored-by: tin-berri <tin@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>
Co-authored-by: Sameer Kankute <sameer@berri.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>

* fix(router): don't drop bedrock pass-through deployments using IAM credentials (#30111)

* Fix Bedrock passthrough deployment dropped when using IAM credentials

Bedrock deployments with use_in_pass_through enabled and IAM/OIDC auth
(aws_role_name, no api_key) hit the generic pass-through branch in
Router._initialize_deployment_for_pass_through, which calls
set_pass_through_credentials and raises "api_key is required". The
exception drops the deployment from the router entirely, breaking both
passthrough and normal routing for that model.

Skip the credential store write when no api_key is set; the bedrock
passthrough route resolves AWS credentials at request time via
BedrockConverseLLM.get_credentials(), not the passthrough credential
store, so there is nothing to register here.

Fixes #27728.

* Reset passthrough credentials singleton before api_key credential test

The test reads the module-level passthrough_endpoint_router singleton,
so a stale "openai" entry written by an earlier test in the same
process could make the assertion pass without exercising the code path.
Clearing the credentials dict up front makes the test order-independent.

* fix(sdk): stop mirroring reasoning_content in provider_specific_fields (#30110)

The dict-to-response conversion path mirrored reasoning_content into
provider_specific_fields, while live provider transforms (Anthropic's
_build_provider_specific_fields) only set it top-level on the Message.
Cache-replayed messages therefore serialized differently from live
ones, breaking disk cache key stability for multi-turn conversations
with extended thinking.

The mirror was added for DeepSeek before Message.reasoning_content
existed as a top-level attribute. The top-level field is still set by
the converter, so DeepSeek's request-side promotion is unaffected.

Fixes #27337.

* fix(mcp): coerce mcp_server_cost_info values to float at ingest (#30109)

* fix(mcp): coerce mcp_server_cost_info values to float at ingest

YAML 1.1 parses scientific notation without a decimal point
(e.g. 7e-05) as a string, and MCPServerCostInfo is a TypedDict with no
runtime validation, so a string-typed default_cost_per_query from
config.yaml flowed through the proxy untouched and crashed the MCP
server settings page with '.toFixed is not a function'. Normalize
mcp_server_cost_info on both the config and DB load paths, dropping
non-numeric values with a warning instead of failing the server load.

Fixes #27097.

* fix(mcp): drop non-numeric default_cost_per_query instead of nulling it

Keeping the key with a None value still exposes a null to the UI,
which can crash .toFixed formatting when the consumer checks key
existence rather than truthiness. Delete the key on coercion failure,
matching how non-numeric per-tool cost entries are already omitted.

* fix(proxy): count embedding and text completion tokens toward TPM limits (#30105)

* fix(proxy): count embedding and text completion tokens toward TPM limits

The parallel request limiters only read token usage off ModelResponse,
so EmbeddingResponse and TextCompletionResponse objects left
total_tokens at 0 and the per key, user, team, and end user TPM
counters never incremented. Requests to /v1/embeddings and
/v1/completions were effectively free against any tpm_limit. In the v3
limiter this was worse: the post-call reconciliation computed actual
usage as 0 and refunded the pre-call reservation made at request time.

Broaden the isinstance checks to accept EmbeddingResponse and
TextCompletionResponse, which both expose a Usage object, at the four
per-scope sites in parallel_request_limiter.py and at the usage
extraction in parallel_request_limiter_v3.py. ResponsesAPIResponse was
already covered in v3 via BaseLiteLLMOpenAIResponseObject.

Fixes #27738.

* test(proxy): cover v1 limiter TPM counting for embedding and text completion responses

Exercise the broadened isinstance sites in parallel_request_limiter.py
by asserting that async_log_success_event adds total_tokens to the per
key, user, team, and end user TPM counters for EmbeddingResponse and
TextCompletionResponse objects. The counters are pre-seeded at zero so
the assertion is exactly the increment; on the pre-fix code these
responses left total_tokens at 0 and the test fails.

* fix(openai): forward client headers on the text completion path (#30103)

* fix(openai): forward client headers on the text completion path

litellm.completion() merges caller headers with extra_headers, but the
text-completion-openai branch never passed the merged dict to
openai_text_completions.completion(), and the handler only used its
headers argument for logging. Pass the merged headers through the call
site and set them as extra_headers on the outgoing request, mirroring
the chat completion handler, so x-* client headers forwarded by the
proxy reach the provider on /v1/completions.

Fixes #27410.

* Drop redundant extra_headers assignment and fix test module collision

completion() merges extra_headers into headers before the
text-completion-openai branch, and the handler now sets the merged
headers as extra_headers on the request, so the branch-local
optional_params["extra_headers"] assignment was a dead duplicate.
Removing it keeps the assignment in one place while both entry paths
(litellm.text_completion and direct handler callers) still forward
headers; a new regression test pins the extra_headers kwarg path.

Also rename the test module to test_completion_handler.py since its
basename collided with tests/test_litellm/llms/bedrock/batches/
test_handler.py and broke pytest collection.

* fix(bedrock): route Anthropic-shape count_tokens to InvokeModel and base64-encode the body (#30102)

* fix(bedrock): route Anthropic-shape count_tokens to InvokeModel

POST /v1/messages/count_tokens with Anthropic content blocks
({"type": "text"|"tool_use"|...}) was routed to the Converse input of
the Bedrock CountTokens API. The Converse transform copies list content
through verbatim, so Bedrock rejected the request with a 400 and the
caller silently fell back to the local tokenizer, returning counts that
can be off by ~50% on tool-heavy payloads.

_detect_input_type now routes messages whose content blocks carry a
"type" key (Anthropic shape) to the invokeModel input, which forwards
the body verbatim. The invokeModel body is now base64-encoded as the
CountTokens API requires (InvokeModelTokensRequest.body is a
base64-encoded blob), and Anthropic Messages bodies get the
anthropic_version and max_tokens fields Bedrock validates against.

Fixes #27632.

* refactor(bedrock): name the CountTokens max_tokens placeholder

Replace the magic 1024 with a module-level
DEFAULT_ANTHROPIC_INVOKE_MODEL_MAX_TOKENS constant so the intent is
explicit and there is a single place to update if Bedrock's InvokeModel
schema ever changes. Module-local rather than litellm/constants.py
because the value is only a schema-validation placeholder for token
counting, not a user-tunable generation default.

* Add above-512k pricing tier for MiniMax-M3 and correct its base rates (#30095)

* Add above-512k pricing tier support for MiniMax-M3

MiniMax-M3 doubles its per-token rates once a prompt exceeds 512k
input tokens. The tiered cost parser already handles arbitrary
thresholds, but get_model_info only copies whitelisted keys from
ModelInfoBase, which had no 512k variants, so above_512k keys were
silently dropped and long-context requests were priced at the flat
rate.

Add the input, output, and cache-read above_512k_tokens fields to
ModelInfoBase and pass them through in get_model_info. Update the
minimax/MiniMax-M3 entry with the tiered rates and correct the base
rates, which matched the above-512k tier instead of the published
base tier (https://platform.minimax.io/docs/guides/pricing-paygo).

Fixes #29663.

* Add above-512k keys to pricing schema, set MiniMax-M3 context to 1M

Register the three new above_512k_tokens cost keys in the INTENDED_SCHEMA
of test_aaamodel_prices_and_context_window_json_is_valid, declared the same
way as the existing above_200k/above_272k tier keys, so the schema check
accepts the MiniMax-M3 tiered pricing entry.

Also raise MiniMax-M3 max_input_tokens from 512000 to 1000000 in both
pricing JSONs. The MiniMax API docs
(https://platform.minimax.io/docs/guides/text-generation) state the model
supports a 1,000,000-token context window, and the pay-as-you-go pricing
page (https://platform.minimax.io/docs/guides/pricing-paygo) prices input
above 512k tokens, which only makes sense if inputs beyond 512k are
accepted. This makes the above-512k pricing tier reachable.

* fix(bedrock): make document names unique across conversation turns (#30093)

* fix(bedrock): make document names unique across conversation turns

PR #16275 derived Bedrock document names purely from a content hash so
that names stay deterministic for prompt caching. When the same PDF or
document appears in more than one conversation turn, every occurrence
gets the identical name and Bedrock rejects the request with "Messages
can not contain duplicate document names".

Add _rename_duplicate_bedrock_document_names, a post-pass over the
assembled message blocks that keeps the first occurrence's hash-based
name and appends a positional suffix (_2, _3, ...) to later
occurrences. Apply it in both _bedrock_converse_messages_pt and
_bedrock_converse_messages_pt_async. Names remain deterministic across
requests and the first occurrence is unchanged, so prompt cache
prefixes stay stable.

Fixes #29418.

* fix(bedrock): avoid suffix collisions with organic document names

A renamed duplicate could collide with a document whose hash-derived
name already ends in the same positional suffix (e.g. an organic
report_2 next to two documents named report). Collect every document
name up front and bump the suffix until the candidate is unused, so
renames can collide neither with organic names nor with each other.

* fix(_types): remove ResponsesAPIResponse from PassThroughEndpointLoggingResultValues

The import of ResponsesAPIResponse was removed from the file but a usage
was left in the Union type, causing a NameError on import and breaking
all CI tests. Remove the stale reference to match the cleanup intent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(_types): restore ResponsesAPIResponse import and add use_xai_oauth to filter list

Two related fixes:
1. Re-add ResponsesAPIResponse import in _types.py — it was removed but still
   needed in PassThroughEndpointLoggingResultValues (used in
   openai_passthrough_logging_handler.py).
2. Add use_xai_oauth to all_litellm_params so it is filtered before forwarding
   kwargs to providers like OpenAI that do not recognize it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Hari <kancharla.ha@northeastern.edu>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Ceder Dens <ceder.dens@uantwerpen.be>
Co-authored-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: 冯基魁 <56265583+fengjikui@users.noreply.github.com>
Co-authored-by: victoruce <161634297+victoruce@users.noreply.github.com>
Co-authored-by: kejunleng <33445544+silencedoctor@users.noreply.github.com>
Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Tyson Cung <45380903+tysoncung@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Jeremy Chapeau <113923302+jychp@users.noreply.github.com>
Co-authored-by: Daan <255322319+daanhendrio@users.noreply.github.com>
Co-authored-by: Avani Prajapati <143805019+Avani-prajapati@users.noreply.github.com>
Co-authored-by: Kent <72616338+kingdoooo@users.noreply.github.com>
Co-authored-by: daitran-tensormesh <dai@tensormesh.ai>
Co-authored-by: Dimitris Spachos <dspachos@gmail.com>
Co-authored-by: Liam Scott <liam@uilliam.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>
Co-authored-by: milan-berri <milan@berri.ai>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: michelligabriele <gabriele.michelli@icloud.com>
Co-authored-by: tin-berri <tin@berri.ai>
Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>
Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>
2026-06-10 10:34:07 -07:00

2375 lines
91 KiB
Python

import copy
import datetime
from typing import AsyncGenerator
from unittest.mock import AsyncMock, MagicMock, patch
import httpx
import pytest
from fastapi import HTTPException, Request, Response, status
from fastapi.responses import JSONResponse, StreamingResponse
import litellm
from litellm._uuid import uuid
from litellm.integrations.custom_logger import CustomLogger
from litellm.integrations.opentelemetry import UserAPIKeyAuth
from litellm.proxy.common_request_processing import (
ProxyBaseLLMRequestProcessing,
ProxyConfig,
_extract_error_from_sse_chunk,
_get_cost_breakdown_from_logging_obj,
_has_attribute_error_in_chain,
_is_azure_model_router_request,
_override_openai_response_model,
_parse_event_data_for_error,
create_response,
)
from litellm.proxy.dd_span_tagger import DDSpanTagger
from litellm.proxy.utils import ProxyLogging
class TestProxyBaseLLMRequestProcessing:
@pytest.mark.asyncio
async def test_base_passthrough_process_llm_request_preserves_litellm_headers_for_non_streaming_response(
self, monkeypatch
):
processing_obj = ProxyBaseLLMRequestProcessing(data={})
async def fake_base_process_llm_request(**kwargs):
passthrough_response = kwargs["fastapi_response"]
passthrough_response.headers["x-litellm-call-id"] = "test-call-id"
passthrough_response.headers["x-litellm-version"] = "test-version"
return httpx.Response(
status_code=200,
content=b'{"ok":true}',
headers={
"content-type": "application/json",
"x-amzn-requestid": "bedrock-request-id",
},
)
monkeypatch.setattr(
processing_obj,
"base_process_llm_request",
fake_base_process_llm_request,
)
result = await processing_obj.base_passthrough_process_llm_request(
request=MagicMock(spec=Request),
fastapi_response=Response(),
user_api_key_dict=MagicMock(spec=UserAPIKeyAuth),
proxy_logging_obj=MagicMock(spec=ProxyLogging),
general_settings={},
proxy_config=MagicMock(spec=ProxyConfig),
select_data_generator=MagicMock(),
model="bedrock-test-model",
)
assert result.status_code == 200
assert result.body == b'{"ok":true}'
assert result.headers["x-amzn-requestid"] == "bedrock-request-id"
assert result.headers["x-litellm-call-id"] == "test-call-id"
assert result.headers["x-litellm-version"] == "test-version"
@pytest.mark.asyncio
async def test_common_processing_pre_call_logic_pre_call_hook_receives_litellm_call_id(
self, monkeypatch
):
processing_obj = ProxyBaseLLMRequestProcessing(data={})
mock_request = MagicMock(spec=Request)
mock_request.headers = {}
async def mock_add_litellm_data_to_request(*args, **kwargs):
return {}
async def mock_common_processing_pre_call_logic(
user_api_key_dict, data, call_type
):
data_copy = copy.deepcopy(data)
return data_copy
mock_proxy_logging_obj = MagicMock(spec=ProxyLogging)
mock_proxy_logging_obj.pre_call_hook = AsyncMock(
side_effect=mock_common_processing_pre_call_logic
)
monkeypatch.setattr(
litellm.proxy.common_request_processing,
"add_litellm_data_to_request",
mock_add_litellm_data_to_request,
)
mock_general_settings = {}
mock_user_api_key_dict = MagicMock(spec=UserAPIKeyAuth)
mock_proxy_config = MagicMock(spec=ProxyConfig)
route_type = "acompletion"
# Call the actual method.
(
returned_data,
logging_obj,
) = await processing_obj.common_processing_pre_call_logic(
request=mock_request,
general_settings=mock_general_settings,
user_api_key_dict=mock_user_api_key_dict,
proxy_logging_obj=mock_proxy_logging_obj,
proxy_config=mock_proxy_config,
route_type=route_type,
)
mock_proxy_logging_obj.pre_call_hook.assert_called_once()
_, call_kwargs = mock_proxy_logging_obj.pre_call_hook.call_args
data_passed = call_kwargs.get("data", {})
assert "litellm_call_id" in data_passed
try:
uuid.UUID(data_passed["litellm_call_id"])
except ValueError:
pytest.fail("litellm_call_id is not a valid UUID")
assert data_passed["litellm_call_id"] == returned_data["litellm_call_id"]
def test_add_dd_apm_tags_for_litellm_call_id_uses_dd_tracing_helper(
self, monkeypatch
):
mock_set_active_span_tag = MagicMock(return_value=True)
import litellm.proxy.dd_span_tagger
monkeypatch.setattr(
litellm.proxy.dd_span_tagger,
"set_active_span_tag",
mock_set_active_span_tag,
)
DDSpanTagger.tag_call_id("test-call-id")
mock_set_active_span_tag.assert_called_once_with(
"litellm.call_id", "test-call-id"
)
@pytest.mark.asyncio
async def test_should_apply_hierarchical_router_settings_as_override(
self, monkeypatch
):
"""
Test that hierarchical router settings are stored as router_settings_override
instead of creating a full user_config with model_list.
This approach avoids expensive per-request Router instantiation by passing
settings as kwargs overrides to the main router.
"""
processing_obj = ProxyBaseLLMRequestProcessing(data={})
mock_request = MagicMock(spec=Request)
mock_request.headers = {}
async def mock_add_litellm_data_to_request(*args, **kwargs):
return {}
async def mock_common_processing_pre_call_logic(
user_api_key_dict, data, call_type
):
data_copy = copy.deepcopy(data)
return data_copy
mock_proxy_logging_obj = MagicMock(spec=ProxyLogging)
mock_proxy_logging_obj.pre_call_hook = AsyncMock(
side_effect=mock_common_processing_pre_call_logic
)
monkeypatch.setattr(
litellm.proxy.common_request_processing,
"add_litellm_data_to_request",
mock_add_litellm_data_to_request,
)
mock_general_settings = {}
mock_user_api_key_dict = MagicMock(spec=UserAPIKeyAuth)
mock_proxy_config = MagicMock(spec=ProxyConfig)
mock_router_settings = {
"routing_strategy": "least-busy",
"timeout": 30.0,
"num_retries": 3,
}
mock_proxy_config._get_hierarchical_router_settings = AsyncMock(
return_value=mock_router_settings
)
mock_llm_router = MagicMock()
mock_prisma_client = MagicMock()
monkeypatch.setattr(
"litellm.proxy.proxy_server.prisma_client",
mock_prisma_client,
)
route_type = "acompletion"
(
returned_data,
logging_obj,
) = await processing_obj.common_processing_pre_call_logic(
request=mock_request,
general_settings=mock_general_settings,
user_api_key_dict=mock_user_api_key_dict,
proxy_logging_obj=mock_proxy_logging_obj,
proxy_config=mock_proxy_config,
route_type=route_type,
llm_router=mock_llm_router,
)
mock_proxy_config._get_hierarchical_router_settings.assert_called_once_with(
user_api_key_dict=mock_user_api_key_dict,
prisma_client=mock_prisma_client,
proxy_logging_obj=mock_proxy_logging_obj,
)
# get_model_list should NOT be called - we no longer copy model list for per-request routers
mock_llm_router.get_model_list.assert_not_called()
# Settings should be stored as router_settings_override (not user_config)
# This allows passing them as kwargs to the main router instead of creating a new one
assert "router_settings_override" in returned_data
assert "user_config" not in returned_data
router_settings_override = returned_data["router_settings_override"]
assert router_settings_override["routing_strategy"] == "least-busy"
assert router_settings_override["timeout"] == 30.0
assert router_settings_override["num_retries"] == 3
# model_list should NOT be in the override settings
assert "model_list" not in router_settings_override
@pytest.mark.asyncio
async def test_stream_timeout_header_processing(self):
"""
Test that x-litellm-stream-timeout header gets processed and added to request data as stream_timeout.
"""
from litellm.proxy.litellm_pre_call_utils import LiteLLMProxyRequestSetup
# Test with stream timeout header
headers_with_timeout = {"x-litellm-stream-timeout": "30.5"}
result = LiteLLMProxyRequestSetup._get_stream_timeout_from_request(
headers_with_timeout
)
assert result == 30.5
# Test without stream timeout header
headers_without_timeout = {}
result = LiteLLMProxyRequestSetup._get_stream_timeout_from_request(
headers_without_timeout
)
assert result is None
# Test with invalid header value (should raise ValueError when converting to float)
headers_with_invalid = {"x-litellm-stream-timeout": "invalid"}
with pytest.raises(ValueError):
LiteLLMProxyRequestSetup._get_stream_timeout_from_request(
headers_with_invalid
)
@pytest.mark.asyncio
async def test_build_litellm_proxy_success_headers_from_llm_response(self):
"""
Google native :generateContent uses this helper instead of base_process_llm_request;
ensure x-litellm-* headers and callback hooks merge like the main proxy path.
"""
mock_request = MagicMock(spec=Request)
mock_request.headers = {}
class _FakeGenaiResponse:
_hidden_params = {
"model_id": "deployment-model-id",
"cache_key": "ck-test",
"api_base": "https://generativelanguage.googleapis.com/v1beta",
"response_cost": 0.001,
"additional_headers": {"llm_provider-ratelimit-requests": "1000"},
}
logging_obj = MagicMock()
logging_obj.litellm_call_id = "call-id-test"
mock_user = MagicMock()
mock_user.tpm_limit = None
mock_user.rpm_limit = None
mock_user.max_budget = None
mock_user.spend = 0.0
mock_user.allowed_model_region = None
proxy_logging_obj = MagicMock(spec=ProxyLogging)
proxy_logging_obj.post_call_response_headers_hook = AsyncMock(
return_value={"x-ratelimit-remaining-requests": "999"}
)
headers = await ProxyBaseLLMRequestProcessing.build_litellm_proxy_success_headers_from_llm_response(
response=_FakeGenaiResponse(),
request_data={"model": "gemini/gemini-1.5-flash"},
request=mock_request,
user_api_key_dict=mock_user,
logging_obj=logging_obj,
version="9.9.9",
proxy_logging_obj=proxy_logging_obj,
)
assert headers["x-litellm-call-id"] == "call-id-test"
assert headers["x-litellm-model-id"] == "deployment-model-id"
assert headers["x-litellm-version"] == "9.9.9"
assert headers["llm_provider-ratelimit-requests"] == "1000"
assert headers["x-ratelimit-remaining-requests"] == "999"
proxy_logging_obj.post_call_response_headers_hook.assert_awaited_once()
@pytest.mark.asyncio
async def test_build_litellm_proxy_success_headers_streaming_style_iterator(self):
"""AsyncGoogleGenAIGenerateContentStreamingIterator sets _hidden_params at init; headers must propagate."""
class _FakeStreamLike:
def __aiter__(self):
return self
async def __anext__(self):
raise StopAsyncIteration
_hidden_params = {
"model_id": "stream-model-id",
"api_base": "https://generativelanguage.googleapis.com/v1beta",
"cache_key": "",
"response_cost": "",
"additional_headers": {"llm_provider-x": "y"},
}
mock_request = MagicMock(spec=Request)
mock_request.headers = {}
logging_obj = MagicMock()
logging_obj.litellm_call_id = "cid-stream"
mock_user = MagicMock()
mock_user.tpm_limit = None
mock_user.rpm_limit = None
mock_user.max_budget = None
mock_user.spend = 0.0
mock_user.allowed_model_region = None
proxy_logging_obj = MagicMock(spec=ProxyLogging)
proxy_logging_obj.post_call_response_headers_hook = AsyncMock(return_value={})
headers = await ProxyBaseLLMRequestProcessing.build_litellm_proxy_success_headers_from_llm_response(
response=_FakeStreamLike(),
request_data={"model": "gemini/gemini-2.0-flash"},
request=mock_request,
user_api_key_dict=mock_user,
logging_obj=logging_obj,
version="1.0.0",
proxy_logging_obj=proxy_logging_obj,
)
assert headers["x-litellm-model-id"] == "stream-model-id"
assert headers["x-litellm-model-api-base"] == (
"https://generativelanguage.googleapis.com/v1beta"
)
assert headers["llm_provider-x"] == "y"
@pytest.mark.asyncio
async def test_build_litellm_proxy_success_headers_no_hidden_params_metadata_fallback(
self,
):
"""When response has no _hidden_params, model_id can still come from litellm_metadata."""
class _BareResponse:
pass
mock_request = MagicMock(spec=Request)
mock_request.headers = {}
logging_obj = MagicMock()
logging_obj.litellm_call_id = "cid-meta"
mock_user = MagicMock()
mock_user.tpm_limit = None
mock_user.rpm_limit = None
mock_user.max_budget = None
mock_user.spend = 0.0
mock_user.allowed_model_region = None
proxy_logging_obj = MagicMock(spec=ProxyLogging)
proxy_logging_obj.post_call_response_headers_hook = AsyncMock(return_value={})
headers = await ProxyBaseLLMRequestProcessing.build_litellm_proxy_success_headers_from_llm_response(
response=_BareResponse(),
request_data={
"model": "gemini/gemini-1.5-flash",
"litellm_metadata": {"model_info": {"id": "meta-model-id"}},
},
request=mock_request,
user_api_key_dict=mock_user,
logging_obj=logging_obj,
version="1.0.0",
proxy_logging_obj=proxy_logging_obj,
)
assert headers["x-litellm-model-id"] == "meta-model-id"
@pytest.mark.asyncio
async def test_add_litellm_data_to_request_with_stream_timeout_header(self):
"""
Test that x-litellm-stream-timeout header gets processed and added to request data
when calling add_litellm_data_to_request.
"""
from litellm.proxy.litellm_pre_call_utils import add_litellm_data_to_request
# Create test data with a basic completion request
test_data = {
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Hello"}],
}
# Mock request with stream timeout header
mock_request = MagicMock(spec=Request)
mock_request.headers = {"x-litellm-stream-timeout": "45.0"}
mock_request.url.path = "/v1/chat/completions"
mock_request.method = "POST"
mock_request.query_params = {}
mock_request.client = None
# Create a minimal mock with just the required attributes
mock_user_api_key_dict = MagicMock()
mock_user_api_key_dict.api_key = "test_api_key_hash"
mock_user_api_key_dict.tpm_limit = None
mock_user_api_key_dict.rpm_limit = None
mock_user_api_key_dict.max_budget = None
mock_user_api_key_dict.spend = 0
mock_user_api_key_dict.allowed_model_region = None
mock_user_api_key_dict.key_alias = None
mock_user_api_key_dict.user_id = None
mock_user_api_key_dict.team_id = None
mock_user_api_key_dict.metadata = {} # Prevent enterprise feature check
mock_user_api_key_dict.team_metadata = None
mock_user_api_key_dict.org_id = None
mock_user_api_key_dict.team_alias = None
mock_user_api_key_dict.end_user_id = None
mock_user_api_key_dict.user_email = None
mock_user_api_key_dict.request_route = None
mock_user_api_key_dict.team_max_budget = None
mock_user_api_key_dict.team_spend = None
mock_user_api_key_dict.model_max_budget = None
mock_user_api_key_dict.parent_otel_span = None
mock_user_api_key_dict.team_model_aliases = None
general_settings = {}
mock_proxy_config = MagicMock()
# Call the actual function that processes headers and adds data
result_data = await add_litellm_data_to_request(
data=test_data,
request=mock_request,
general_settings=general_settings,
user_api_key_dict=mock_user_api_key_dict,
version=None,
proxy_config=mock_proxy_config,
)
# Verify that stream_timeout was extracted from header and added to request data
assert "stream_timeout" in result_data
assert result_data["stream_timeout"] == 45.0
# Verify that the original test data is preserved
assert result_data["model"] == "gpt-3.5-turbo"
assert result_data["messages"] == [{"role": "user", "content": "Hello"}]
def test_get_custom_headers_with_discount_info(self):
"""
Test that discount information is correctly extracted from logging object
and included in response headers.
"""
from litellm.litellm_core_utils.litellm_logging import (
Logging as LiteLLMLoggingObj,
)
# Create mock user API key dict
mock_user_api_key_dict = MagicMock(spec=UserAPIKeyAuth)
mock_user_api_key_dict.tpm_limit = None
mock_user_api_key_dict.rpm_limit = None
mock_user_api_key_dict.max_budget = None
mock_user_api_key_dict.spend = 0
# Create logging object with cost breakdown including discount
logging_obj = LiteLLMLoggingObj(
model="vertex_ai/gemini-pro",
messages=[{"role": "user", "content": "test"}],
stream=False,
call_type="completion",
start_time=None,
litellm_call_id="test-call-id",
function_id="test-function-id",
)
# Set cost breakdown with discount information
logging_obj.set_cost_breakdown(
input_cost=0.00005,
output_cost=0.00005,
total_cost=0.000095, # After 5% discount
cost_for_built_in_tools_cost_usd_dollar=0.0,
original_cost=0.0001,
discount_percent=0.05,
discount_amount=0.000005,
)
# Call get_custom_headers with discount info
headers = ProxyBaseLLMRequestProcessing.get_custom_headers(
user_api_key_dict=mock_user_api_key_dict,
call_id="test-call-id",
response_cost=0.000095,
litellm_logging_obj=logging_obj,
)
# Verify discount headers are present
assert "x-litellm-response-cost" in headers
assert float(headers["x-litellm-response-cost"]) == 0.000095
assert "x-litellm-response-cost-original" in headers
assert float(headers["x-litellm-response-cost-original"]) == 0.0001
assert "x-litellm-response-cost-discount-amount" in headers
assert float(headers["x-litellm-response-cost-discount-amount"]) == 0.000005
def test_get_custom_headers_without_discount_info(self):
"""
Test that when no discount is applied, discount headers are not included.
"""
from litellm.litellm_core_utils.litellm_logging import (
Logging as LiteLLMLoggingObj,
)
# Create mock user API key dict
mock_user_api_key_dict = MagicMock(spec=UserAPIKeyAuth)
mock_user_api_key_dict.tpm_limit = None
mock_user_api_key_dict.rpm_limit = None
mock_user_api_key_dict.max_budget = None
mock_user_api_key_dict.spend = 0
# Create logging object without discount
logging_obj = LiteLLMLoggingObj(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": "test"}],
stream=False,
call_type="completion",
start_time=None,
litellm_call_id="test-call-id",
function_id="test-function-id",
)
# Set cost breakdown without discount information
logging_obj.set_cost_breakdown(
input_cost=0.00005,
output_cost=0.00005,
total_cost=0.0001,
cost_for_built_in_tools_cost_usd_dollar=0.0,
)
# Call get_custom_headers
headers = ProxyBaseLLMRequestProcessing.get_custom_headers(
user_api_key_dict=mock_user_api_key_dict,
call_id="test-call-id",
response_cost=0.0001,
litellm_logging_obj=logging_obj,
)
# Verify discount headers are NOT present
assert "x-litellm-response-cost" in headers
assert float(headers["x-litellm-response-cost"]) == 0.0001
# Discount headers should not be in the final dict
assert "x-litellm-response-cost-original" not in headers
assert "x-litellm-response-cost-discount-amount" not in headers
def test_get_custom_headers_with_margin_info(self):
"""
Test that margin headers are included when margin is applied.
"""
from litellm.litellm_core_utils.litellm_logging import (
Logging as LiteLLMLoggingObj,
)
# Create mock user API key dict
mock_user_api_key_dict = MagicMock(spec=UserAPIKeyAuth)
mock_user_api_key_dict.tpm_limit = None
mock_user_api_key_dict.rpm_limit = None
mock_user_api_key_dict.max_budget = None
mock_user_api_key_dict.spend = 0
# Create logging object with margin
logging_obj = LiteLLMLoggingObj(
model="gpt-4",
messages=[],
stream=False,
call_type="completion",
start_time=None,
litellm_call_id="test-call-id-margin",
function_id="test-function",
)
logging_obj.set_cost_breakdown(
input_cost=0.00005,
output_cost=0.00005,
total_cost=0.00011,
cost_for_built_in_tools_cost_usd_dollar=0.0,
original_cost=0.0001,
margin_percent=0.10,
margin_total_amount=0.00001,
)
headers = ProxyBaseLLMRequestProcessing.get_custom_headers(
user_api_key_dict=mock_user_api_key_dict,
response_cost=0.00011,
litellm_logging_obj=logging_obj,
)
# Verify margin headers are present
assert "x-litellm-response-cost" in headers
assert float(headers["x-litellm-response-cost"]) == 0.00011
assert "x-litellm-response-cost-margin-amount" in headers
assert float(headers["x-litellm-response-cost-margin-amount"]) == 0.00001
assert "x-litellm-response-cost-margin-percent" in headers
assert float(headers["x-litellm-response-cost-margin-percent"]) == 0.10
def test_get_custom_headers_without_margin_info(self):
"""
Test that when no margin is applied, margin headers are not included.
"""
from litellm.litellm_core_utils.litellm_logging import (
Logging as LiteLLMLoggingObj,
)
# Create mock user API key dict
mock_user_api_key_dict = MagicMock(spec=UserAPIKeyAuth)
mock_user_api_key_dict.tpm_limit = None
mock_user_api_key_dict.rpm_limit = None
mock_user_api_key_dict.max_budget = None
mock_user_api_key_dict.spend = 0
# Create logging object without margin
logging_obj = LiteLLMLoggingObj(
model="gpt-4",
messages=[],
stream=False,
call_type="completion",
start_time=None,
litellm_call_id="test-call-id-no-margin",
function_id="test-function",
)
logging_obj.set_cost_breakdown(
input_cost=0.00005,
output_cost=0.00005,
total_cost=0.0001,
cost_for_built_in_tools_cost_usd_dollar=0.0,
)
headers = ProxyBaseLLMRequestProcessing.get_custom_headers(
user_api_key_dict=mock_user_api_key_dict,
response_cost=0.0001,
litellm_logging_obj=logging_obj,
)
# Verify margin headers are not present
assert "x-litellm-response-cost-margin-amount" not in headers
assert "x-litellm-response-cost-margin-percent" not in headers
def test_get_cost_breakdown_from_logging_obj_helper(self):
"""
Test the helper function that extracts cost breakdown information.
"""
from litellm.litellm_core_utils.litellm_logging import (
Logging as LiteLLMLoggingObj,
)
# Test with discount info
logging_obj = LiteLLMLoggingObj(
model="vertex_ai/gemini-pro",
messages=[{"role": "user", "content": "test"}],
stream=False,
call_type="completion",
start_time=None,
litellm_call_id="test-call-id",
function_id="test-function-id",
)
logging_obj.set_cost_breakdown(
input_cost=0.00005,
output_cost=0.00005,
total_cost=0.000095,
cost_for_built_in_tools_cost_usd_dollar=0.0,
original_cost=0.0001,
discount_percent=0.05,
discount_amount=0.000005,
)
(
original_cost,
discount_amount,
margin_total_amount,
margin_percent,
) = _get_cost_breakdown_from_logging_obj(logging_obj)
assert original_cost == 0.0001
assert discount_amount == 0.000005
assert margin_total_amount is None
assert margin_percent is None
# Test with margin info
logging_obj_with_margin = LiteLLMLoggingObj(
model="gpt-4",
messages=[{"role": "user", "content": "test"}],
stream=False,
call_type="completion",
start_time=None,
litellm_call_id="test-call-id-margin",
function_id="test-function-id-margin",
)
logging_obj_with_margin.set_cost_breakdown(
input_cost=0.00005,
output_cost=0.00005,
total_cost=0.00011,
cost_for_built_in_tools_cost_usd_dollar=0.0,
original_cost=0.0001,
margin_percent=0.10,
margin_total_amount=0.00001,
)
(
original_cost,
discount_amount,
margin_total_amount,
margin_percent,
) = _get_cost_breakdown_from_logging_obj(logging_obj_with_margin)
assert original_cost == 0.0001
assert discount_amount is None
assert margin_total_amount == 0.00001
assert margin_percent == 0.10
# Test with no discount or margin info
logging_obj_no_discount = LiteLLMLoggingObj(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": "test"}],
stream=False,
call_type="completion",
start_time=None,
litellm_call_id="test-call-id-2",
function_id="test-function-id-2",
)
logging_obj_no_discount.set_cost_breakdown(
input_cost=0.00005,
output_cost=0.00005,
total_cost=0.0001,
cost_for_built_in_tools_cost_usd_dollar=0.0,
)
(
original_cost,
discount_amount,
margin_total_amount,
margin_percent,
) = _get_cost_breakdown_from_logging_obj(logging_obj_no_discount)
assert original_cost is None
assert discount_amount is None
assert margin_total_amount is None
assert margin_percent is None
# Test with None logging object
(
original_cost,
discount_amount,
margin_total_amount,
margin_percent,
) = _get_cost_breakdown_from_logging_obj(None)
assert original_cost is None
assert discount_amount is None
assert margin_total_amount is None
assert margin_percent is None
def test_get_custom_headers_key_spend_includes_response_cost(self):
"""
Test that x-litellm-key-spend header includes the current request's response_cost.
This ensures that the spend header reflects the updated spend including the current
request, even though spend tracking updates happen asynchronously after the response.
"""
# Create mock user API key dict with initial spend
mock_user_api_key_dict = MagicMock(spec=UserAPIKeyAuth)
mock_user_api_key_dict.tpm_limit = None
mock_user_api_key_dict.rpm_limit = None
mock_user_api_key_dict.max_budget = None
mock_user_api_key_dict.spend = 0.001 # Initial spend: $0.001
# Test case 1: response_cost is provided as float
response_cost_1 = 0.0005 # Current request cost: $0.0005
headers_1 = ProxyBaseLLMRequestProcessing.get_custom_headers(
user_api_key_dict=mock_user_api_key_dict,
call_id="test-call-id-1",
response_cost=response_cost_1,
)
assert "x-litellm-key-spend" in headers_1
expected_spend_1 = 0.001 + 0.0005 # Initial spend + current request cost
assert float(headers_1["x-litellm-key-spend"]) == pytest.approx(
expected_spend_1, abs=1e-10
)
assert float(headers_1["x-litellm-response-cost"]) == response_cost_1
# Test case 2: response_cost is provided as string
response_cost_2 = "0.0003" # Current request cost as string
headers_2 = ProxyBaseLLMRequestProcessing.get_custom_headers(
user_api_key_dict=mock_user_api_key_dict,
call_id="test-call-id-2",
response_cost=response_cost_2,
)
assert "x-litellm-key-spend" in headers_2
expected_spend_2 = 0.001 + 0.0003 # Initial spend + current request cost
assert float(headers_2["x-litellm-key-spend"]) == pytest.approx(
expected_spend_2, abs=1e-10
)
# Test case 3: response_cost is None (should use original spend)
headers_3 = ProxyBaseLLMRequestProcessing.get_custom_headers(
user_api_key_dict=mock_user_api_key_dict,
call_id="test-call-id-3",
response_cost=None,
)
assert "x-litellm-key-spend" in headers_3
assert (
float(headers_3["x-litellm-key-spend"]) == 0.001
) # Should use original spend
# Test case 4: response_cost is 0 (should not change spend)
headers_4 = ProxyBaseLLMRequestProcessing.get_custom_headers(
user_api_key_dict=mock_user_api_key_dict,
call_id="test-call-id-4",
response_cost=0.0,
)
assert "x-litellm-key-spend" in headers_4
assert (
float(headers_4["x-litellm-key-spend"]) == 0.001
) # Should remain unchanged for 0 cost
# Test case 5: user_api_key_dict.spend is None (should default to 0.0)
mock_user_api_key_dict.spend = None
headers_5 = ProxyBaseLLMRequestProcessing.get_custom_headers(
user_api_key_dict=mock_user_api_key_dict,
call_id="test-call-id-5",
response_cost=0.0002,
)
assert "x-litellm-key-spend" in headers_5
assert float(headers_5["x-litellm-key-spend"]) == 0.0002 # 0.0 + 0.0002
# Test case 6: response_cost is negative (should not be added, use original spend)
mock_user_api_key_dict.spend = 0.001
headers_6 = ProxyBaseLLMRequestProcessing.get_custom_headers(
user_api_key_dict=mock_user_api_key_dict,
call_id="test-call-id-6",
response_cost=-0.0001, # Negative cost (should not be added)
)
assert "x-litellm-key-spend" in headers_6
assert (
float(headers_6["x-litellm-key-spend"]) == 0.001
) # Should use original spend
# Test case 7: response_cost is invalid string (should fallback to original spend)
headers_7 = ProxyBaseLLMRequestProcessing.get_custom_headers(
user_api_key_dict=mock_user_api_key_dict,
call_id="test-call-id-7",
response_cost="invalid", # Invalid string
)
assert "x-litellm-key-spend" in headers_7
assert (
float(headers_7["x-litellm-key-spend"]) == 0.001
) # Should use original spend on error
@pytest.mark.asyncio
async def test_queue_time_seconds_is_set_in_metadata(self, monkeypatch):
"""
Test that queue_time_seconds is correctly calculated and stored in metadata
after add_litellm_data_to_request populates arrival_time.
This verifies the fix for the bug where queue_time_seconds was always None
because arrival_time was read BEFORE add_litellm_data_to_request set it.
"""
processing_obj = ProxyBaseLLMRequestProcessing(data={})
mock_request = MagicMock(spec=Request)
mock_request.headers = {}
mock_request.url = MagicMock()
mock_request.url.path = "/v1/chat/completions"
async def mock_add_litellm_data_to_request(*args, **kwargs):
data = kwargs.get("data", args[0] if args else {})
# Simulate what add_litellm_data_to_request does: set arrival_time
import time
data["proxy_server_request"] = {
"url": "/v1/chat/completions",
"method": "POST",
"headers": {},
"body": {},
"arrival_time": time.time() - 0.5, # Simulate request arrived 0.5s ago
}
data["metadata"] = data.get("metadata", {})
return data
async def mock_pre_call_hook(user_api_key_dict, data, call_type):
return copy.deepcopy(data)
mock_proxy_logging_obj = MagicMock(spec=ProxyLogging)
mock_proxy_logging_obj.pre_call_hook = AsyncMock(side_effect=mock_pre_call_hook)
monkeypatch.setattr(
litellm.proxy.common_request_processing,
"add_litellm_data_to_request",
mock_add_litellm_data_to_request,
)
mock_general_settings = {}
mock_user_api_key_dict = MagicMock(spec=UserAPIKeyAuth)
mock_proxy_config = MagicMock(spec=ProxyConfig)
route_type = "acompletion"
(
returned_data,
logging_obj,
) = await processing_obj.common_processing_pre_call_logic(
request=mock_request,
general_settings=mock_general_settings,
user_api_key_dict=mock_user_api_key_dict,
proxy_logging_obj=mock_proxy_logging_obj,
proxy_config=mock_proxy_config,
route_type=route_type,
)
# Verify queue_time_seconds is set and non-negative
metadata = returned_data.get("metadata", {})
assert (
"queue_time_seconds" in metadata
), "queue_time_seconds should be set in metadata"
assert (
metadata["queue_time_seconds"] >= 0.5
), f"queue_time_seconds should be at least 0.5, got {metadata['queue_time_seconds']}"
@pytest.mark.asyncio
class TestCommonRequestProcessingHelpers:
async def consume_stream(self, streaming_response: StreamingResponse) -> list:
content = []
async for chunk_bytes in streaming_response.body_iterator:
content.append(chunk_bytes)
return content
@pytest.mark.parametrize(
"event_line, expected_code",
[
(
'data: {"error": {"code": 400, "message": "bad request"}}',
400,
), # Valid integer code
(
'data: {"error": {"code": "401", "message": "unauthorized"}}',
401,
), # Valid string-integer code
(
'data: {"error": {"code": "invalid_code", "message": "error"}}',
None,
), # Invalid string code
(
'data: {"error": {"code": 99, "message": "too low"}}',
None,
), # Integer code too low
(
'data: {"error": {"code": 600, "message": "too high"}}',
None,
), # Integer code too high
(
'data: {"id": "123", "content": "hello"}',
None,
), # Non-error SSE event
("data: [DONE]", None), # SSE [DONE] event
("data: ", None), # SSE empty data event
(
'data: {"error": {"code": 400',
None,
), # Malformed JSON
("id: 123", None), # Non-SSE event line
(
'data: {"error": {"message": "some error"}}',
None,
), # Error event without 'code' field
(
'data: {"error": {"code": null, "message": "code is null"}}',
None,
), # Error with null code
],
)
async def test_parse_event_data_for_error(self, event_line, expected_code):
assert await _parse_event_data_for_error(event_line) == expected_code
async def test_create_streaming_response_first_chunk_is_error(self):
"""
Test that when the first chunk is an error, a JSON error response is returned
instead of an SSE streaming response
"""
async def mock_generator():
yield 'data: {"error": {"code": 403, "message": "forbidden"}}\n\n'
yield 'data: {"content": "more data"}\n\n'
yield "data: [DONE]\n\n"
response = await create_response(mock_generator(), "text/event-stream", {})
# Should return JSONResponse instead of StreamingResponse
assert isinstance(response, JSONResponse)
assert response.status_code == status.HTTP_403_FORBIDDEN
# Verify the response is in standard JSON error format
import json
body = json.loads(response.body.decode())
assert "error" in body
assert body["error"]["code"] == 403
assert body["error"]["message"] == "forbidden"
async def test_create_streaming_response_first_chunk_not_error(self):
async def mock_generator():
yield 'data: {"content": "first part"}\n\n'
yield 'data: {"content": "second part"}\n\n'
yield "data: [DONE]\n\n"
response = await create_response(mock_generator(), "text/event-stream", {})
assert response.status_code == status.HTTP_200_OK
content = await self.consume_stream(response)
assert content == [
'data: {"content": "first part"}\n\n',
'data: {"content": "second part"}\n\n',
"data: [DONE]\n\n",
]
async def test_create_streaming_response_empty_generator(self):
async def mock_generator():
if False: # Never yields
yield
# Implicitly raises StopAsyncIteration
response = await create_response(mock_generator(), "text/event-stream", {})
assert response.status_code == status.HTTP_200_OK
content = await self.consume_stream(response)
assert content == []
async def test_create_streaming_response_generator_raises_stop_async_iteration_immediately(
self,
):
mock_gen = AsyncMock()
mock_gen.__anext__.side_effect = StopAsyncIteration
response = await create_response(mock_gen, "text/event-stream", {})
assert response.status_code == status.HTTP_200_OK
content = await self.consume_stream(response)
assert content == []
async def test_create_streaming_response_generator_raises_unexpected_exception(
self,
):
mock_gen = AsyncMock()
mock_gen.__anext__.side_effect = ValueError("Test error from generator")
response = await create_response(mock_gen, "text/event-stream", {})
assert response.status_code == status.HTTP_500_INTERNAL_SERVER_ERROR
content = await self.consume_stream(response)
# Streaming SSE error frame now mirrors ProxyException.to_dict() shape
# so streaming and non-streaming surfaces emit byte-identical errors.
expected_error_data = {
"error": {
"message": "Error processing stream start",
"type": "None",
"param": "None",
"code": str(status.HTTP_500_INTERNAL_SERVER_ERROR),
}
}
assert len(content) == 2
import json
assert content[0] == f"data: {json.dumps(expected_error_data)}\n\n"
assert content[1] == "data: [DONE]\n\n"
async def test_create_streaming_response_generator_raises_http_exception(
self,
):
"""
Test that when a generator raises HTTPException, the response preserves
the original status code instead of hardcoding 500.
"""
mock_gen = AsyncMock()
mock_gen.__anext__.side_effect = HTTPException(
status_code=400, detail="Content blocked by guardrail"
)
response = await create_response(mock_gen, "text/event-stream", {})
assert response.status_code == 400
content = await self.consume_stream(response)
import json
expected_error_data = {
"error": {
"message": "Content blocked by guardrail",
"type": "None",
"param": "None",
"code": "400",
}
}
assert len(content) == 2
assert content[0] == f"data: {json.dumps(expected_error_data)}\n\n"
assert content[1] == "data: [DONE]\n\n"
async def test_create_streaming_response_http_exception_dict_detail_bedrock_shape(
self,
):
"""
Bedrock-style dict detail (with the post-L3 shape) must be preserved as
structured `provider_specific_fields` in the SSE error frame, not stringified
into a Python-repr blob inside `error.message`. Regression for case
2026-04-10-internal-bedrock-guardrail-streaming-error.
"""
import json
mock_gen = AsyncMock()
mock_gen.__anext__.side_effect = HTTPException(
status_code=400,
detail={
"error": "Violated guardrail policy",
"bedrock_guardrail_response": "Sorry, the model cannot answer this question. Prompt is blocked",
"guardrailIdentifier": "amgllac6xf3r",
"guardrailVersion": "1",
"assessments": [
{
"policy": "sensitiveInformationPolicy",
"matches": [
{
"category": "piiEntities",
"type": "NAME",
"action": "BLOCKED",
"match": "Jack",
}
],
}
],
"guardrail_name": "bedrock-pii-guard",
"guardrail_mode": "post_call",
},
)
response = await create_response(mock_gen, "text/event-stream", {})
assert response.status_code == 400
content = await self.consume_stream(response)
assert len(content) == 2
assert content[1] == "data: [DONE]\n\n"
payload = json.loads(content[0][len("data: ") :].strip())
assert payload["error"]["message"] == "Violated guardrail policy"
assert payload["error"]["code"] == "400"
psf = payload["error"]["provider_specific_fields"]
assert psf["guardrail_name"] == "bedrock-pii-guard"
assert psf["guardrail_mode"] == "post_call"
assert psf["guardrailIdentifier"] == "amgllac6xf3r"
assert psf["assessments"][0]["policy"] == "sensitiveInformationPolicy"
assert psf["assessments"][0]["matches"][0]["type"] == "NAME"
async def test_create_streaming_response_http_exception_dict_detail_nested_error_shape(
self,
):
"""PANW Prisma AIRS-style nested `{"error": {"message": ...}}` detail must
extract `error.message` as the human-readable summary while preserving the
full payload."""
import json
mock_gen = AsyncMock()
mock_gen.__anext__.side_effect = HTTPException(
status_code=400,
detail={
"error": {
"message": "MCP request blocked: no rewritable argument field present",
"type": "guardrail_violation",
"code": "panw_prisma_airs_blocked",
}
},
)
response = await create_response(mock_gen, "text/event-stream", {})
content = await self.consume_stream(response)
payload = json.loads(content[0][len("data: ") :].strip())
assert (
payload["error"]["message"]
== "MCP request blocked: no rewritable argument field present"
)
assert (
payload["error"]["provider_specific_fields"]["error"]["code"]
== "panw_prisma_airs_blocked"
)
async def test_serialize_http_exception_detail_helper(self):
"""Direct unit coverage for the L1 helper across all branches."""
from litellm.proxy.common_request_processing import (
_serialize_http_exception_detail,
)
import json as _json
assert _serialize_http_exception_detail("plain") == ("plain", None)
msg, fields = _serialize_http_exception_detail(
{"error": "Violated", "extra": "x"}
)
assert msg == "Violated"
assert fields == {"error": "Violated", "extra": "x"}
msg, fields = _serialize_http_exception_detail(
{"error": {"message": "blocked", "code": "x"}}
)
assert msg == "blocked"
assert fields == {"error": {"message": "blocked", "code": "x"}}
msg, fields = _serialize_http_exception_detail({"message": "top-level"})
assert msg == "top-level"
assert fields == {"message": "top-level"}
msg, fields = _serialize_http_exception_detail({"weird": ["a", "b"]})
assert msg == _json.dumps({"weird": ["a", "b"]})
assert fields == {"weird": ["a", "b"]}
assert _serialize_http_exception_detail(42) == ("42", None)
async def test_create_streaming_response_first_chunk_error_string_code(self):
"""
Test that when the first chunk contains a string error code, a JSON error response is returned
"""
async def mock_generator():
yield 'data: {"error": {"code": "429", "message": "too many requests"}}\n\n'
yield "data: [DONE]\n\n"
response = await create_response(mock_generator(), "text/event-stream", {})
assert isinstance(response, JSONResponse)
assert response.status_code == status.HTTP_429_TOO_MANY_REQUESTS
# Verify the response is in standard JSON error format
import json
body = json.loads(response.body.decode())
assert "error" in body
assert body["error"]["code"] == "429"
assert body["error"]["message"] == "too many requests"
async def test_create_streaming_response_custom_headers(self):
async def mock_generator():
yield 'data: {"content": "data"}\n\n'
yield "data: [DONE]\n\n"
custom_headers = {"X-Custom-Header": "TestValue"}
response = await create_response(
mock_generator(), "text/event-stream", custom_headers
)
assert response.headers["x-custom-header"] == "TestValue"
async def test_create_streaming_response_disables_proxy_buffering(self):
"""Regression for #28384: every StreamingResponse create_response returns
must carry the headers that stop nginx/ingress/Envoy from buffering the
SSE stream into one batch, while preserving caller-supplied headers."""
async def normal_stream():
yield 'data: {"content": "part"}\n\n'
yield "data: [DONE]\n\n"
async def empty_stream():
if False: # never yields -> StopAsyncIteration
yield
error_stream = AsyncMock()
error_stream.__anext__.side_effect = ValueError("boom")
for generator in (normal_stream(), empty_stream(), error_stream):
response = await create_response(
generator, "text/event-stream", {"X-Custom-Header": "keep"}
)
assert isinstance(response, StreamingResponse)
assert response.headers["x-accel-buffering"] == "no"
assert response.headers["cache-control"] == "no-cache"
assert response.headers["x-custom-header"] == "keep"
async def test_create_streaming_response_non_default_status_code(self):
async def mock_generator():
yield 'data: {"content": "data"}\n\n'
yield "data: [DONE]\n\n"
response = await create_response(
mock_generator(),
"text/event-stream",
{},
default_status_code=status.HTTP_201_CREATED,
)
assert response.status_code == status.HTTP_201_CREATED
content = await self.consume_stream(response)
assert content == [
'data: {"content": "data"}\n\n',
"data: [DONE]\n\n",
]
async def test_create_streaming_response_first_chunk_is_done(self):
async def mock_generator():
yield "data: [DONE]\n\n"
response = await create_response(mock_generator(), "text/event-stream", {})
assert response.status_code == status.HTTP_200_OK # Default status
content = await self.consume_stream(response)
assert content == ["data: [DONE]\n\n"]
async def test_create_streaming_response_first_chunk_is_empty_data(self):
async def mock_generator():
yield "data: \n\n"
yield 'data: {"content": "actual data"}\n\n'
yield "data: [DONE]\n\n"
response = await create_response(mock_generator(), "text/event-stream", {})
assert response.status_code == status.HTTP_200_OK # Default status
content = await self.consume_stream(response)
assert content == [
"data: \n\n",
'data: {"content": "actual data"}\n\n',
"data: [DONE]\n\n",
]
async def test_create_streaming_response_all_chunks_have_dd_trace(self):
"""Test that all stream chunks are wrapped with dd trace at the streaming generator level"""
from unittest.mock import patch
# Create a mock tracer
mock_tracer = MagicMock()
mock_span = MagicMock()
mock_tracer.trace.return_value.__enter__.return_value = mock_span
mock_tracer.trace.return_value.__exit__.return_value = None
# Mock generator with multiple chunks
async def mock_generator():
yield 'data: {"content": "chunk 1"}\n\n'
yield 'data: {"content": "chunk 2"}\n\n'
yield 'data: {"content": "chunk 3"}\n\n'
yield "data: [DONE]\n\n"
# Patch the tracer in the common_request_processing module. The
# per-chunk span is gated on _DD_STREAMING_TRACE_ENABLED (resolved at
# import from the real tracer, a NullTracer by default), so enable it
# explicitly to exercise the tracing path.
with (
patch("litellm.proxy.common_request_processing.tracer", mock_tracer),
patch(
"litellm.proxy.common_request_processing._DD_STREAMING_TRACE_ENABLED",
True,
),
):
response = await create_response(mock_generator(), "text/event-stream", {})
assert response.status_code == 200
# Consume the stream to trigger the tracer calls
content = await self.consume_stream(response)
# Verify all chunks are present
assert len(content) == 4
assert content[0] == 'data: {"content": "chunk 1"}\n\n'
assert content[1] == 'data: {"content": "chunk 2"}\n\n'
assert content[2] == 'data: {"content": "chunk 3"}\n\n'
assert content[3] == "data: [DONE]\n\n"
# Verify that tracer.trace was called for each chunk (4 chunks total)
assert mock_tracer.trace.call_count == 4
# Verify that each call was made with the correct operation name
actual_calls = mock_tracer.trace.call_args_list
assert len(actual_calls) == 4
for i, call in enumerate(actual_calls):
args, kwargs = call
assert (
args[0] == "streaming.chunk.yield"
), f"Call {i} should have operation name 'streaming.chunk.yield', got {args[0]}"
async def test_create_streaming_response_skips_dd_trace_when_disabled(self):
"""When DD tracing is disabled (the default), the per-chunk span
context manager is skipped entirely but all chunks still stream."""
from unittest.mock import patch
mock_tracer = MagicMock()
async def mock_generator():
yield 'data: {"content": "chunk 1"}\n\n'
yield 'data: {"content": "chunk 2"}\n\n'
yield "data: [DONE]\n\n"
with (
patch("litellm.proxy.common_request_processing.tracer", mock_tracer),
patch(
"litellm.proxy.common_request_processing._DD_STREAMING_TRACE_ENABLED",
False,
),
):
response = await create_response(mock_generator(), "text/event-stream", {})
assert response.status_code == 200
content = await self.consume_stream(response)
# All chunks stream through unchanged ...
assert content == [
'data: {"content": "chunk 1"}\n\n',
'data: {"content": "chunk 2"}\n\n',
"data: [DONE]\n\n",
]
# ... but no per-chunk span was created.
assert mock_tracer.trace.call_count == 0
async def test_create_streaming_response_dd_trace_with_error_chunk(self):
"""
Test that when the first chunk contains an error, JSONResponse is returned
and tracing is not triggered (since it's not a streaming response)
"""
from unittest.mock import patch
# Create a mock tracer
mock_tracer = MagicMock()
mock_span = MagicMock()
mock_tracer.trace.return_value.__enter__.return_value = mock_span
mock_tracer.trace.return_value.__exit__.return_value = None
# Mock generator with error in first chunk
async def mock_generator():
yield 'data: {"error": {"code": 400, "message": "bad request"}}\n\n'
yield 'data: {"content": "chunk after error"}\n\n'
yield "data: [DONE]\n\n"
# Patch the tracer in the common_request_processing module
with patch("litellm.proxy.common_request_processing.tracer", mock_tracer):
response = await create_response(mock_generator(), "text/event-stream", {})
# Should return JSONResponse instead of StreamingResponse
assert isinstance(response, JSONResponse)
assert response.status_code == 400
# Verify the response is in standard JSON error format
import json
body = json.loads(response.body.decode())
assert "error" in body
assert body["error"]["code"] == 400
assert body["error"]["message"] == "bad request"
# Since JSONResponse is returned instead of StreamingResponse, streaming tracing should not be triggered
# tracer.trace should not be called
assert mock_tracer.trace.call_count == 0
class TestExtractErrorFromSSEChunk:
"""Tests for _extract_error_from_sse_chunk function"""
def test_extract_error_from_sse_chunk_with_valid_error(self):
"""Test extracting error information from a standard SSE chunk"""
chunk = 'data: {"error": {"code": 403, "message": "forbidden", "type": "auth_error", "param": "api_key"}}\n\n'
error = _extract_error_from_sse_chunk(chunk)
assert error["code"] == 403
assert error["message"] == "forbidden"
assert error["type"] == "auth_error"
assert error["param"] == "api_key"
def test_extract_error_from_sse_chunk_with_string_code(self):
"""Test error code as string type"""
chunk = 'data: {"error": {"code": "429", "message": "too many requests"}}\n\n'
error = _extract_error_from_sse_chunk(chunk)
assert error["code"] == "429"
assert error["message"] == "too many requests"
def test_extract_error_from_sse_chunk_with_bytes(self):
"""Test input as bytes type"""
chunk = b'data: {"error": {"code": 500, "message": "internal error"}}\n\n'
error = _extract_error_from_sse_chunk(chunk)
assert error["code"] == 500
assert error["message"] == "internal error"
def test_extract_error_from_sse_chunk_with_done(self):
"""Test [DONE] marker should return default error"""
chunk = "data: [DONE]\n\n"
error = _extract_error_from_sse_chunk(chunk)
assert error["message"] == "Unknown error"
assert error["type"] == "internal_server_error"
assert error["code"] == "500"
assert error["param"] is None
def test_extract_error_from_sse_chunk_without_error_field(self):
"""Test missing error field should return default error"""
chunk = 'data: {"content": "some content"}\n\n'
error = _extract_error_from_sse_chunk(chunk)
assert error["message"] == "Unknown error"
assert error["type"] == "internal_server_error"
assert error["code"] == "500"
def test_extract_error_from_sse_chunk_with_invalid_json(self):
"""Test invalid JSON should return default error"""
chunk = "data: {invalid json}\n\n"
error = _extract_error_from_sse_chunk(chunk)
assert error["message"] == "Unknown error"
assert error["type"] == "internal_server_error"
assert error["code"] == "500"
def test_extract_error_from_sse_chunk_without_data_prefix(self):
"""Test missing 'data:' prefix should return default error"""
chunk = '{"error": {"code": 400, "message": "bad request"}}\n\n'
error = _extract_error_from_sse_chunk(chunk)
assert error["message"] == "Unknown error"
assert error["type"] == "internal_server_error"
assert error["code"] == "500"
def test_extract_error_from_sse_chunk_with_empty_string(self):
"""Test empty string should return default error"""
chunk = ""
error = _extract_error_from_sse_chunk(chunk)
assert error["message"] == "Unknown error"
assert error["type"] == "internal_server_error"
assert error["code"] == "500"
def test_extract_error_from_sse_chunk_with_minimal_error(self):
"""Test minimal error object"""
chunk = 'data: {"error": {"message": "error occurred"}}\n\n'
error = _extract_error_from_sse_chunk(chunk)
assert error["message"] == "error occurred"
# Other fields should be obtained from the original error object (if exists)
class TestOverrideOpenAIResponseModel:
"""Tests for _override_openai_response_model function"""
def test_override_model_preserves_fallback_model_when_fallback_occurred_object(
self,
):
"""
Test that when a fallback occurred (x-litellm-attempted-fallbacks > 0),
the actual model used (fallback model) is preserved instead of being
overridden with the requested model.
This is the regression test to ensure the model being called is properly
displayed when a fallback happens.
"""
requested_model = "gpt-4"
fallback_model = "gpt-3.5-turbo"
# Create a mock object response with fallback model
# _hidden_params is an attribute (not a dict key) accessed via getattr
response_obj = MagicMock()
response_obj.model = fallback_model
response_obj._hidden_params = {
"additional_headers": {"x-litellm-attempted-fallbacks": 1}
}
# Call the function - should preserve fallback model
_override_openai_response_model(
response_obj=response_obj,
requested_model=requested_model,
log_context="test_context",
)
# Verify the model was NOT overridden - should still be the fallback model
assert response_obj.model == fallback_model
assert response_obj.model != requested_model
def test_override_model_preserves_fallback_model_multiple_fallbacks(self):
"""
Test that when multiple fallbacks occurred, the actual model used
(fallback model) is preserved.
"""
requested_model = "gpt-4"
fallback_model = "claude-haiku-4-5-20251001"
# Create a mock object response with fallback model
response_obj = MagicMock()
response_obj.model = fallback_model
response_obj._hidden_params = {
"additional_headers": {
"x-litellm-attempted-fallbacks": 2 # Multiple fallbacks
}
}
# Call the function - should preserve fallback model
_override_openai_response_model(
response_obj=response_obj,
requested_model=requested_model,
log_context="test_context",
)
# Verify the model was NOT overridden - should still be the fallback model
assert response_obj.model == fallback_model
assert response_obj.model != requested_model
def test_override_model_overrides_when_no_fallback_dict(self):
"""
Test that when no fallback occurred, the model is overridden
to match the requested model (dict response).
"""
requested_model = "gpt-4"
downstream_model = "gpt-3.5-turbo"
# Create a dict response without fallback
# For dict responses, _hidden_params won't be found via getattr,
# so the fallback check won't trigger and model will be overridden
response_obj = {"model": downstream_model}
# Call the function - should override to requested model
_override_openai_response_model(
response_obj=response_obj,
requested_model=requested_model,
log_context="test_context",
)
# Verify the model WAS overridden to requested model
assert response_obj["model"] == requested_model
def test_override_model_overrides_when_no_fallback_object(self):
"""
Test that when no fallback occurred (object response), the model is overridden
to match the requested model.
"""
requested_model = "gpt-4"
downstream_model = "gpt-3.5-turbo"
# Create a mock object response without fallback
response_obj = MagicMock()
response_obj.model = downstream_model
response_obj._hidden_params = {
"additional_headers": {} # No attempted_fallbacks header
}
# Call the function - should override to requested model
_override_openai_response_model(
response_obj=response_obj,
requested_model=requested_model,
log_context="test_context",
)
# Verify the model WAS overridden to requested model
assert response_obj.model == requested_model
def test_override_model_overrides_when_attempted_fallbacks_is_zero(self):
"""
Test that when attempted_fallbacks is 0 (no fallback occurred),
the model is overridden to match the requested model.
"""
requested_model = "gpt-4"
downstream_model = "gpt-3.5-turbo"
# Create a mock object response
response_obj = MagicMock()
response_obj.model = downstream_model
response_obj._hidden_params = {
"additional_headers": {
"x-litellm-attempted-fallbacks": 0 # Zero means no fallback occurred
}
}
# Call the function - should override to requested model
_override_openai_response_model(
response_obj=response_obj,
requested_model=requested_model,
log_context="test_context",
)
# Verify the model WAS overridden to requested model
assert response_obj.model == requested_model
def test_override_model_overrides_when_attempted_fallbacks_is_none(self):
"""
Test that when attempted_fallbacks is None (not set),
the model is overridden to match the requested model.
"""
requested_model = "gpt-4"
downstream_model = "gpt-3.5-turbo"
# Create a mock object response
response_obj = MagicMock()
response_obj.model = downstream_model
response_obj._hidden_params = {
"additional_headers": {"x-litellm-attempted-fallbacks": None}
}
# Call the function - should override to requested model
_override_openai_response_model(
response_obj=response_obj,
requested_model=requested_model,
log_context="test_context",
)
# Verify the model WAS overridden to requested model
assert response_obj.model == requested_model
def test_override_model_no_hidden_params(self):
"""
Test that when _hidden_params is not present, the model is overridden
to match the requested model.
"""
requested_model = "gpt-4"
downstream_model = "gpt-3.5-turbo"
# Create a mock object response without _hidden_params
response_obj = MagicMock()
response_obj.model = downstream_model
# Don't set _hidden_params - getattr will return {}
# Call the function - should override to requested model
_override_openai_response_model(
response_obj=response_obj,
requested_model=requested_model,
log_context="test_context",
)
# Verify the model WAS overridden to requested model
assert response_obj.model == requested_model
def test_override_model_no_requested_model(self):
"""
Test that when requested_model is None or empty, the function returns early
without modifying the response.
"""
fallback_model = "gpt-3.5-turbo"
# Create a mock object response
response_obj = MagicMock()
response_obj.model = fallback_model
response_obj._hidden_params = {
"additional_headers": {"x-litellm-attempted-fallbacks": 1}
}
# Call the function with None requested_model
_override_openai_response_model(
response_obj=response_obj,
requested_model=None,
log_context="test_context",
)
# Verify the model was not changed
assert response_obj.model == fallback_model
# Call with empty string
_override_openai_response_model(
response_obj=response_obj,
requested_model="",
log_context="test_context",
)
# Verify the model was not changed
assert response_obj.model == fallback_model
def test_override_model_preserves_azure_model_router_actual_model(self):
"""
Test that when the requested model is an Azure Model Router, the actual
model used (returned in the response) is preserved instead of being
overridden.
"""
requested_model = "azure_ai/model_router"
actual_model_used = "azure_ai/gpt-5-nano-2025-08-07"
response_obj = MagicMock()
response_obj.model = actual_model_used
response_obj._hidden_params = {"additional_headers": {}}
_override_openai_response_model(
response_obj=response_obj,
requested_model=requested_model,
log_context="test_context",
)
assert response_obj.model == actual_model_used
assert response_obj.model != requested_model
def test_override_model_preserves_azure_model_router_with_deployment_name(self):
"""
Test that Azure Model Router with deployment name pattern also preserves
the actual model used.
"""
requested_model = "azure_ai/model_router/my-deployment"
actual_model_used = "azure_ai/gpt-4.1-nano-2025-04-14"
response_obj = MagicMock()
response_obj.model = actual_model_used
response_obj._hidden_params = {"additional_headers": {}}
_override_openai_response_model(
response_obj=response_obj,
requested_model=requested_model,
log_context="test_context",
)
assert response_obj.model == actual_model_used
assert response_obj.model != requested_model
def test_override_model_preserves_azure_model_router_with_hyphen(self):
"""
Test that Azure Model Router with hyphen pattern (model-router) also preserves
the actual model used.
"""
requested_model = "azure_ai/model-router"
actual_model_used = "azure_ai/gpt-5-nano-2025-08-07"
response_obj = MagicMock()
response_obj.model = actual_model_used
response_obj._hidden_params = {"additional_headers": {}}
_override_openai_response_model(
response_obj=response_obj,
requested_model=requested_model,
log_context="test_context",
)
assert response_obj.model == actual_model_used
assert response_obj.model != requested_model
def test_override_model_uses_winning_model_for_fastest_response(self):
"""
Test that when fastest_response batch completion is used with a
comma-separated model list, the response model is set to the winning
model's group name (not the comma-separated list).
"""
requested_model = "openai/gpt-4o,gemini/gemini-2.5-flash"
winning_model_group = "gemini/gemini-2.5-flash"
downstream_model = "gemini-2.5-flash"
response_obj = MagicMock()
response_obj.model = downstream_model
response_obj._hidden_params = {
"fastest_response_batch_completion": True,
"additional_headers": {
"x-litellm-model-group": winning_model_group,
},
}
_override_openai_response_model(
response_obj=response_obj,
requested_model=requested_model,
log_context="test_context",
)
assert response_obj.model == winning_model_group
assert response_obj.model != requested_model
def test_override_model_preserves_response_when_fastest_response_no_model_group(
self,
):
"""
Test that when fastest_response is set but no model group header is
available, the actual downstream model is preserved.
"""
requested_model = "openai/gpt-4o,gemini/gemini-2.5-flash"
downstream_model = "gpt-4o-2024-08-06"
response_obj = MagicMock()
response_obj.model = downstream_model
response_obj._hidden_params = {
"fastest_response_batch_completion": True,
"additional_headers": {},
}
_override_openai_response_model(
response_obj=response_obj,
requested_model=requested_model,
log_context="test_context",
)
assert response_obj.model == downstream_model
def test_override_model_normal_when_fastest_response_not_set(self):
"""
Test that when fastest_response_batch_completion is not set, the
normal override behavior applies (model is set to requested_model).
"""
requested_model = "openai/gpt-4o"
downstream_model = "gpt-4o-2024-08-06"
response_obj = MagicMock()
response_obj.model = downstream_model
response_obj._hidden_params = {
"additional_headers": {
"x-litellm-model-group": "openai/gpt-4o",
},
}
_override_openai_response_model(
response_obj=response_obj,
requested_model=requested_model,
log_context="test_context",
)
assert response_obj.model == requested_model
class TestIsAzureModelRouterRequest:
"""Tests for _is_azure_model_router_request helper"""
def test_detects_model_router_with_underscore(self):
assert _is_azure_model_router_request("azure_ai/model_router") is True
assert (
_is_azure_model_router_request("azure_ai/model_router/my-deployment")
is True
)
def test_detects_model_router_with_hyphen(self):
assert _is_azure_model_router_request("azure_ai/model-router") is True
assert _is_azure_model_router_request("model-router") is True
def test_rejects_regular_models(self):
assert _is_azure_model_router_request("azure_ai/gpt-4") is False
assert _is_azure_model_router_request("gpt-4") is False
assert _is_azure_model_router_request("openai/gpt-3.5-turbo") is False
class TestStreamingOverheadHeader:
"""
Tests that x-litellm-overhead-duration-ms is emitted in streaming responses.
Regression tests for: streaming requests not including overhead header.
"""
def test_get_custom_headers_includes_overhead_when_set(self):
"""
get_custom_headers() returns x-litellm-overhead-duration-ms
when litellm_overhead_time_ms is in hidden_params.
"""
mock_user_api_key_dict = MagicMock(spec=UserAPIKeyAuth)
mock_user_api_key_dict.tpm_limit = None
mock_user_api_key_dict.rpm_limit = None
mock_user_api_key_dict.max_budget = None
mock_user_api_key_dict.spend = 0.0
mock_user_api_key_dict.allowed_model_region = None
hidden_params = {
"litellm_overhead_time_ms": 42.5,
"_response_ms": 500.0,
"model_id": "test-model-id",
"api_base": "https://api.openai.com",
}
headers = ProxyBaseLLMRequestProcessing.get_custom_headers(
user_api_key_dict=mock_user_api_key_dict,
call_id="test-call-id",
model_id="test-model-id",
cache_key="",
api_base="https://api.openai.com",
version="1.0.0",
response_cost=0.001,
model_region="",
hidden_params=hidden_params,
)
assert "x-litellm-overhead-duration-ms" in headers
assert headers["x-litellm-overhead-duration-ms"] == "42.5"
def test_get_custom_headers_omits_overhead_when_none(self):
"""
get_custom_headers() omits x-litellm-overhead-duration-ms
when litellm_overhead_time_ms is not in hidden_params.
"""
mock_user_api_key_dict = MagicMock(spec=UserAPIKeyAuth)
mock_user_api_key_dict.tpm_limit = None
mock_user_api_key_dict.rpm_limit = None
mock_user_api_key_dict.max_budget = None
mock_user_api_key_dict.spend = 0.0
mock_user_api_key_dict.allowed_model_region = None
hidden_params = {
"_response_ms": 500.0,
"model_id": "test-model-id",
}
headers = ProxyBaseLLMRequestProcessing.get_custom_headers(
user_api_key_dict=mock_user_api_key_dict,
call_id="test-call-id",
model_id="test-model-id",
cache_key="",
api_base="https://api.openai.com",
version="1.0.0",
response_cost=0.001,
model_region="",
hidden_params=hidden_params,
)
# Should be absent (None gets filtered by exclude_values)
assert "x-litellm-overhead-duration-ms" not in headers
def test_update_response_metadata_sets_overhead_on_stream_wrapper(self):
"""
update_response_metadata() sets litellm_overhead_time_ms on
a streaming response's _hidden_params when llm_api_duration_ms is available.
"""
from litellm.litellm_core_utils.llm_response_utils.response_metadata import (
update_response_metadata,
)
# Mock the logging object with llm_api_duration_ms set
mock_logging_obj = MagicMock()
mock_logging_obj.model_call_details = {
"llm_api_duration_ms": 200.0,
"litellm_params": {},
}
mock_logging_obj.caching_details = None
mock_logging_obj.callback_duration_ms = None
mock_logging_obj.litellm_call_id = "test-call-id"
mock_logging_obj._response_cost_calculator = MagicMock(return_value=0.001)
# Simulate a streaming result object with _hidden_params (like CustomStreamWrapper)
stream_result = MagicMock()
stream_result._hidden_params = {
"model_id": "test-model-id",
"api_base": "https://api.openai.com",
"additional_headers": {},
}
start_time = datetime.datetime.now() - datetime.timedelta(milliseconds=300)
end_time = datetime.datetime.now()
update_response_metadata(
result=stream_result,
logging_obj=mock_logging_obj,
model="gpt-4o",
kwargs={},
start_time=start_time,
end_time=end_time,
)
assert "litellm_overhead_time_ms" in stream_result._hidden_params
overhead = stream_result._hidden_params["litellm_overhead_time_ms"]
assert overhead is not None
assert isinstance(overhead, float)
# overhead = total_response_ms (~300ms) - llm_api_duration_ms (200ms) = ~100ms
assert overhead > 0
@pytest.mark.asyncio
async def test_streaming_response_includes_overhead_header(self):
"""
StreamingResponse returned by create_response() includes
x-litellm-overhead-duration-ms in its headers.
"""
async def mock_generator() -> AsyncGenerator[str, None]:
yield 'data: {"id":"chatcmpl-test","choices":[{"delta":{"content":"hi"}}]}\n\n'
yield "data: [DONE]\n\n"
headers = {
"x-litellm-overhead-duration-ms": "42.5",
"x-litellm-call-id": "test-call-id",
"x-litellm-model-id": "test-model-id",
}
response = await create_response(
generator=mock_generator(),
media_type="text/event-stream",
headers=headers,
)
assert isinstance(response, StreamingResponse)
assert response.headers.get("x-litellm-overhead-duration-ms") == "42.5"
def test_streaming_overhead_header_in_custom_headers_from_stream_hidden_params(
self,
):
"""
Verifies that when get_custom_headers() is called with a streaming
response's hidden_params (containing litellm_overhead_time_ms),
the x-litellm-overhead-duration-ms header is correctly populated.
This tests the critical path: update_response_metadata sets the value
→ get_custom_headers reads it → StreamingResponse header is set.
"""
mock_user_api_key_dict = MagicMock(spec=UserAPIKeyAuth)
mock_user_api_key_dict.tpm_limit = None
mock_user_api_key_dict.rpm_limit = None
mock_user_api_key_dict.max_budget = None
mock_user_api_key_dict.spend = 0.0
mock_user_api_key_dict.allowed_model_region = None
# This is what CustomStreamWrapper._hidden_params looks like after
# update_response_metadata() has been called on it
hidden_params = {
"model_id": "openai-gpt4o-deployment",
"api_base": "https://api.openai.com",
"additional_headers": {},
"litellm_overhead_time_ms": 55.3, # set by update_response_metadata
"_response_ms": 280.0,
"litellm_call_id": "test-call-id",
"response_cost": 0.002,
"cache_key": None,
"fastest_response_batch_completion": None,
"callback_duration_ms": None,
}
custom_headers = ProxyBaseLLMRequestProcessing.get_custom_headers(
user_api_key_dict=mock_user_api_key_dict,
call_id="test-call-id",
model_id=hidden_params.get("model_id"),
cache_key=hidden_params.get("cache_key") or "",
api_base=hidden_params.get("api_base") or "",
version="1.0.0",
response_cost=hidden_params.get("response_cost"),
model_region="",
hidden_params=hidden_params,
)
# The overhead header must be present and correct
assert "x-litellm-overhead-duration-ms" in custom_headers, (
"x-litellm-overhead-duration-ms header must be emitted during streaming. "
"It was missing — this is the streaming overhead header regression."
)
assert custom_headers["x-litellm-overhead-duration-ms"] == "55.3"
class TestDDSpanTaggerTagRequest:
"""Tests for DDSpanTagger.tag_request - key/model DD span tagging."""
def _make_user_api_key_dict(self, key_alias=None, token=None):
from litellm.proxy._types import UserAPIKeyAuth
d = UserAPIKeyAuth()
d.key_alias = key_alias
d.token = token
return d
def test_tags_key_alias_and_model(self):
"""key_alias and requested_model are set on the span when present."""
user_key = self._make_user_api_key_dict(
key_alias="my-prod-key", token="hashed123"
)
with patch("litellm.proxy.dd_span_tagger.set_active_span_tag") as mock_set_tag:
DDSpanTagger.tag_request(
user_api_key_dict=user_key,
requested_model="gpt-4o",
)
mock_set_tag.assert_any_call("litellm.key_alias", "my-prod-key")
mock_set_tag.assert_any_call("litellm.key_hash", "hashed123")
mock_set_tag.assert_any_call("litellm.requested_model", "gpt-4o")
def test_no_tags_when_key_absent(self):
"""No key tags are set when key_alias and token are None (e.g. 401 path)."""
user_key = self._make_user_api_key_dict(key_alias=None, token=None)
with patch("litellm.proxy.dd_span_tagger.set_active_span_tag") as mock_set_tag:
DDSpanTagger.tag_request(
user_api_key_dict=user_key,
requested_model=None,
)
mock_set_tag.assert_not_called()
def test_only_model_tagged_when_no_key_info(self):
"""requested_model is tagged even when there's no key info."""
user_key = self._make_user_api_key_dict(key_alias=None, token=None)
with patch("litellm.proxy.dd_span_tagger.set_active_span_tag") as mock_set_tag:
DDSpanTagger.tag_request(
user_api_key_dict=user_key,
requested_model="claude-3-5-sonnet",
)
mock_set_tag.assert_called_once_with(
"litellm.requested_model", "claude-3-5-sonnet"
)
class TestHasAttributeErrorInChain:
"""Tests for _has_attribute_error_in_chain helper."""
def test_direct_attribute_error(self):
exc = AttributeError("'str' object has no attribute 'get'")
assert _has_attribute_error_in_chain(exc) is True
def test_no_attribute_error(self):
exc = ValueError("some other error")
assert _has_attribute_error_in_chain(exc) is False
def test_attribute_error_in_cause(self):
inner = AttributeError("bad attribute")
outer = RuntimeError("wrapper")
outer.__cause__ = inner
assert _has_attribute_error_in_chain(outer) is True
def test_attribute_error_in_context(self):
inner = AttributeError("bad attribute")
outer = RuntimeError("wrapper")
outer.__context__ = inner
assert _has_attribute_error_in_chain(outer) is True
def test_attribute_error_in_original_exception(self):
inner = AttributeError("bad attribute")
outer = RuntimeError("wrapper")
outer.original_exception = inner # type: ignore
assert _has_attribute_error_in_chain(outer) is True
def test_attribute_error_nested_two_levels(self):
"""Simulates the real failure: AttributeError -> OpenAIException -> APIConnectionError."""
attr_err = AttributeError("'str' object has no attribute 'get'")
mid = Exception("OpenAIException wrapper")
mid.__context__ = attr_err
outer = Exception("APIConnectionError wrapper")
outer.__context__ = mid
assert _has_attribute_error_in_chain(outer) is True
def test_depth_limit_prevents_infinite_loop(self):
"""Ensure circular references don't cause infinite recursion."""
exc_a = RuntimeError("a")
exc_b = RuntimeError("b")
exc_a.__context__ = exc_b
exc_b.__context__ = exc_a # circular
assert _has_attribute_error_in_chain(exc_a) is False
@pytest.mark.asyncio
class TestHandleLLMApiExceptionDictDetail:
"""
Coverage for `_handle_llm_api_exception` HTTPException branch (Site 2).
Regression for case 2026-04-10-internal-bedrock-guardrail-streaming-error:
dict-detail HTTPExceptions raised by guardrails must round-trip cleanly
through ProxyException instead of being str()-mangled into a Python repr.
"""
async def _invoke(self, exc: Exception):
from litellm.proxy._types import ProxyException, UserAPIKeyAuth
processor = ProxyBaseLLMRequestProcessing(data={})
user_api_key_dict = UserAPIKeyAuth(api_key="sk-test")
proxy_logging_obj = MagicMock()
proxy_logging_obj.post_call_failure_hook = AsyncMock(return_value=None)
proxy_logging_obj.post_call_response_headers_hook = AsyncMock(return_value={})
try:
await processor._handle_llm_api_exception(
e=exc,
user_api_key_dict=user_api_key_dict,
proxy_logging_obj=proxy_logging_obj,
)
except ProxyException as raised:
return raised
raise AssertionError("ProxyException was not raised")
async def test_dict_detail_bedrock_shape_preserved(self):
exc = HTTPException(
status_code=400,
detail={
"error": "Violated guardrail policy",
"bedrock_guardrail_response": "...",
"guardrail_name": "bedrock-pii-guard",
},
)
proxy_exc = await self._invoke(exc)
assert proxy_exc.message == "Violated guardrail policy"
assert (
proxy_exc.provider_specific_fields["guardrail_name"] == "bedrock-pii-guard"
)
# No Python repr leakage of the dict into the message field.
assert "{'error':" not in proxy_exc.message
async def test_string_detail_unchanged(self):
exc = HTTPException(status_code=400, detail="Content blocked by guardrail")
proxy_exc = await self._invoke(exc)
assert proxy_exc.message == "Content blocked by guardrail"
assert proxy_exc.provider_specific_fields is None
async def test_not_found_error_preserves_404(self):
"""NotFoundError with status_code=404 should map to ProxyException code=404."""
from litellm.exceptions import NotFoundError
exc = NotFoundError(
message="Model gemini-3.1-flash-lite-preview not found",
model="gemini-3.1-flash-lite-preview",
llm_provider="gemini",
)
proxy_exc = await self._invoke(exc)
assert proxy_exc.code == "404"
assert "NotFoundError" in proxy_exc.message
async def test_exception_with_status_code_propagates(self):
"""Exception with a statically-set status_code should propagate it."""
from litellm.llms.vertex_ai.common_utils import VertexAIError
exc = VertexAIError(
status_code=429,
message="Rate limit exceeded",
)
proxy_exc = await self._invoke(exc)
assert proxy_exc.code == "429"
async def test_exception_without_status_code_defaults_to_500(self):
"""Exception with no status_code attribute defaults to 500."""
exc = ValueError("Something broke")
proxy_exc = await self._invoke(exc)
assert proxy_exc.code == "500"
class TestAsyncStreamingDataGeneratorFastPath:
"""Fast/slow path branching in async_streaming_data_generator."""
@staticmethod
async def _aiter(items):
for item in items:
yield item
@pytest.mark.asyncio
async def test_fast_path_skips_per_chunk_hook(self, monkeypatch):
"""With no callbacks/guardrails/cost-injection, chunks pass through
unchanged and the per-chunk hook is NOT awaited."""
monkeypatch.setattr(litellm, "callbacks", [])
ProxyLogging._callback_capabilities_cache.clear()
proxy_logging_obj = ProxyLogging(user_api_key_cache=MagicMock())
hook_spy = AsyncMock(side_effect=lambda **kw: kw["response"])
monkeypatch.setattr(
proxy_logging_obj, "async_post_call_streaming_hook", hook_spy
)
chunks = [b"event: a\ndata: {}\n\n", b"event: b\ndata: {}\n\n"]
out = [
c
async for c in ProxyBaseLLMRequestProcessing.async_streaming_data_generator(
response=self._aiter(chunks),
user_api_key_dict=MagicMock(spec=UserAPIKeyAuth),
request_data={"model": "claude-x"},
proxy_logging_obj=proxy_logging_obj,
serialize_chunk=ProxyBaseLLMRequestProcessing.return_sse_chunk,
serialize_error=lambda e: "data: error\n\n",
)
]
assert out == chunks # bytes pass through return_sse_chunk untouched
hook_spy.assert_not_awaited()
@pytest.mark.asyncio
async def test_slow_path_runs_per_chunk_hook(self, monkeypatch):
"""A callback that overrides async_post_call_streaming_hook forces the
slow path and the per-chunk hook is invoked."""
class _StreamingCb(CustomLogger):
async def async_post_call_streaming_hook(self, user_api_key_dict, response):
return response
cb = _StreamingCb()
monkeypatch.setattr(litellm, "callbacks", [cb])
ProxyLogging._callback_capabilities_cache.clear()
proxy_logging_obj = ProxyLogging(user_api_key_cache=MagicMock())
hook_spy = AsyncMock(side_effect=lambda **kw: kw["response"])
monkeypatch.setattr(
proxy_logging_obj, "async_post_call_streaming_hook", hook_spy
)
out = [
c
async for c in ProxyBaseLLMRequestProcessing.async_streaming_data_generator(
response=self._aiter([{"type": "message_stop"}]),
user_api_key_dict=MagicMock(spec=UserAPIKeyAuth),
request_data={"model": "claude-x"},
proxy_logging_obj=proxy_logging_obj,
serialize_chunk=ProxyBaseLLMRequestProcessing.return_sse_chunk,
serialize_error=lambda e: "data: error\n\n",
)
]
assert len(out) == 1
hook_spy.assert_awaited_once()
ProxyLogging._callback_capabilities_cache.clear()