39598 Commits

Author SHA1 Message Date
Mateo Wang
533eab4dbd
fix(tests/vcr): make Redis cassette cache replay deterministically (zero VCR misses on consecutive runs) (#28826)
* test(vcr): make Redis-backed cassettes replay deterministically across runs

- Pin LITELLM_LOCAL_MODEL_COST_MAP=True in the shared VCR harness so the
  per-test importlib.reload(litellm) no longer fetches the model cost map
  from raw.githubusercontent.com. That live fetch was being recorded into
  cassettes; for tests that subsequently skip it was the only recorded
  episode, so the persister refused to save it (skipped tests don't persist)
  and the test re-recorded it live every run (MISS:NOT_PERSISTED).

- Compare-time symmetric matcher tolerance for Google OAuth (ya29.*) tokens,
  observability/telemetry payloads, credential-exchange bodies, and volatile
  UUID/timestamp tokens, so existing cassettes select a recorded episode
  instead of growing past the 50-episode cap and re-recording live.

- Don't record fire-and-forget telemetry (langfuse/arize/otel/...) into
  non-telemetry tests' cassettes. Several modules set litellm.success_callback
  at import time, so observability logging is globally enabled and an async
  flush from the background logging worker lands in an unrelated test's VCR
  window, saved as a spurious MISS:RECORDED (observed: a Langfuse batch from
  another completion landing on test_lowest_latency_routing_buffer). Such a
  request now passes through live (telemetry hosts aren't real-spend hosts);
  tests that actually assert on telemetry keep recording it.

- Dedupe + cap the VCR diagnostic dump so the classification summary survives
  CircleCI's ~400KB step-output truncation.

- Stabilize a non-deterministic rate-limit test body; mark AWS Secrets Manager
  lifecycle tests VCR-incompatible (uniquely-named secrets can't be replayed).

- Mark test_router_text_completion_client VCR-incompatible: it fires 300
  identical requests to verify async-client reuse, but vcrpy patches the HTTP
  transport so replay never exercises the real connection pool the test
  validates, and recording 300 near-identical episodes overflows the
  50-episode cap (MISS:OVERFLOW every run). It hits a free mock endpoint.

- Mark the Vertex AI MaaS Mistral OCR tests (vertex_ai/mistral-ocr-2505)
  VCR-incompatible: the MaaS model is not provisioned in the CI GCP project,
  so the live :rawPredict call fails and the test skips every run, leaving no
  cassette to record (MISS:NOT_PERSISTED every run). Sibling direct-Mistral
  and Azure OCR tests are unaffected and still replay from cache.

* fix(tests/vcr): refresh cassette TTL on read so replayed cassettes don't expire

The Redis VCR persister loaded cassettes with a plain GET, which does not
touch the key's TTL. A cassette that is only ever replayed (HIT/NOOP, never
re-recorded) therefore expired exactly 24h after its last *write*, no matter
how often it was read. Whichever CI run happened to cross that boundary
re-recorded the cassette live and surfaced a spurious VCR MISS on otherwise
deterministic cassettes — the residual per-run flakiness floor (a different
random subset of read-only cassettes expiring each run).

Slide the expiry forward on every successful load (best-effort EXPIRE), so
any cassette used at least once per TTL window stays alive indefinitely and
the 2nd/3rd run of a day replays cleanly.
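
A minimal sketch of the refresh-on-read idea, not the persister's actual code. It assumes a client exposing `get` and `expire` in the shape redis-py provides; `load_cassette_bytes`, `CASSETTE_TTL_SECONDS`, and `FakeRedis` are hypothetical names used only for illustration.

```python
from typing import Dict, Optional, Protocol


class RedisLike(Protocol):
    """Subset of a Redis client used here (get/expire, as in redis-py)."""

    def get(self, name: str) -> Optional[bytes]: ...
    def expire(self, name: str, time: int) -> bool: ...


# Assumed 24h window, matching the expiry described above.
CASSETTE_TTL_SECONDS = 24 * 60 * 60


def load_cassette_bytes(client: RedisLike, key: str) -> Optional[bytes]:
    """Read a cassette and, on a hit, slide its expiry forward."""
    data = client.get(key)
    if data is None:
        return None  # absent key: nothing to refresh
    try:
        client.expire(key, CASSETTE_TTL_SECONDS)  # best-effort refresh
    except Exception:
        pass  # a failed refresh must not fail the load itself
    return data


class FakeRedis:
    """In-memory stand-in so the sketch runs without a server."""

    def __init__(self) -> None:
        self.values: Dict[str, bytes] = {}
        self.ttls: Dict[str, int] = {}

    def get(self, name: str) -> Optional[bytes]:
        return self.values.get(name)

    def expire(self, name: str, time: int) -> bool:
        if name not in self.values:
            return False
        self.ttls[name] = time
        return True
```

With this shape, a cassette that is read at least once per TTL window never reaches its expiry, while a key that is never read still ages out.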

* fix(tests/vcr): recover from spurious GET-None for existing cassette keys

Under concurrent CI load, the persister's load GET was observed returning
None for a cassette key that demonstrably existed on the (single, non-
clustered) Redis master — an external monitor saw the key present with a
healthy TTL at the same instant the in-process client read None. Because
None is a valid GET result (not a RedisError), the retry-on-error client
config never engaged, so the cassette re-recorded live (a phantom
MISS:RECORDED); for flaky/networked tests the failed live call then
triggered a pytest rerun, which is why a rotating subset of otherwise
deterministic tests missed each run.

On a None result, re-check EXISTS and re-read once. If the key really
exists, use the recovered value and log [vcr-transient-miss-recovered]
(also counted in cassette_cache_health). A genuinely absent key (a new
cassette) still falls through to CassetteNotFoundError.

* chore(tests/vcr): TEMP diagnostic for persistent-miss cassette load path

Logs GET/EXISTS at load time for the three cassettes that re-record every
run despite being present in Redis, to capture what the in-process client
sees. To be reverted before merge.

* chore(tests/vcr): write load diagnostic to Redis (truncation-proof)

CI stdout truncates to the last ~400KB, dropping the early loaddbg lines
for the alphabetically-first failing test. Push the load probe to a Redis
list instead so it survives. To be reverted before merge.

* fix(tests/vcr): don't drop stored telemetry episodes during cassette load

Root cause of the residual per-run misses on present cassettes: vcrpy's
Cassette._load() replays each *stored* interaction through Cassette.append(),
which runs before_record_request on it — and a None return there silently
drops that episode. The telemetry-leak suppressor (_should_drop_telemetry_record)
returns None for telemetry requests, so when a non-telemetry-named test (or the
alphabetically-first test in a worker, whose _current_test_nodeid is still empty)
loaded a cassette containing a Langfuse ingestion episode, the episode was
dropped on read — forcing an endless live re-record (a phantom MISS:RECORDED on
a cassette that was demonstrably present in Redis). Verified by reproducing
Cassette._load() against the real cassette: empty/non-telemetry nodeid -> 0
episodes survive; with the guard -> 1 survives.

Fix: guard the suppressor with a thread-local set around Cassette._load (via a
small idempotent monkeypatch), so the drop only ever stops *new* incidental
telemetry from being recorded and never filters the existing cassette on read.
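
The guard can be sketched roughly as follows. This is an illustration under stated assumptions: `Cassette` here is a small stand-in that only mimics the load/append shape described above (it is not vcrpy's class), and `should_drop_telemetry` and `patch_cassette_load` are hypothetical names.

```python
import threading
from typing import Callable, List, Optional

_state = threading.local()


class Cassette:
    """Stand-in for a cassette whose load path replays through append()."""

    def __init__(self, stored: List[str],
                 before_record_request: Callable[[str], Optional[str]]) -> None:
        self._stored = stored
        self._before_record_request = before_record_request
        self.data: List[str] = []

    def append(self, request: str) -> None:
        # A None return from the hook silently drops the episode.
        if self._before_record_request(request) is None:
            return
        self.data.append(request)

    def _load(self) -> None:
        for request in self._stored:
            self.append(request)


def should_drop_telemetry(request: str) -> Optional[str]:
    """Hypothetical suppressor: None drops a telemetry request."""
    if getattr(_state, "loading", False):
        return request  # replaying stored episodes: never filter
    return None if "langfuse" in request else request


def patch_cassette_load() -> None:
    """Wrap Cassette._load once so the flag is set only while loading."""
    if getattr(Cassette._load, "_guarded", False):
        return  # idempotent: already patched
    original = Cassette._load

    def guarded_load(self: Cassette) -> None:
        _state.loading = True
        try:
            original(self)
        finally:
            _state.loading = False

    guarded_load._guarded = True  # marker checked on later calls
    Cassette._load = guarded_load
```

Because the flag lives in thread-local storage, a load in one worker thread does not disable the suppressor for recordings happening in another.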

Also drops the speculative GET-None recovery + its diagnostics from the previous
commits: the load diagnostic showed GET returns the cassette bytes fine
(get=1440B), so the persister never returned a spurious None — the loss happened
later in vcrpy's append. The proven TTL-refresh-on-read fix is retained.

* fix(tests/vcr): drop incidental telemetry export POSTs to stop rotating async-flush misses

litellm's observability loggers flush on a background thread, so a Langfuse
ingestion POST scheduled by one telemetry test can fire mid-way through a
*later* telemetry-named test (after that test's own httpx mock has exited) and
be recorded by VCR as a phantom episode — a non-deterministic MISS:RECORDED /
PARTIAL that rotates onto a different telemetry test from run to run.

Telemetry export POSTs are fire-and-forget; no test asserts on a *recorded*
export response except the pass-through proxy test (which forwards a client POST
to Langfuse ingestion and replays its 207). So _should_drop_telemetry_record now
drops incidental export POSTs for every test except that one. Dropping returns
None (live fire-and-forget, never stored), so it can only turn a phantom miss
into a harmless live call, never the reverse; recorded read-back GETs that
telemetry tests assert on are matched by method and left untouched.

* fix(tests/vcr): restore assertion in test_banner_silent_when_vcr_disabled

The assertion that the banner is suppressed when VCR is disabled was
inadvertently moved into test_diagnostic_log_silent_when_no_dir when
the diagnostic-log tests were added, leaving the disabled-VCR test
verifying nothing.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
2026-05-26 11:30:44 -07:00
ryan-crabbe-berri
f75a7c6b22
fix(model-edit): allow clearing custom pricing on wildcard models (#28719)
* fix(model-edit): allow clearing custom input/output cost on wildcard deployments

A user-set pricing override on a `/model/*` wildcard deployment could not
be removed: clearing the Input/Output Cost fields in the UI succeeded
visually, but the next read still showed the old values because both
`litellm_params` and `model_info` (mirrored via `SPECIAL_MODEL_INFO_PARAMS`)
retained the original rates.

UI: when the pricing field is touched but left empty, send `null` instead
of dropping it from the payload so the backend sees the clear intent. The
cache-read-cost fallback now guards against `null` as well as `undefined`
so a cleared input cost cannot silently wipe the cache-read override.

Backend: `update_db_model` honors explicit-null clears, but ONLY for
`SPECIAL_MODEL_INFO_PARAMS` (the 4 pricing fields). Restricting the
null-clear path prevents a team-scoped caller from using this codepath to
null out privileged fields like `team_id` or access groups.

Tests cover both clear paths (`litellm_params` and `model_info`), the
SPECIAL_MODEL_INFO_PARAMS mirror, PATCH semantics for omitted fields, and
the security guard that non-pricing nulls don't reach the merged dict.

Resolves LIT-3250

* fix(model-edit): run null-clears after both merges, not interleaved

The previous version cleared `model_info` from inside the litellm_params
merge block, but the subsequent `model_info.update(...)` re-injected the
old pricing because the UI's PATCH carries the full model_info blob with
the stale values still in it. Move the explicit-null clear pass to after
both merges so a model_info passthrough cannot resurrect cleared fields.
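
A rough sketch of the ordering this commit describes, assuming a simplified two-blob model record. `merge_patch` and `PRICING_FIELDS` are hypothetical; the real allowlist is `SPECIAL_MODEL_INFO_PARAMS`, whose exact contents are not shown here.

```python
from typing import Any, Dict

# Assumed allowlist for illustration only.
PRICING_FIELDS = {"input_cost_per_token", "output_cost_per_token"}

Blobs = Dict[str, Dict[str, Any]]


def merge_patch(stored: Blobs, patch: Blobs) -> Blobs:
    merged: Blobs = {
        blob: dict(stored.get(blob, {}))
        for blob in ("litellm_params", "model_info")
    }
    # Pass 1: ordinary PATCH merge; None is treated like an omitted field.
    for blob in merged:
        for key, value in patch.get(blob, {}).items():
            if value is not None:
                merged[blob][key] = value
    # Pass 2: explicit-null clears, allowlisted fields only, run after both
    # merges so a stale value carried in model_info cannot come back.
    cleared = {
        key
        for blob in merged
        for key, value in patch.get(blob, {}).items()
        if value is None and key in PRICING_FIELDS
    }
    for blob in merged:
        for key in cleared:
            merged[blob].pop(key, None)
    return merged
```

A null on a non-allowlisted key such as `team_id` never enters `cleared`, so the stored value survives.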

Adds a regression test for the realistic UI submit shape (both blobs in
the patch, model_info still holding the old pricing).

* test(e2e): clear-custom-pricing flow with create/delete cleanup

Covers the dashboard model edit form's pricing-clear flow end-to-end:
seeds a deployment with custom input/output pricing, drives the UI to
clear both fields, asserts the outgoing PATCH sends explicit nulls,
and confirms via /v2/model/info that the override is gone from both
litellm_params and model_info.

The dashboard DB persists across this suite, so beforeEach creates a
uniquely-named deployment and afterEach POSTs /model/delete to leave
the DB clean regardless of test outcome.

* fix(model-edit): extend pricing clear to cache_read and cache_write costs

Pre-existing parallel of the wildcard input/output cost bug: cleared
cache_read_input_token_cost and cache_creation_input_token_cost overrides
silently persisted because the UI omitted the key (delete or fallback) and
the backend null-clear allowlist did not cover them.

- types/router.py: add cache_read_input_token_cost and
  cache_creation_input_token_cost to SPECIAL_MODEL_INFO_PARAMS, so they are
  mirrored between litellm_params and model_info by Deployment.__init__ and
  honoured by the null-clear loop in update_db_model.
- model_info_view.tsx: emit explicit null for touched-but-empty cache_read
  and cache_write fields. Preserve the input_cost->cache_read mirror only
  when cache_read itself was not touched.
- model_management_endpoints.py: update the allowlist comment.
- Tests: three new unit tests for cache clear paths and a preserve check;
  the e2e spec now seeds, clears, and asserts null PATCH + key-absence for
  all four pricing fields.
2026-05-26 09:37:23 -07:00
ryan-crabbe-berri
a8263cbc88
fix(ui): route API Reference back to query-param page (#28726)
* fix(ui): route API Reference back to query-param page

The path-based /ui/api-reference route was broken in practice — the
page-local useProxySettings hook didn't match what the root page passes
down. Remove api_ref from the migration maps (LEGACY_REDIRECTS in
app/page.tsx, MIGRATED_PAGES in leftnav.tsx and (dashboard)/layout.tsx),
point the leftnav item back at page="api_ref", and restore the api_ref
render branch in the root page. The path-based page.tsx and the
useProxySettings hook stay in place unchanged; only api_ref is moved
back to query-param routing while the migration infrastructure is
preserved for future page moves.

* fix(ui): alias ?page=api-reference to api_ref branch

Handles bookmarks of the hyphen-form query param that was live during
the brief path-based migration window, so they render the working
APIReferenceView instead of falling through to the default page.
2026-05-26 09:27:30 -07:00
Mateo Wang
96a2e8b16d
fix(azure): preserve AD token refresh in v1 OpenAI client path (#28627)
* fix(azure): preserve AD token refresh in v1 OpenAI client path

The /openai/v1/ code path (api_version in {"v1", "latest", "preview"})
constructs a plain OpenAI/AsyncOpenAI client, but only forwarded
`api_key` from `azure_client_params`. When `enable_azure_ad_token_refresh`
is set (or any AD-only auth), `api_key` is None and the client
constructor raised "The api_key client option must be set...", breaking
every Azure call with a v1 api_version.

The OpenAI SDK (>=2.20.0) accepts a callable for `api_key` and re-invokes
it on every request via `_refresh_api_key`, so we now forward
`azure_ad_token_provider` directly — preserving the per-request token
refresh behavior of the regular AzureOpenAI client and avoiding the
expiry hole that resolving the token once at client-creation time would
introduce. Static `azure_ad_token` strings fall through to `api_key`.

For the async path we wrap the sync provider returned by azure-identity
in an async function since AsyncOpenAI expects `Callable[[], Awaitable[str]]`.
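
One way such a wrapper might look, including the thread offload that a later commit in this PR adds. `to_async_provider` is a hypothetical name; the only library call used is the standard `asyncio.to_thread`.

```python
import asyncio
from typing import Awaitable, Callable


def to_async_provider(
    sync_provider: Callable[[], str],
) -> Callable[[], Awaitable[str]]:
    """Adapt a blocking token provider to an awaitable one."""

    async def provider() -> str:
        # Run the blocking call in a worker thread so a slow token
        # fetch does not stall the event loop.
        return await asyncio.to_thread(sync_provider)

    return provider
```

The wrapper calls the sync provider on every invocation rather than caching its result, which keeps the per-request refresh behavior described above.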

Fixes #27945

https://claude.ai/code/session_01UnzrDSFUUgp5T2wRoPMxq5

* fix(azure): offload sync token provider to thread in v1 async wrapper

* fix(azure): include AD credential identity in v1 client cache key

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-05-25 21:08:52 -07:00
Mateo Wang
48d7e15b83
chore(admin-ui): regenerate static export with trailingSlash: true (#28112)
* chore(admin-ui): regenerate static export with trailingSlash: true

Rebuilds litellm/proxy/_experimental/out/ from ui/litellm-dashboard with
`trailingSlash: true` enabled in next.config.mjs. Next.js now emits every
route as <dir>/index.html (e.g. mcp/oauth/callback/index.html) instead of
<dir>.html with a sibling metadata-only directory, which fixes the 404 on
extensionless URLs served through FastAPI's StaticFiles(html=True) mount.

This is the build artifact half of the fix; the config change, Dockerfile
cleanup, and regression test live in the follow-up source PR that stacks
on top of this branch.

* fix(admin-ui): emit nested routes as <dir>/index.html (#28106)

Linear and other OAuth providers redirect the user back to
/ui/mcp/oauth/callback?code=...&state=... after the consent step. The
packaged Next.js static export only produced /ui/mcp/oauth/callback.html,
so FastAPI's StaticFiles served a 404 on the extensionless URL and the
OAuth handshake never completed.

The Dockerfile.non_root build step tried to paper over this at image-build
time with `for html_file in *.html; do ...`, but that shell glob does not
recurse, so nested routes like mcp/oauth/callback.html were left stranded
next to an empty mcp/oauth/callback/ directory containing only Next.js
metadata. The runtime restructure step in proxy_server.py was then skipped
because the .litellm_ui_ready marker had already been dropped.

Set trailingSlash: true in the dashboard's Next.js config so the export
emits every nested route as <dir>/index.html natively. The Dockerfile loop
is now a no-op for the bundled UI and has been removed; the
.litellm_ui_ready marker is still written so the proxy keeps skipping the
redundant Python restructure step at startup. Stacks on top of the static
export regeneration in the parent branch.

* chore: restore origin/litellm_internal_staging out files
2026-05-25 21:06:50 -07:00
Mateo Wang
c23b19f09c
feat(openai): apply regional-processing cost uplift for EU/US data residency (#28626)
* feat(openai): apply regional-processing cost uplift for EU/US data residency

OpenAI charges a 10% uplift on the latest GPT models when requests are
served from a regionalized hostname (eu./us.api.openai.com). Infer the
region from `api_base`, expose it on `kwargs["litellm_params"]["data_residency"]`,
and multiply the computed cost by a per-model
`regional_processing_uplift_multiplier_<region>` field.
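
A minimal sketch of the inference and multiplication steps, assuming a plain dict for the per-model fields. `infer_region` and `apply_uplift` are hypothetical helpers, and the provider guard reflects the tightening made by later commits in this PR.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlparse

# Hypothetical host-to-region table built from the hostnames named above.
_REGION_BY_HOST = {"eu.api.openai.com": "eu", "us.api.openai.com": "us"}


def infer_region(api_base: Optional[str],
                 provider: Optional[str]) -> Optional[str]:
    """Return 'eu'/'us' for a regional OpenAI host, else None."""
    if provider != "openai" or not api_base:
        return None
    host = urlparse(api_base).hostname
    return _REGION_BY_HOST.get(host or "")


def apply_uplift(base_cost: float, model_info: Dict[str, Any],
                 region: Optional[str]) -> float:
    """Multiply the cost by the per-model regional multiplier, if any."""
    if region is None:
        return base_cost
    key = f"regional_processing_uplift_multiplier_{region}"
    # Models without the field are left at their base cost.
    return base_cost * model_info.get(key, 1.0)
```

For example, a model entry carrying `regional_processing_uplift_multiplier_eu: 1.1` would price a request through `eu.api.openai.com` at 1.1 times the base cost, while the same request through `api.openai.com` is unchanged.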

https://claude.ai/code/session_012ebH44s7ohYxjoix5CXzTW

* test: allow regional_processing_uplift_multiplier_{eu,us} in model_prices schema

* fix(cost): tighten data_residency inference and restore model_cost in tests

- Only infer OpenAI data_residency when custom_llm_provider == "openai";
  drop the implicit None fallback so non-OpenAI callers can't accidentally
  pick up a regional tag from a stray OpenAI hostname.
- _local_model_cost_map fixture now snapshots and restores
  litellm.model_cost and LITELLM_LOCAL_MODEL_COST_MAP so tests don't leak
  state across the session.

* refactor(openai): move data_residency helper under llms/openai

* fix: thread data_residency through realtime stream cost calculation

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(cost): thread data_residency through batch_cost_calculator

Apply the OpenAI regional-processing uplift multiplier to retrieve_batch
cost paths so Batch API requests served via eu./us.api.openai.com are
priced at the same uplifted token rates as completions/transcriptions.

* refactor(openai): encapsulate provider check inside infer_openai_data_residency

Move the custom_llm_provider == "openai" guard from get_litellm_params
into the helper itself so the core utility no longer carries
provider-specific dispatch logic. Callers pass through the provider
unconditionally; the helper returns None for any non-OpenAI provider.

* fix(responses): thread data_residency through Responses logging params

The Responses API paths build their logging litellm_params dict after
provider resolution but did not include data_residency, so cost calc
saw None even when the effective api_base was a regional OpenAI host.

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
2026-05-25 20:36:14 -07:00
yuneng-jiang
f38c16c71e
test(proxy): add harness for proxy_server.py behavior-pinning (#28827)
* test(proxy): add harness for proxy_server.py behavior-pinning

Creates tests/test_litellm/proxy/proxy_server/ with:
- conftest.py: 11 shared fixtures (app, client, mock_prisma, auth_as,
  mock_router with parametrized response builders, normalize, etc.)
- _coverage_check.py: per-PR coverage gate (line + branch) against a
  baseline, self-selects target by inspecting which placeholder files
  have been filled
- _pin_check.py: AST-based gate that verifies every pin-list item has
  >=1 happy + >=1 error test with a real assertion (no status-only)
- test_harness_smoke.py: 19 smoke tests covering every fixture +
  both scripts end-to-end
- 26 placeholder test files (one docstring each) reserved for
  follow-up PRs per the directory ownership in the Notion plan
- .coverage_baseline pinned at 0% so future PRs measure deltas
  against new-tests-only and aren't entangled with the broader
  scattered test suite

Adds a dedicated proxy-server job to test-unit-proxy-endpoints.yml
so this directory's runtime + coverage are tracked independently.

Plan: https://www.notion.so/36c43b8acdab81ee845fd5365128a2fc

* ci(proxy-endpoints): allow workflow_dispatch

Lets the workflow be triggered manually on a branch via
`gh workflow run`, which is needed for the verify-first
flow on workflow changes before opening a PR.

* test(proxy): address review feedback on proxy_server harness

- conftest.py: anchor sys.path insert to __file__ (Path(__file__).resolve().parents[4])
  instead of CWD-relative os.path.abspath("../../../../") which resolved
  to the wrong directory when pytest is launched from the repo root.
- _coverage_check.py: actually read .coverage_baseline and use it as
  the floor (line_min = max(target, baseline)). Closes the gap between
  the PR description's "delta semantics" and what the script was doing.
  With baseline=0.0 today this is a no-op; once a future PR raises the
  baseline, regressions (test deletions etc.) trip the gate even if the
  static PR target is still met.
- _pin_check.py: drop unreachable startswith("_") guard
  (test_*.py glob never yields underscore-prefixed names) and read
  each test file once instead of twice.
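
The baseline-as-floor rule from the `_coverage_check.py` item can be sketched as below. `read_baseline` and `passes_gate` are hypothetical names, and treating a missing or unparseable baseline file as 0.0 is an assumption of this sketch, not something the commit states.

```python
from pathlib import Path


def read_baseline(path: Path) -> float:
    """Return the pinned baseline percentage (assumed 0.0 if unreadable)."""
    try:
        return float(path.read_text().strip())
    except (OSError, ValueError):
        return 0.0


def passes_gate(measured: float, target: float, baseline: float) -> bool:
    """The floor is whichever is higher: the PR target or the baseline."""
    line_min = max(target, baseline)
    return measured >= line_min
```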
2026-05-25 20:26:44 -07:00
ishaan-berri
48dd71b818
ci: add daily oss-agent-shin branch creation workflow (#28829)
Creates litellm_oss_agent_shin_MM_DD_YYYY from main every day at 00:00 UTC.
Lets us retarget oss-agent-shin fork PRs onto a canonical branch so CircleCI runs with secrets, without granting the agent write access.

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>
2026-05-25 20:04:40 -07:00
yuneng-jiang
1b6788e0c4
fix(proxy): hydrate wildcard discovery credentials (#28284) (#28822)
* fix(proxy): hydrate wildcard discovery credentials

* fix(proxy): constrain wildcard credential hydration

Co-authored-by: Dibyo Mukherjee <dibyo@adobe.com>
2026-05-25 19:23:35 -07:00
yuneng-jiang
7cd98508e7
fix(team): keep team_alias cache in sync on _cache_team_object writes (#28737)
* fix(team): keep team_alias cache in sync on _cache_team_object writes

_cache_team_object wrote only to the team_id:<id> cache key, but the
JWT auth path that uses team_alias_jwt_field reads from a separate
team_alias:<alias> key (get_team_object_by_alias caches under both
keys on miss, but reads only the alias-keyed one). After any
team-mutation endpoint (team_model_add, team_model_delete,
update_team, the two access-group writes) the team_id cache was
refreshed but the team_alias cache stayed stale until TTL — JWT
callers using team_alias_jwt_field kept seeing the pre-mutation
team for the full cache window.

Mirror the write under the alias key inside _cache_team_object so
every existing caller stays in sync without further changes. Skip
the alias write when team_alias is None/empty so we don't collide
across alias-less teams.

Surfaced while testing the LIT-3244 cherry-pick on patch/1.86.0: the
LIT-3244 fix correctly invalidated the team_id cache but the
customer's JWT used team_alias_jwt_field, so they kept hitting the
stale alias-keyed entry.

* fix(team): delete (not overwrite) team_alias cache on _cache_team_object

The prior shape of this PR wrote both team_id:<id> AND team_alias:<alias>
from _cache_team_object. team_alias is NOT unique in the schema
(no @unique on LiteLLM_TeamTable.team_alias), and get_team_object_by_alias
enforces uniqueness on its own DB-fetch path (len(teams) > 1 raises).
Writing the alias-keyed cache from the generic refresh path bypassed
that check: a team admin renaming their team to collide with another
team's alias could silently overwrite the cached team for JWT-by-alias
auth, swapping the resolved team under that alias for the cache window.

Switch the alias-keyed operation from a write to a delete (mirroring
the dual-cache delete pattern in _delete_cache_key_object). After every
team write, the next JWT-by-alias reader cache-misses and falls through
to get_team_object_by_alias, which (a) re-fetches the fresh team from
DB, closing the LIT-3244 staleness gap that motivated this PR, and
(b) enforces alias uniqueness before populating either cache key.

team_id:<id> writes are unchanged — team_id is the table PK and is
guaranteed unique.
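
A small sketch of the write-by-id, delete-by-alias pattern, using a plain dict in place of the proxy cache. `cache_team_object` here is a simplified, hypothetical stand-in for `_cache_team_object`, whose real signature is not shown above.

```python
from typing import Any, Dict, Optional


def cache_team_object(cache: Dict[str, Any], team_id: str,
                      team_alias: Optional[str], team: Any) -> None:
    # team_id is unique, so overwriting its entry is safe.
    cache[f"team_id:{team_id}"] = team
    # team_alias is not unique: invalidate instead of writing, so the
    # next alias lookup re-fetches and re-checks uniqueness.
    if team_alias:
        cache.pop(f"team_alias:{team_alias}", None)
```

Skipping the delete when the alias is None or empty mirrors the earlier commit's rule for alias-less teams.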

Surfaced in veria-ai review on #28739.

* fix(managed-files): anchor model_id regex so it doesn't match llm_output_file_model_id

extract_model_id_from_unified_id used `re.search(r"model_id,([^;]+)", ...)`
which substring-matches the `model_id,` inside the file-ID encoding's
`llm_output_file_model_id,<deployment_uuid>` field. parse_unified_id
then fed that deployment UUID back into the auth path as a model
candidate via _extract_models_from_managed_resource_id, and every
team-BYOK file attach 403'd with:

    team not allowed to access model. This team can only access
    models=['openai/*']. Tried to access <deployment-uuid>

The team's models list correctly contains the public name (`openai/*`)
that target_model_names matches, but the bogus UUID candidate fails
the wildcard check first.

Anchor the regex to a field boundary (`(?:^|;)model_id,`) so it
matches the legitimate top-level `model_id,<value>` field on
vector_store unified IDs and skips substring matches inside other
fields. File-IDs (which have no top-level `model_id` field) now
return None and contribute no spurious UUID candidate.
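
The difference between the two patterns can be seen on sample strings. The two regexes are the ones quoted above; the decoded ID strings and the `extract_model_id` helper are illustrative assumptions, not the real encoding or function.

```python
import re
from typing import Optional, Pattern

UNANCHORED = re.compile(r"model_id,([^;]+)")
ANCHORED = re.compile(r"(?:^|;)model_id,([^;]+)")


def extract_model_id(decoded: str,
                     pattern: Pattern[str] = ANCHORED) -> Optional[str]:
    """Return the value of the matched model_id field, or None."""
    match = pattern.search(decoded)
    return match.group(1) if match else None


# Made-up decoded IDs, shaped only to show the field-boundary issue.
file_id = "unified_id,abc;llm_output_file_model_id,dep-uuid-123"
vector_store_id = "unified_id,xyz;model_id,gpt-4o"
```

On `file_id`, the unanchored pattern matches inside `llm_output_file_model_id` and yields the deployment UUID, while the anchored pattern finds no top-level field; on `vector_store_id`, both patterns yield the model name.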

Surfaced while reproducing LIT-3244 on patch/1.86.0 with the customer's
exact flow: team with openai/* BYOK deployment, JWT-scoped user,
POST /v1/vector_stores/{id}/files attaching a file uploaded with
target_model_names=openai/gpt-4o.
2026-05-25 19:16:36 -07:00
yuneng-jiang
0d5040fc06
chore(ci): merge dev branch (#28807)
* chore(proxy): route path-dependent call sites through get_request_route

Replace direct ``request.url.path`` reads in auth, ACL, routing, and
audit-log decisions with ``get_request_route(request)`` — the helper
already added in ``auth/auth_utils.py`` that returns the ASGI
``scope["path"]`` with ``root_path`` stripped. Starlette reconstructs
``url.path`` from the Host header; ``scope["path"]`` is uvicorn's
parse of the request line and matches what FastAPI dispatches on, so
it's the authoritative route for any decision that should agree with
the actual handler.
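
As an illustration of reading the route from the ASGI scope, the sketch below works on a plain scope mapping. `request_route` is a hypothetical function, not the `get_request_route` helper itself, and the segment-boundary check on `root_path` is a choice made for this sketch.

```python
from typing import Any, Mapping


def request_route(scope: Mapping[str, Any]) -> str:
    """Return scope['path'] with a leading root_path removed, if present."""
    path = scope.get("path", "")
    root_path = scope.get("root_path", "")
    if root_path:
        if path == root_path:
            return "/"
        # Strip only on a segment boundary so '/proxyfoo' is left alone
        # when root_path is '/proxy'.
        if path.startswith(root_path + "/"):
            return path[len(root_path):]
    return path
```

Nothing in this function consults request headers, which is the property the commit relies on for auth and ACL decisions.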

Sites:
- _experimental/mcp_server/auth/user_api_key_auth_mcp.py
- management_endpoints/mcp_management_endpoints.py
- vector_store_endpoints/utils.py
- pass_through_endpoints/pass_through_endpoints.py
- auth/route_checks.py
- litellm_pre_call_utils.py
- spend_tracking/spend_management_endpoints.py
- common_utils/http_parsing_utils.py
- management_helpers/utils.py
- health_endpoints/_health_endpoints.py

Adds regression tests in tests/proxy_unit_tests/test_proxy_routes.py
that construct a Request with scope["path"] set to a benign route and
the Host header crafted so url.path would resolve differently; each
site's decision is asserted against scope["path"].

* chore(proxy): make get_request_route imports lazy at call sites

Move the ``from litellm.proxy.auth.auth_utils import get_request_route``
imports added in the prior commit back to the function bodies that use
them. The module-level form participates in a long-standing import
cycle through ``auth_utils -> _types -> ...`` and was flagged by CodeQL
on the PR; the lazy form matches the pattern the proxy already uses
for ``user_api_key_auth`` and related helpers elsewhere in these files.

Also drop the ``RouteChecks._is_assistants_api_request`` delegation in
``_get_metadata_variable_name`` introduced in the prior commit — the
delegation pulled ``RouteChecks`` into the same cycle, and the call
site reuses the resolved route for its other branches, so inlining
the substring check is both cycle-free and avoids a redundant second
``get_request_route`` call.

Comment in test_proxy_routes.py acknowledges that the two MCP table
entries exercise ``get_request_route`` directly rather than the full
production handler (which needs ASGI scope + MCP state to invoke).

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: user <70670632+stuxf@users.noreply.github.com>
2026-05-25 18:58:49 -07:00
ryan-crabbe-berri
66f10bceea
feat(proxy): allow llm_api_routes virtual keys to list MCP servers (#28442)
* feat(proxy): allow llm_api_routes virtual keys to list MCP servers

Add a new `mcp_discovery_routes` group (GET /v1/mcp/server and GET
/v1/mcp/server/{server_id}) and include it in `llm_api_routes` so that
virtual keys configured with `allowed_routes=["llm_api_routes"]` can
discover the MCP servers they have access to. Previously these calls
failed with 'Virtual key is not allowed to call this route. Only allowed
to call routes: [llm_api_routes]'.

The GET handlers already sanitize the response for restricted virtual
keys via `_sanitize_mcp_server_list_for_virtual_key`, stripping
credential-bearing fields (url, headers, env). Write methods
(POST/PUT/DELETE) on the same paths remain gated by the existing
handler-level admin role checks.

The new discovery list is intentionally kept OUT of
`mcp_inference_routes`, so `is_llm_api_route()` still returns False
for these paths — this preserves the existing contract that
DISABLE_LLM_API_ENDPOINTS must not block the Admin UI from listing MCP
servers.

Co-authored-by: ryan-crabbe-berri <ryan-crabbe-berri@users.noreply.github.com>

* refactor(proxy): make MCP discovery carve-out method-aware

Replace the `mcp_discovery_routes` group in `llm_api_routes` with a
method-aware special case inside `is_virtual_key_allowed_to_call_route`.
Virtual keys with allowed_routes=["llm_api_routes"] are now permitted
to call only GET /v1/mcp/server and GET /v1/mcp/server/{server_id} —
non-GET methods and multi-segment admin sub-paths fall through to the
existing 403. This keeps the general llm_api_routes list free of
management paths and avoids accidentally exposing POST/PUT/DELETE
writes through the route-check layer.
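
The carve-out amounts to a check on both method and path shape. A hedged sketch, with `is_mcp_discovery_call` as a hypothetical name and the regex as one possible way to express "list or single server, nothing deeper":

```python
import re

# Matches /v1/mcp/server and /v1/mcp/server/<one segment>, nothing deeper.
_MCP_DISCOVERY = re.compile(r"^/v1/mcp/server(?:/[^/]+)?$")


def is_mcp_discovery_call(method: str, route: str) -> bool:
    """True only for GET on the list path or a single server id."""
    return method.upper() == "GET" and _MCP_DISCOVERY.match(route) is not None
```

Any request for which this returns False would fall through to the existing 403 path.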

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: ryan-crabbe-berri <ryan-crabbe-berri@users.noreply.github.com>
2026-05-25 16:34:18 -07:00
ryan-crabbe-berri
c127968dfb
fix(ui): show 2-decimal precision for max_budget on key overview (#28809)
The Key Info Overview tab's Spend card truncated sub-dollar budgets to
"$0" because formatNumberWithCommas defaults to 0 decimals. The Settings
tab passes 2; align the overview so a $0.10 budget renders as "$0.10".

Resolves LIT-2845
2026-05-25 16:33:48 -07:00
yuneng-jiang
d98ada8c3f
chore(ci): merge dev branch (#28657)
* feat(dashboard): navbar hierarchy + Agent Platform notifications (#27543)

* feat(dashboard): refine navbar zones and Agent Platform notice

Restructure the admin navbar for production users: clear product vs community
vs personal columns with vertical dividers, icon-only Slack/GitHub in a
shared chip, and Docs/Blog typography aligned on an 8px rhythm.

Add a notifications bell with popover linking to the LiteLLM Agent Platform
repo and optional mark-as-read persistence.

Promote the account control with initials avatar, single-line display name,
and navDisplayName mapping for placeholder user ids (e.g. default_user_id).

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(dashboard): address PR review — AntD buttons, public page guard, dedupe regex

- Replace raw <button> with AntD Button in BlogDropdown, NotificationsBell, UserDropdown, and test mock
- Guard NotificationsBell + container behind !isPublicPage to avoid rendering on public pages
- Remove redundant equality checks in navDisplayName (regex already covers them)
- Remove unused `lower` variable after simplification

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>

* fix(dashboard): drop dead useHealthReadiness import in navbar

The module was removed in #27896 (replaced by useHealthReadinessDetails),
but the import survived the rebase. The symbol is unused — only
useHealthReadinessDetails is consumed in the file. Removing the dead
import unblocks the UI TypeScript build.

* fix(dashboard): align CommunityEngagementButtons test with icon-only aria-labels

The component was refactored to an icon-only chip with aria-label='LiteLLM
on GitHub' (squash #27543), but the test still asserted /star us on
github/i. Update the query to match the rendered accessible name.

* refactor(dashboard): drop unused props from NavbarProps

The navbar refactor moved user identity + dark-mode state to internal
hooks (useAuthorized, useWorker), but the NavbarProps interface still
declared userID, userEmail, userRole, premiumUser, isDarkMode, and
toggleDarkMode as required, forcing every caller to thread them through.

Drop them from the interface and all four call sites (page.tsx,
(dashboard)/layout.tsx, public_model_hub.tsx, navbar.test.tsx). Also
shrinks the destructure in layout.tsx so the now-unused locals stop
being pulled out of useAuthorized().

* refactor(dashboard): use useSyncExternalStore for NotificationsBell dismiss flag

Reads/writes of the litellmHideAgentPlatformBanner key were done
directly inside NotificationsBell via a useEffect + useState pair.
Every other localStorage-backed flag in the dashboard
(DisableShowPrompts, DisableBouncingIcon, DisableShowNewBadge,
DisableUsageIndicator, DisableBlogPosts) is wrapped in a
useSyncExternalStore hook over localStorageUtils so all mounted
components stay in sync.

Extract useHideAgentPlatformBanner to follow the same shape, swap
NotificationsBell to consume it, and add a regression test that
two sibling bells stay in sync without a remount when one is
dismissed.

* refactor: mask credential fields in proxy settings GET responses (#28682)

* refactor: mask credential fields in proxy settings GET responses

Brings SSO settings, cache settings, and the email/Slack alerting view in
/get/config/callbacks in line with the HashiCorp Vault config-override
pattern, so persisted credentials are not transported back to the UI in
plaintext.

* refactor: harden short-value masking and hoist alerting var constant

Closes two review observations:

- mask_sensitive_keys now replaces short values (below the visible
  prefix+suffix length) with an all-mask string instead of returning them
  unchanged, so a 1-7 character credential is no longer round-tripped
  verbatim.
- _ALERTING_SENSITIVE_VARS is moved out of get_config() to a module-level
  constant, matching the analogous _SSO_SENSITIVE_FIELDS and
  _CACHE_SENSITIVE_FIELDS in the SSO and cache endpoint files.
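
The short-value rule above can be sketched as follows. This is a minimal
illustration, not the body of mask_sensitive_keys: the function name
mask_value, its parameters, and the 4+4 visible window are assumptions
(the window is inferred from the "1-7 character" wording), and masking
the exact-boundary length is an extra choice made here so that no value
is ever returned unchanged.

```python
def mask_value(
    value: str,
    visible_prefix: int = 4,  # assumed window size, not taken from the source
    visible_suffix: int = 4,  # assumed window size, not taken from the source
    mask_char: str = "*",
) -> str:
    """Mask the middle of a credential; fully mask values too short to show safely."""
    # A value no longer than prefix+suffix would otherwise be shown in full,
    # so replace it with an all-mask string of the same length.
    if len(value) <= visible_prefix + visible_suffix:
        return mask_char * len(value)
    hidden = len(value) - visible_prefix - visible_suffix
    # Slice the suffix by index so visible_suffix=0 does not return the whole value.
    suffix = value[len(value) - visible_suffix:]
    return value[:visible_prefix] + mask_char * hidden + suffix
```

With these defaults a 3-character secret comes back as three mask
characters, while a longer key keeps only its first and last four
characters visible.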

---------

Co-authored-by: Krrish Dholakia <krrish+github@berri.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-25 13:48:47 -07:00
yuneng-jiang
5f75be5c1c
chore(ci): merge dev branch (#28801)
* chore(proxy): route path-dependent call sites through get_request_route

Replace direct ``request.url.path`` reads in auth, ACL, routing, and
audit-log decisions with ``get_request_route(request)`` — the helper
already added in ``auth/auth_utils.py`` that returns the ASGI
``scope["path"]`` with ``root_path`` stripped. Starlette reconstructs
``url.path`` from the Host header; ``scope["path"]`` is uvicorn's
parse of the request line and matches what FastAPI dispatches on, so
it's the authoritative route for any decision that should agree with
the actual handler.

Sites:
- _experimental/mcp_server/auth/user_api_key_auth_mcp.py
- management_endpoints/mcp_management_endpoints.py
- vector_store_endpoints/utils.py
- pass_through_endpoints/pass_through_endpoints.py
- auth/route_checks.py
- litellm_pre_call_utils.py
- spend_tracking/spend_management_endpoints.py
- common_utils/http_parsing_utils.py
- management_helpers/utils.py
- health_endpoints/_health_endpoints.py

Adds regression tests in tests/proxy_unit_tests/test_proxy_routes.py
that construct a Request with scope["path"] set to a benign route and
the Host header crafted so url.path would resolve differently; each
site's decision is asserted against scope["path"].
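
A rough sketch of the idea, assuming only what the message states (the
route is the ASGI scope["path"] with root_path stripped). The helper
name route_from_scope and its edge-case handling are hypothetical and
are not the real get_request_route implementation.

```python
from typing import Any, Dict


def route_from_scope(scope: Dict[str, Any]) -> str:
    """Return scope["path"] with a leading root_path removed (illustrative only)."""
    path = scope.get("path", "") or ""
    root_path = (scope.get("root_path", "") or "").rstrip("/")
    # Strip root_path only on a path-segment boundary, so "/proxyfoo" is not
    # mistaken for a route mounted under "/proxy".
    if root_path and (path == root_path or path.startswith(root_path + "/")):
        path = path[len(root_path):]
    return path or "/"
```

Because the function reads only the scope dict, a crafted Host header
has no input into the result.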

* chore(proxy): make get_request_route imports lazy at call sites

Move the ``from litellm.proxy.auth.auth_utils import get_request_route``
imports added in the prior commit back to the function bodies that use
them. The module-level form participates in a long-standing import
cycle through ``auth_utils -> _types -> ...`` and was flagged by CodeQL
on the PR; the lazy form matches the pattern the proxy already uses
for ``user_api_key_auth`` and related helpers elsewhere in these files.

Also drop the ``RouteChecks._is_assistants_api_request`` delegation in
``_get_metadata_variable_name`` introduced in the prior commit — the
delegation pulled ``RouteChecks`` into the same cycle, and the call
site reuses the resolved route for its other branches, so inlining
the substring check is both cycle-free and avoids a redundant second
``get_request_route`` call.

Comment in test_proxy_routes.py acknowledges that the two MCP table
entries exercise ``get_request_route`` directly rather than the full
production handler (which needs ASGI scope + MCP state to invoke).

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: user <70670632+stuxf@users.noreply.github.com>
2026-05-25 13:44:49 -07:00
Yassin Kortam
30551de371
fix(otel): export SERVER span on management-endpoint success without http_request (#28794)
Co-authored-by: Yassin Kortam <yassinkortam@Yassins-MacBook-Pro.local>
2026-05-25 12:13:17 -07:00
Mateo Wang
f9407bc036
chore(tests): migrate Bedrock CI to AWS account 941277531214 (#28728)
* chore(tests): migrate Bedrock CI from AWS account 888602223428 to 941277531214

The original account (888602223428) was put under a security restriction by
AWS after a root access key leaked in a PR comment. While that account works
its way through the AWS Support unlock process, Bedrock-touching CI tests have
been migrated to a fresh account (941277531214).

Changes:
  - Replace 26 hardcoded references to 888602223428 with 941277531214 across
    8 files (provisioned-model ARNs, imported-model ARNs, AgentCore runtime
    ARNs, batch execution role ARN, and example proxy config).
  - The provisioned-model and imported-model ARNs are referenced only from
    mocked unit tests — no AWS resources to recreate.
  - The batch execution IAM role has been recreated in the new account with
    the same name and equivalent permissions.
  - The two AgentCore runtimes (hosted_agent_r9jvp-3ySZuRHjLC,
    hosted_agent_13sf6-cALnp38iZD) are being recreated in the new account
    under the same names — see tools/agentcore-deploy/ in a follow-up.

CircleCI env vars AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_REGION_NAME
were updated separately via the CircleCI API to point at the new account.

Smoke-tested locally against the new account:
  aws bedrock-runtime converse --region us-west-2 \
    --model-id us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
    --messages '[{"role":"user","content":[{"text":"ping"}]}]'
  → 200, model returned 'pong'

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(tests): refresh AgentCore ARN suffixes to match newly-deployed runtimes

The first migration commit replaced just the account ID, but AgentCore
auto-assigns a random 10-char suffix to every runtime on creation — we
can't reuse the original suffixes (`3ySZuRHjLC`, `cALnp38iZD`) in the
new account. Updated the AgentCore-runtime ARNs in the three files that
reference real runtime IDs (not the mock-based unit-test ARNs).

Deployed runtimes:
  arn:aws:bedrock-agentcore:us-west-2:941277531214:runtime/hosted_agent_r9jvp-Rq79QFC2fp
  arn:aws:bedrock-agentcore:us-west-2:941277531214:runtime/hosted_agent_13sf6-4046UzHSwy

Both runtimes are status=READY and pass a smoke invoke:
  $ aws bedrock-agentcore invoke-agent-runtime --agent-runtime-arn ... --payload '{"prompt":"ping"}'
  → 200, {"result": "echo: ping"}

The agent is a minimal echo (see /tmp/agentcore_deploy/agent.py for the
deploy artifacts). Tests that only verify the SDK wiring will pass; if any
test asserts on agent output content, swap the echo for the real agent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(tests): point Bedrock batch tests at new-account S3 bucket

The account migration (888602223428 -> 941277531214) was a flat
account-ID swap, which only rewrites ARNs that embed the account
number. S3 bucket names carry no account ID, so the live Bedrock
batch tests still uploaded to `litellm-proxy` — a bucket that lives
in the old account. S3 names are globally unique, and the old account
still holds that name, so it can't be recreated in the new account.

Rename to `litellm-proxy-941277531214` (account-ID suffix guarantees
global uniqueness). The bucket must be created in 941277531214 and the
batch execution role granted s3:GetObject/PutObject/ListBucket on it
before this job is run in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(tests): point live S3 logging test at new-account bucket

Same account-ID-free blind spot as the batch bucket: `load-testing-oct`
lives in the old account and its name can't be reused globally. The
`logging_testing` CI job is wired into the workflow and runs
test_basic_s3_logging, which uploads to this bucket with the CI env
creds, then lists and deletes objects — a live dependency.

Rename to `load-testing-oct-941277531214`. The bucket must exist in the
new account with the CI IAM principal granted
s3:PutObject/GetObject/ListBucket/DeleteObject before this job runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(tests): repoint Bedrock guardrail IDs to new-account guardrails

The migration left guardrail IDs untouched (no account ID in them), so
all live guardrail tests failed with "guardrail identifier or version
does not exist" against 941277531214. Recreated both guardrails in the
new account and updated the hardcoded IDs:
  - wf0hkdb5x07f -> zgkmukebruil (PII mask: PHONE + CREDIT_DEBIT_CARD,
    with explicit inputAction=ANONYMIZE so masking applies to INPUT,
    which is the source litellm's moderation hook sends)
  - ff6ujrregl1q -> 4w3d1di3snt5 (blocks "coffee"; blocked message set
    to the exact string the tests assert on)

Updated test_bedrock_guardrails.py, otel_test_config.yaml, and the
guardrailConfig in test_bedrock_completion.py. Verified locally: the 5
previously-failing guardrail tests now pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(bedrock): migrate legacy models to current inference profiles

The new CI account (941277531214) cannot invoke legacy Bedrock models
(AWS gates them: "marked by provider as Legacy... not actively using in
the last 30 days"). Migrated the live-call tests:
  - anthropic.claude-3-sonnet-20240229    -> us.anthropic.claude-sonnet-4-5-20250929-v1:0
  - anthropic.claude-3-haiku-20240307     -> us.anthropic.claude-haiku-4-5-20251001-v1:0
Current Claude models on Bedrock require the us. inference-profile prefix
(bare on-demand ids are rejected).

cohere.command-r-plus has no working replacement (all Cohere is legacy-
gated in the new account): swapped to claude-haiku-4-5 in provider-
agnostic param lists. amazon.titan-image-generator skipped (no working
replacement). Mocked/transformation/cost tests that reference the legacy
strings are intentionally left unchanged. Verified live against the new
account.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(bedrock): repoint SageMaker + Knowledge Base to new-account resources

These referenced account-scoped resources by hardcoded id that only
existed in the old account, so the migration's account-ID swap missed
them. Recreated in 941277531214 and repointed:
  - SageMaker endpoint jumpstart-dft-hf-textgeneration1-mp-20240815-185614
    -> litellm-ci-textgen (gpt2 on a TGI container, ml.g5.xlarge)
  - Bedrock Knowledge Base T37J8R4WTM -> LCYXFBR2TU (OpenSearch Serverless
    vector store + titan-embed-text-v2, seeded with a LiteLLM doc)
Verified live: test_sagemaker.py (12 passed) and
test_bedrock_knowledgebase_hook.py (12 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(reasoning_effort_grid): skip bedrock claude-opus-4-7 cells (not entitled on 941277531214)

claude-opus-4-7 is listed in the new Bedrock CI account's foundation
models but invoke is denied (AccessDeniedException: "not available for
this account"). Bedrock access to the flagship Opus requires an AWS
Sales request, not the self-serve model-access toggle, so it can't be
enabled inline with the rest of the account migration.

Add an optional `skip_reason` to ModelEntry and set it on the
bedrock-claude-opus-4-7 entry; the grid test honors it via pytest.skip.
Cell count (231) and route coverage are unchanged, so the structural
asserts still pass. Restore coverage by deleting the one skip_reason
line once access is granted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(bedrock): swap/skip legacy-gated models unavailable on new CI account

The migrated AWS account (941277531214) cannot access several models that
the old account could, so the remaining red CI jobs were hitting real
Bedrock "Access denied / Legacy" and "account not authorized" errors:

- image_gen: skip both Nova Canvas test classes (amazon.nova-canvas-v1:0 is
  legacy-gated), matching the existing titan skip.
- batches: skip test_async_file_and_batch (Bedrock batch inference is not
  authorized on the new account; requires an AWS support case).
- litellm_overhead: swap legacy claude-3-5-haiku for the active
  us.anthropic.claude-haiku-4-5 inference profile.
- test_completion_claude_3_function_call: swap legacy claude-3-sonnet for the
  active us.anthropic.claude-sonnet-4-5 inference profile.

https://claude.ai/code/session_01Y7zgHYu9GX29YRwV4yiWAa

* test(bedrock): fix remaining e2e legacy-model + batch failures on new CI account

- e2e_openai_endpoints: skip test_bedrock_batches_api (Bedrock batch inference
  is not authorized on account 941277531214) and migrate the missed
  s3_bucket_name in oai_misc_config.yaml to litellm-proxy-941277531214.
- build_and_test: swap legacy bedrock claude-3-sonnet for the active
  us.anthropic.claude-sonnet-4-5 inference profile in the proxy structured
  output e2e test.

https://claude.ai/code/session_01Y7zgHYu9GX29YRwV4yiWAa

* test(bedrock): make opus-4-7 + batch cells fail loudly and mock image-gen (#28791)

Replace the silent skips added for the new CI account with noisier behavior:
- reasoning-effort grid: opus-4-7 cells now fail (when AWS creds are present)
  instead of skipping, so the missing entitlement stays visible in CI; they
  still skip when AWS creds are absent (local dev)
- Bedrock batch inference tests: drop the skip so they run and fail until
  batch access is granted
- Titan + Nova Canvas image-gen tests: mock the Bedrock HTTP call so the
  transform + cost-tracking path stays under test without live model access

https://claude.ai/code/session_01MT7SWDnXUjv6e6EPG7BDjT

Co-authored-by: Claude <noreply@anthropic.com>

* test(bedrock): use pytest.xfail for known-failing opus-4-7 cells

Replace pytest.fail with pytest.xfail when a model has a fail_reason,
so known-broken cells stay visible as XFAIL without keeping CI red.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: Mateo <mateo@Mateos-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
2026-05-25 12:03:17 -07:00
milan-berri
f45909cb81
fix(proxy): Bedrock Knowledge Base pass-through: preserve SigV4 headers and signed request body (#27526)
* Fix Bedrock KB pass-through SigV4 headers and signed body

Coerce botocore HeadersDict to a dict for pass-through routes. When
forward_headers is true, drop request headers that collide case-insensitively
with signed headers so client Bearer auth does not shadow AWS SigV4.
Send prepped.body as raw content so the outbound payload matches the
signature after logging hooks mutate the parsed dict.
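
The header handling described above might look roughly like this. It is
a sketch under assumptions: merge_forwarded_headers is a hypothetical
name, and the real pass-through code may structure the merge
differently.

```python
from typing import Dict, Mapping


def merge_forwarded_headers(
    client_headers: Mapping[str, str],
    signed_headers: Mapping[str, str],
) -> Dict[str, str]:
    """Forward client headers, but let SigV4-signed headers win on any name clash."""
    # Coerce to a plain dict (the source mentions botocore's HeadersDict).
    signed = dict(signed_headers)
    signed_names = {name.lower() for name in signed}
    # Drop client headers whose name collides case-insensitively with a signed one.
    merged = {
        name: value
        for name, value in client_headers.items()
        if name.lower() not in signed_names
    }
    merged.update(signed)
    return merged
```

A client "authorization: Bearer ..." header is dropped when the signed
set carries "Authorization", so only the SigV4 value is sent.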

Co-authored-by: Cursor <cursoragent@cursor.com>

* Simplify pass-through raw body handling

Read the SigV4-signed bytes directly from request.state inside
pass_through_request instead of threading a custom_raw_body argument
through three functions. Helper methods are restored to their original
signatures, and the new branch lives in one place at each httpx call site.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Harden pass-through raw body read from request.state

Guard missing request.state (test fixtures) and ignore non-bytes/str
values so MagicMock does not trigger the SigV4 raw-body path.
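
One way to express that guard, as a sketch only: the helper name
read_signed_body and the attribute name signed_raw_body are
hypothetical, chosen for illustration.

```python
from types import SimpleNamespace
from typing import Any, Optional, Union
from unittest.mock import MagicMock


def read_signed_body(
    request: Any, attr: str = "signed_raw_body"  # attribute name is an assumption
) -> Optional[Union[bytes, str]]:
    """Return the signed raw body from request.state, or None if absent or not bytes/str."""
    state = getattr(request, "state", None)
    if state is None:
        return None  # some test fixtures have no .state at all
    value = getattr(state, attr, None)
    # A MagicMock auto-creates attributes, so an existence check is not enough;
    # accept only real bytes/str payloads.
    return value if isinstance(value, (bytes, str)) else None


# Example inputs: a request with a signed body, a bare mock, and a stateless fixture.
real = SimpleNamespace(state=SimpleNamespace(signed_raw_body=b'{"q":"ping"}'))
mocked = MagicMock()
no_state = SimpleNamespace()
```

Only the first example yields a body; the other two fall back to the
regular JSON path.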

Co-authored-by: Cursor <cursoragent@cursor.com>

* Test pass_through_request state_raw_body uses httpx content=

Cover non-streaming (async_client.request) and streaming (build_request)
paths so SigV4 bytes on request.state are not replaced by json= of a
hook-mutated dict.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-25 19:21:55 +05:30
ryan-crabbe-berri
4148667671
Fix spend logs v2 route permissions (#28705)
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: ryan-crabbe-berri <ryan-crabbe-berri@users.noreply.github.com>
2026-05-23 16:57:14 -07:00
ryan-crabbe-berri
92d4bba58f
fix(ui/add-model): stop vertex_ai-anthropic_models from leaking under Anthropic (#28723)
`getProviderModels()` matched a model into a provider's dropdown when the
model's `litellm_provider` string *contained* the provider key as a
substring. The intent was to admit suffix variants (e.g. `anthropic_text`,
`bedrock_converse`), but the substring check is too loose: it also pulls in
unrelated providers whose name happens to contain the key, most visibly
`vertex_ai-anthropic_models` matching `anthropic` and `vertex_ai-openai_models`
matching `openai`.

Replace `.includes()` with separator-anchored prefix matching
(`startsWith(provider + "_")` / `startsWith(provider + "-")`). All legitimate
variants in `model_prices_and_context_window.json` still match
(`anthropic_text`, `azure_text`, `azure_ai`, `bedrock_converse`,
`bedrock_mantle`, `cohere_chat`, `fireworks_ai-embedding-models`,
`vertex_ai-*`, `vertex_ai_beta`), and the cross-provider leak is closed.

Tests: update one assertion that pinned the buggy substring behavior
(`custom_openai_endpoint` matching `openai` — not a real provider value);
add 6 new tests covering the leak regressions and the variant-preservation
contract for vertex_ai/bedrock/fireworks.
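
The matching rule, transliterated to Python for illustration (the real
code is TypeScript in the dashboard). The function name
provider_matches is hypothetical, and the exact-match branch is an
assumption; the message itself only names the two startsWith checks.

```python
def provider_matches(litellm_provider: str, provider: str) -> bool:
    """True if litellm_provider is the provider itself or a separator-anchored variant."""
    return (
        litellm_provider == provider  # assumed: an exact match is always admitted
        or litellm_provider.startswith(provider + "_")
        or litellm_provider.startswith(provider + "-")
    )
```

Under this rule `anthropic_text` still matches `anthropic`, while
`vertex_ai-anthropic_models` matches only `vertex_ai`.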
2026-05-23 16:56:52 -07:00
yuneng-jiang
5f73ad4fe7
fix(team): refresh team cache on team_model_add/delete (LIT-3244) (#28683)
* fix(team): refresh team cache on team_model_add/delete (LIT-3244)

team_model_add and team_model_delete wrote to the DB but did not
invalidate the in-memory LiteLLM_TeamTableCachedObj used by
common_checks. After the v1.83.14 common_checks centralization made
team.models authoritative on /v1/files and /v1/vector_stores/*,
adding a Team-BYOK model silently failed to grant the new public
model name to team members until the cache TTL expired (and a
removed model kept working until then on the symmetric path).

Extract the cache-refresh snippet from update_team into a small
helper and apply it consistently at all three team-write sites.

* test: also assert updated models in team-cache-refresh pin

Strengthens the LIT-3244 regression test to also assert
`call_kwargs["team_table"].models` matches the updated row,
not just `team_id`. Both `existing_team` and `updated_team`
share `team_id` in the test setup, so the previous assertion
would have passed even if the implementation accidentally cached
the pre-mutation row.

Greptile review feedback.

* fix(team): hydrate object_permission on cache-refreshing team updates

The Prisma update calls in update_team, team_model_add, and
team_model_delete returned a team row with object_permission_id set
but object_permission=None (the relation was not requested via
include=). _refresh_cached_team then wrote that to the in-memory
LiteLLM_TeamTableCachedObj, and the cache-hit path in get_team_object
returns the cached object without re-hydrating. Downstream consumers
(validate_key_search_tools_against_team, the MCP/agent authz paths)
treat a missing object_permission as no team-level restriction, so
a team-write op silently dropped object-permission enforcement until
the cache TTL expired or a DB-fetch path re-hydrated it.

Add include={"object_permission": True} to all three updates so the
refresh writes a complete cached team. Extend the LIT-3244 regression
test to pin both the cached object_permission and the include shape
on the Prisma call.

Surfaced in PR review of LIT-3244.
2026-05-23 16:41:05 -07:00
yuneng-jiang
3bcfe41f05
test(model_prices): allow audio_transcription_config in schema (#28708)
The schema in test_aaamodel_prices_and_context_window_json_is_valid uses
additionalProperties: false. The azure/speech/azure-stt entry added in
#27482 introduced an audio_transcription_config field that the schema
did not whitelist, so the test fails on every branch built on top of
staging.

Add the field as a string property.
2026-05-23 16:19:43 -07:00
yuneng-jiang
7c667b8797
fix(helm): drop main- prefix from default image tag (#28710)
* fix(helm): drop main- prefix from default image tag

The default image tag in the deployment + migrations-job templates was
`main-{{ .Chart.AppVersion }}`. The current release pipeline publishes
content tags without the `main-` prefix (e.g. `v1.85.1` / `1.85.1`,
`v1.86.0-rc.1` / `1.86.0-rc.1`), so the rendered ref points at a tag
that does not exist on GHCR or DockerHub and installs fail with
ImagePullBackOff.

- templates/deployment.yaml, templates/migrations-job.yaml: render
  `.Chart.AppVersion` directly instead of `main-<AppVersion>`.
- Chart.yaml: bump stale `appVersion: v1.80.12` (not on either
  registry) to `v1.85.1` so local-checkout installs also resolve.
- values.yaml: update the commented tag-override hint to match.

* fix(helm): use :latest in tag override example, not pinned version

Per review: ghcr.io/berriai/litellm-database:latest is a floating
alias for the most recent stable (same digest as :main-stable),
maintained by the release pipeline's UPDATE_LATEST advance step.
Better example than a pinned version that goes stale.
2026-05-23 15:57:38 -07:00
yuneng-jiang
8513d7fc0c
chore: update Next.js build artifacts (2026-05-23 19:21 UTC, node v20.20.2) (#28707)
2026-05-23 13:05:26 -07:00
ryan-crabbe-berri
886e91b85e
fix(otel): stamp http.response.status_code on all error responses (#28405)
* fix(otel): stamp http.response.status_code on all error responses

httpx.HTTPStatusError exposes status under .response.status_code, not as a
top-level attr, so unified-endpoint 5xx failures left the SERVER span without
a status. The admin hooks only wrote a child span and never stamped or ended
the parent at all, so admin 4xx/5xx (and success) responses were invisible
to dashboards. Adds a fallback to .response.status_code in get_error_information,
and ends the parent SERVER span in async_management_endpoint_{success,failure}_hook
with the same _record_exception_on_span helper the unified path uses.
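
The fallback can be sketched like this. It is an illustration, not the
body of get_error_information; extract_status_code is a hypothetical
name.

```python
from typing import Optional


def extract_status_code(exc: BaseException) -> Optional[int]:
    """Read a status code from the exception, falling back to exc.response.status_code."""
    code = getattr(exc, "status_code", None)
    if code is None:
        # httpx.HTTPStatusError keeps the status on its response object.
        response = getattr(exc, "response", None)
        code = getattr(response, "status_code", None)
    return code if isinstance(code, int) else None
```

An exception with neither attribute yields None, which a caller can
then map to a default.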

Resolves LIT-3193

* test(otel): exercise httpx.HTTPStatusError through admin path

Pins the contract that get_error_information's response.status_code fallback
is reachable from any entry point — without this, a future refactor that
bypasses _record_exception_on_span in the admin hooks could regress for
httpx-wrapped exceptions while the unified suite still passes.

* chore(otel): trim verbose comments in LIT-3193 changes

Tighten docstrings and remove redundant section dividers/inline narration.
Behavior is unchanged.

* fix(otel): set span.status on management hook parent SERVER span

Mirror the unified failure path: stamp StatusCode.ERROR on the parent
SERVER span before recording the exception, and StatusCode.OK before
ending it on success. Without this, OTEL backends filtering on span
status (the idiomatic primitive) miss admin-endpoint failures even
though the http.response.status_code attribute is correct.

Extend assert_server_span_attrs to assert span.status.status_code
matches the expected outcome so the gap can't regress.

* fix(otel): close SERVER span on body-validation and unhandled errors

Stash the SERVER span on request.state in auth so FastAPI exception
handlers can finish it for failures that occur after auth but before
the route handler (e.g. /model/new TypeError, /key/generate
RequestValidationError). Without this, those requests left dangling
spans missing http.response.status_code.

Resolves LIT-3193

* fix(otel): generic 500 body, log exception details server-side

Don't leak str(exc) and type(exc).__name__ to clients on uncaught
exceptions. The full traceback is logged via verbose_proxy_logger and
the SERVER span still gets http.response.status_code=500.

Resolves LIT-3193

* fix(otel): stamp http.response.status_code on every SERVER span path

Closes three remaining gaps where the proxy SERVER span ended without
the http.response.status_code attribute:

1. ProxyException raised from _read_request_body (e.g. invalid JSON
   body) bubbled out of user_api_key_auth before the SERVER span was
   created, so the FastAPI handler had nothing to close and the trace
   never reached the backend. Hoist the span creation to a new
   idempotent _ensure_parent_otel_span_on_request_state helper called
   at the top of user_api_key_auth; wire openai_exception_handler to
   close the dangling span. Covers /v1/chat/completions, /v1/messages,
   /v1/responses (shared handler).

2. /v1/responses success — _handle_success ends the proxy span before
   async_post_call_success_hook fires on this path, so the hook's
   set_response_status_code_attribute(200) silently no-op'd against an
   ended span. Stamp 200 + set OK status at the close site in
   _handle_success / _end_proxy_span_from_kwargs via a shared
   _close_proxy_span_ok helper, so the attribute lands regardless of
   which success hook runs first.

3. Failure path for exceptions without code/status_code (e.g. a bare
   TypeError surfacing through _handle_llm_api_exception) — empty
   error_information.error_code → _record_exception_on_span skips the
   stamp → the hook ends the span. Default to 500 in
   async_post_call_failure_hook so the attribute is always set.
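
The idempotent "ensure" step in item 1 might be sketched as below. The
helper name ensure_span_on_state, the attribute name, and the
create_span callback are all assumptions for illustration; the real
_ensure_parent_otel_span_on_request_state may differ.

```python
from types import SimpleNamespace
from typing import Any, Callable


def ensure_span_on_state(
    state: Any,
    create_span: Callable[[], Any],
    attr: str = "parent_otel_span",  # attribute name is an assumption
) -> Any:
    """Create the span once and stash it on state; later calls return the same span."""
    existing = getattr(state, attr, None)
    if existing is not None:
        return existing
    span = create_span()
    setattr(state, attr, span)
    return span


# Usage: calling twice creates exactly one span.
created = []
state = SimpleNamespace()
first = ensure_span_on_state(state, lambda: created.append("span") or object())
second = ensure_span_on_state(state, lambda: created.append("span") or object())
```

Because the span lives on the request state, an exception handler that
runs before the route handler can still find it and close it.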

Resolves LIT-3193
2026-05-23 19:21:32 +00:00
ishaan-berri
14c0a2b3e2
feat(prometheus): emit per-token-type detail metrics (LIT-3220) (#28372) (#28378)
* feat(prometheus): emit per-token-type detail metrics (LIT-3220) (#28372)

Adds five sparse counter metrics that break out the token detail
fields providers already report in `usage.prompt_tokens_details` and
`usage.completion_tokens_details`:

  - litellm_input_cached_tokens_metric            (provider prompt-cache reads)
  - litellm_input_cache_creation_tokens_metric    (Anthropic prompt-cache writes)
  - litellm_input_audio_tokens_metric             (audio input tokens)
  - litellm_output_reasoning_tokens_metric        (reasoning tokens)
  - litellm_output_audio_tokens_metric            (audio output tokens)

These are additive — existing input/output/total counters are
unchanged, so no dashboards break. Each new counter is only
incremented when the underlying detail is populated and > 0, keeping
scrape output sparse for providers that don't report a given field.

Data is read from the canonical Usage dict that
`get_standard_logging_object_payload` already attaches at
`standard_logging_payload["metadata"]["usage_object"]`, so no new
plumbing through the logging pipeline is required.

Tests: 10 new unit tests covering registration, label-set parity,
all-types increment, zero/None/negative skip behaviour, and the
no-metadata/no-usage_object no-op paths.
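
A minimal sketch of the sparse-increment rule, using a plain Counter in
place of Prometheus counters. The function name, the DETAIL_FIELDS
mapping, and the usage field names (cached_tokens, audio_tokens,
reasoning_tokens) are assumptions; the cache-creation metric is left
out because the message does not say which usage field feeds it.

```python
from collections import Counter
from typing import Any, Mapping, Optional

# Hypothetical mapping: metric name -> (usage section, field inside that section).
DETAIL_FIELDS = {
    "litellm_input_cached_tokens_metric": ("prompt_tokens_details", "cached_tokens"),
    "litellm_input_audio_tokens_metric": ("prompt_tokens_details", "audio_tokens"),
    "litellm_output_reasoning_tokens_metric": ("completion_tokens_details", "reasoning_tokens"),
    "litellm_output_audio_tokens_metric": ("completion_tokens_details", "audio_tokens"),
}


def record_token_details(counters: Counter, usage: Optional[Mapping[str, Any]]) -> None:
    """Increment a detail counter only when its value is a positive number."""
    if not usage:
        return  # no usage object: nothing to record
    for metric, (section, field) in DETAIL_FIELDS.items():
        details = usage.get(section)
        if not isinstance(details, Mapping):
            continue
        value = details.get(field)
        # Skip None, zero, negative, and non-numeric values to keep output sparse.
        if isinstance(value, (int, float)) and not isinstance(value, bool) and value > 0:
            counters[metric] += value
```

Metrics whose detail is missing or zero never appear in the counter, so
providers that do not report a field add nothing to the scrape output.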

Closes LIT-3220

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Krrish Dholakia <krrishdholakia@berri.ai>
Co-authored-by: Claude <noreply@anthropic.com>

* chore: remove proof folder image

---------

Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai>
Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Krrish Dholakia <krrishdholakia@berri.ai>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>
2026-05-23 12:17:42 -07:00
yuneng-jiang
5e16f20962
test(proxy): phase-4 payload behavior pinning for tier-2/3 key + team management endpoints (#28681)
* test(proxy): phase-4 payload behavior pinning for tier-2/3 key + team management endpoints

Extends the Phase 1–3 behavior-pin suite at tests/proxy_behavior/management/
with a second axis: payload-shape pinning. Phase 1–3 held payload minimal
and pinned (actor, target) → status across 37 routes; Phase 4 holds the
caller fixed at an authorized actor, varies the payload shape, and asserts
the observable DB effect (on accept) or the named guard / row-unchanged
(on reject). Faithfulness contract from Phase 1–3 is unchanged.

Six families + one gap-closer (59 new scenarios, 620 → 679 total):

  * F1 — key budget / rate-limit (test_key_budget_limits.py, 18)
  * F2 — key↔team reassignment   (test_key_team_change.py, 6)
  * F3 — team budget / rate-limit (test_team_budget_limits.py, 15)
  * F4 — member-info validation   (test_team_member_info_validation.py, 5)
  * F5 — permission batching      (test_team_permissions_bulk_update.py, 6)
  * F6 — org-scoped team access   (+2 detail-string pins in existing files)
  * F7 — coverage gap-closer      (test_f7_coverage_closeout.py, 7)

Harness extensions in conftest.py (additive only):
  * create_scratch_org() seeder with its own scratch-prefixed budget row
  * budget / limit fields on create_scratch_team()
  * scratch teardown also sweeps litellm_organizationtable

Coverage telemetry (behavior-suite-only):
  * key_management_endpoints.py  60 % → 65 % (+82 lines)
  * team_endpoints.py            62 % → 72 % (+137 lines, crosses 70 % stretch)

key_management_endpoints.py lands under 70 % per plan §7 escape hatch: the gap is dominated by
routes outside F1–F6 scope (key list/info v2 internals) and structurally
dead org-budget guards (call sites at lines 889 + 2310 + 985 + 1751 load
the org without include_budget_table=True, so org.litellm_budget_table is
None at guard time and the aggregate guard no-ops). Pinned as observed
no-op behavior so a future fix that flips the flag turns these into reds.

Zero source-code changes; pyproject.toml diff is empty;
test_route_coverage.py stays green untouched; G3 grep guards still green;
local wall-time 14 s for the full suite (no coverage), 22 s with coverage.

G4 regression-replay protocol executed against three representative
fix-PR parents (410ce761dc, 0bd49ecb8b, 8bbc61e03c): all Phase 4 tests
PASS at pre-fix SHAs — confirming the F1–F7 layer is a helper-body pin,
not a regression-replay layer for those specific historical bypass
shapes. Targeted RED-bait scenarios for each fix are left for a
follow-up PR.

* test(proxy): push key_management_endpoints.py past the 70% stretch (F7-extension)

Adds 24 more payload-pin scenarios in test_f7_key_coverage_push.py
following the same accepted-effect / rejected-guard pattern. Each
scenario cites the file:line range it pins; same anti-snapshot rules
apply.

Target ranges (all reachable via HTTP-boundary payload variation):
  * 5942-6063  /key/health with metadata.logging → test_key_logging body
  * 4565-4692  /key/reset_spend happy + 404 + non-admin gate + value validation
  * 4421-4533  /key/regenerate ghost-404 + happy + new_key + grace_period
  * 4168-4202  _insert_deprecated_key body via grace_period
  * 6118-6133  _enforce_unique_key_alias duplicate-alias rejection
  * 6148-6169  validate_model_max_budget malformed-payload rejection
  * 4708-4789  validate_key_list_check user/team/org/key_hash branches
  * 2622-2733  /key/bulk_update mixed success/failure + admin gate + size limits
  * 2797-2950  /team/key/bulk_update all-keys path + explicit-keys dedupe + 404
  * 5108-5207  /key/aliases admin + scoped + search-filter branches
  * 3253-3303  /key/info ghost + explicit-key + no-key-uses-auth-header
  * 3427-3436  generate_key_helper_fn budget_limits initialization
  * 1794-1815  prepare_key_update_data duration + budget_duration paths
  * 5280-5388  _build_filter_conditions across include_created_by_keys/team/sort/alias

Coverage telemetry — full PR4 dataset:

  key_management_endpoints.py: 60 % → 71 %  (+11 pts, +194 lines)
  team_endpoints.py:           62 % → 72 %  (+10 pts, +137 lines)

Both files now over the plan §7 PR4.M4 70 % stretch as a side effect of
pinning real payload behavior. 721 tests pass in 19 s local (full suite,
no coverage); 27 s with coverage. Zero source-code changes; pyproject.toml
diff still empty; test_route_coverage.py + G3 grep guards still green.

Honest finding (kept from the prior commit's body): four structurally-dead
org-budget guards remain pinned as observed no-op behavior — they fire
only when get_org_object is called with include_budget_table=True, which
none of the four management-endpoint call sites currently do. Pinned so
a future change that flips the flag turns these into reds.

Two helper guards are honest-ceiling: _validate_reset_spend_value's
isinstance check at line 4568 is unreachable from HTTP because Pydantic
422s non-float before the helper runs; same shape for /team/key/bulk_update's
missing team_id / no-selector pre-handler guards.

* test(proxy): address PR review — try/finally cleanup + loosen 500 envelope pins + Optional annotations

Greptile review feedback on PR #28681:

1. Wrap manual budget-row cleanup in try/finally so an assertion failure
   doesn't leave non-scratch-prefixed budget rows orphaned across CI re-runs
   (test_team_new_with_team_member_budget_creates_budget_row and
   test_team_update_team_member_budget_upserts).
2. Loosen the two 500-status pins to in (400, 422, 500) — the named-guard
   substring is the real pin; the outer ValueError-wrap envelope is an
   implementation detail that a future improvement should be free to fix
   to a proper 400/422 without flipping these tests red.
3. Add missing Optional annotations on _seed_token's max_budget / metadata
   / team_id keyword args (they default to None).
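
Items 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the test code from the PR: BUDGET_ROWS, create_budget_row, delete_budget_row and run_with_cleanup are hypothetical stand-ins, and the "max_budget" substring is an assumed guard name.

```python
from typing import Dict

# Hypothetical in-memory stand-in for the budget table (not the real fixture).
BUDGET_ROWS: Dict[str, float] = {}


def create_budget_row(budget_id: str, max_budget: float) -> None:
    BUDGET_ROWS[budget_id] = max_budget


def delete_budget_row(budget_id: str) -> None:
    BUDGET_ROWS.pop(budget_id, None)


def run_with_cleanup(budget_id: str, status_code: int, detail: str) -> None:
    """Create a row, assert on the response, and always clean up."""
    create_budget_row(budget_id, 10.0)
    try:
        # Loose envelope pin: any of these statuses is acceptable.
        assert status_code in (400, 422, 500)
        # The named-guard substring is the real pin (assumed name here).
        assert "max_budget" in detail
    finally:
        # Runs even when an assertion above fails, so no row is orphaned.
        delete_budget_row(budget_id)
```

With this shape, a failing assertion still propagates to the test runner, but the manually created row is removed first.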

Greptile's typo flag on 'read-world' in the conftest comment is declined —
'read-world' is the project's established term for the immutable seeded
world fixture (see other usages in conftest.py and actors.py).

721 tests still pass in 17 s.
2026-05-23 12:16:29 -07:00
ishaan-berri
203b529c9d
feat(azure): add speech transcription config support (#27482)
Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com>
Co-authored-by: ishaan-berri <ishaan-berri@users.noreply.github.com>
2026-05-23 12:16:01 -07:00
Yassin Kortam
2eab9ee2c0
perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths (#28289)
* perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths

- Introduce pure-text fast-path in `_build_complete_streaming_response` that collapses O(N) `content_block_delta` events into a single equivalent SSE event before conversion, eliminating per-output-token Pydantic `ModelResponseStream` construction; non-text streams (tool_use, thinking, citations) fall back to the unchanged legacy path
- Skip agentic streaming wrapper entirely when no callback overrides `async_should_run_agentic_loop`; the wrapper buffered every chunk and rebuilt the SSE response only to call hooks that all return `(False, {})` — a pure no-op for the default config
- Serialize request body once (`json.dumps`) for both the pre-call log input and the wire, instead of twice; avoids a full O(payload) scan per request, significant for long-context Claude Code histories
- Add fast path in `async_streaming_data_generator` that bypasses the per-chunk `async_post_call_streaming_hook` coroutine await, response-string materialization, and cost-injection call when no callback/guardrail/cost-injection is active (the default config)
- Resolve `_DD_STREAMING_TRACE_ENABLED` once at import time; eliminate per-chunk `NullSpan` context manager allocation when Datadog tracing is disabled (the default)
- Memoize `get_type_hints(AnthropicMessagesRequestOptionalParams)` with `@lru_cache(maxsize=1)` — resolves once per process instead of once per `/v1/messages` request (~80µs each)
- Hoist `cost_injection_active` out of the per-chunk loop in `chunk_processor`; eliminates repeated `getattr` + endpoint-type checks on every streamed byte chunk
- Extract `_build_passthrough_logging_result` from `_route_streaming_logging_to_handler` as a standalone static method to facilitate future off-loop dispatch
- Convert `async_sse_data_generator` from an `async for: yield` trampoline to a direct return of the underlying generator, removing one async-generator layer per streamed chunk
- Skip redundant `strip_empty_text_blocks_from_anthropic_messages` scan in `anthropic_messages_handler` when the async wrapper already sanitized (signalled via `_litellm_messages_presanitized` sentinel, popped before reaching provider params)
- Gate debug log `f-string` evaluation behind `isEnabledFor(DEBUG)` in both the streaming generator and the transformation layer to avoid serializing entire message payloads on every request at non-debug log levels
- Add benchmark script (`scripts/benchmark_anthropic_messages_perf.py`) with a local mock Anthropic SSE provider for reproducible TTFT and TPM measurement across commits/branches
- Add parity tests asserting fast-path and legacy-path produce byte-identical logged/billed payloads, plus unit tests for agentic hook detection, pre-serialized body reuse, and memoized key resolution

* perf: address greptile review for anthropic streaming hot path

- Bail to legacy in `_collapse_pure_text_chunks` when content_block_delta
  events from different block indexes are observed without an intervening
  flush. Anthropic sends blocks strictly sequentially, but defensive bail
  prevents silent text-merging if the protocol ever interleaves.
- Replace leaf-class `__dict__` check for `async_post_call_streaming_hook`
  in `_callback_capabilities` with a function-identity comparison that
  walks the MRO. A vendor base class can carry the override and the
  registered class can add nothing else; before this PR the hook was
  unconditionally invoked, so an inherited-override miss would silently
  drop the hook on the streaming path.
- Add unit tests for both behaviors.
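
The difference between the leaf-class `__dict__` check and the function-identity comparison can be sketched as below. All class and function names other than `async_post_call_streaming_hook` are hypothetical, and the real `_callback_capabilities` may be structured differently.

```python
class BaseLogger:
    # Hypothetical stand-in for the callback base class.
    async def async_post_call_streaming_hook(self, chunk):
        return chunk


class VendorBase(BaseLogger):
    # A vendor base class that carries the override.
    async def async_post_call_streaming_hook(self, chunk):
        return chunk.upper()


class RegisteredCallback(VendorBase):
    pass  # adds nothing; inherits the vendor override


class DefaultCallback(BaseLogger):
    pass  # no override anywhere in its hierarchy


HOOK = "async_post_call_streaming_hook"


def leaf_dict_check(callback: object) -> bool:
    # Only sees methods defined directly on the registered class.
    return HOOK in type(callback).__dict__


def overrides_hook(callback: object) -> bool:
    # getattr on the class resolves through the MRO, so an override
    # inherited from an intermediate base is still detected.
    return getattr(type(callback), HOOK) is not getattr(BaseLogger, HOOK)
```

Here `leaf_dict_check` reports no override for RegisteredCallback, while `overrides_hook` detects the inherited one.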

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(mypy): narrow model_name to str in cost-injection branch

The hoisted cost_injection_active flag in chunk_processor encodes the
`bool(model_name)` requirement but mypy can't track that invariant
through the local, so the per-chunk `_process_chunk_with_cost_injection(
chunk, model_name)` calls flagged Optional[str] vs str. Pin a typed
non-None local inside the cost-injection branch so mypy narrows
correctly without changing runtime behavior.
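
One way to express that narrowing is sketched below. The helper body and the `enabled` parameter are assumptions made for illustration; only the names `chunk_processor`, `cost_injection_active`, `model_name` and `_process_chunk_with_cost_injection` come from the commit message.

```python
from typing import List, Optional


def _process_chunk_with_cost_injection(chunk: bytes, model_name: str) -> bytes:
    # Hypothetical body; the real helper is not shown in this log.
    return chunk + b"|" + model_name.encode()


def chunk_processor(
    chunks: List[bytes], model_name: Optional[str], enabled: bool
) -> List[bytes]:
    # Hoisted flag: computed once, outside the per-chunk loop.
    cost_injection_active = enabled and bool(model_name)
    out: List[bytes] = []
    if cost_injection_active and model_name is not None:
        # Typed non-None local so the checker sees str, not Optional[str].
        typed_model_name: str = model_name
        for chunk in chunks:
            out.append(_process_chunk_with_cost_injection(chunk, typed_model_name))
    else:
        out.extend(chunks)
    return out
```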

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 12:15:59 -07:00
Michael-RZ-Berri
3b2ce201d8
encrypt callback_vars in key/team metadata at rest (#27141)
Co-authored-by: Michael Riad Zaky <michaelr@Michaels-MacBook-Air.local>
Co-authored-by: Yuneng Jiang <yuneng@berri.ai>
2026-05-23 12:15:44 -07:00
Mateo Wang
492891cad8
CI: copy of #25177 (OCI GenAI: embeddings, streaming/reasoning fixes, model catalog) (#28223)
* fix(opentelemetry): JSON-serialize dict metadata fields for OTEL span attributes (#27451) (#27455)

Squash-merged by litellm-agent from Anai-Guo's PR.

* feat(dashscope): add embeddings and reranks(qwen3-rerank) support via OpenAI-compatible endpoint (#27508)

Squash-merged by litellm-agent from yimao's PR.

* fix(vertex_ai/gemini): raise BadRequestError when image_url or url fi… (#24550)

Squash-merged by litellm-agent from krisxia0506's PR.

* fix(vertex_ai): raise error on mid-stream 429/error chunks instead of silently swallowing (#23711)

Squash-merged by litellm-agent from krisxia0506's PR.

* fix: raise BadRequestError for file content blocks missing 'file' sub… (#24503)

Squash-merged by litellm-agent from krisxia0506's PR.

* Fix Gemini MIME detection for extensionless GCS URIs (#27278)

Squash-merged by litellm-agent from krisxia0506's PR.

* fix(vertex_ai/partner_models): drop unused vertexai SDK gate from count_tokens (closes #28084) (#28107)

Squash-merged by litellm-agent from voidborne-d's PR.

* feat(chart): add support for autoscaling behavior in HPA (#27990)

Squash-merged by litellm-agent from FabrizioCafolla's PR.

* feat(proxy): add blocked flag to models for pause/resume from the UI (#27927)

Squash-merged by litellm-agent from Cyberfilo's PR.

* fix: pass socket timeouts to Redis cluster clients (#27920)

Squash-merged by litellm-agent from tomdee's PR.

* Fix/cache token (#28009)

Squash-merged by litellm-agent from escon1004's PR.

* fix(deepseek): forward reasoning_content in multi-turn thinking mode conversations (#28080)

Squash-merged by litellm-agent from Divyansh8321's PR.

* fix(guardrails): return HTTP 400 instead of 500 for blocked requests (#27617)

* fix: reset org and tag budgets (#27326)

* reset org budgets

* reset tag budgets

---------

Co-authored-by: Michael Riad Zaky <michaelr@Mac.localdomain>

* fix(ui): omit allowed_routes from key edit save when unchanged (#27553)

* fix(ui): omit allowed_routes from key edit save when unchanged

When a team admin opens Edit Settings on a key with key_type=AI APIs and
saves without changing anything, the UI re-sends the existing allowed_routes
value, which the backend's _check_allowed_routes_caller_permission gate
rejects for non-proxy-admins (LIT-2681).

Strip allowed_routes from the patch in handleSubmit when it deep-equals the
original keyData.allowed_routes. The backend treats absence as "leave alone,"
so no-op saves now succeed for non-admins. Admins explicitly editing the
field still send the new value.

* fix(ui): order-insensitive allowed_routes diff + cover null-original case

Address Greptile review:

- Switch the "is allowed_routes unchanged" check to a Set-based comparison so
  a server-side reorder of the array doesn't register as a user edit and
  re-trigger LIT-2681.
- Add two regression tests: (1) keyData.allowed_routes is null and the form
  is untouched — patch should strip the field; (2) server returned routes in
  a different order than the user originally entered — patch should still
  recognize the value as unchanged.

* chore(ui): strip ticket refs and tighten comments in key edit fix

- Remove internal-tracker references from in-code comments
- Tighten the WHY comment in handleSubmit to two lines
- Drop redundant test-block comments — test names already describe the case

* fix(ui): annotate Set<string> generic in allowed_routes diff to fix tsc

* fix(guardrails): return HTTP 400 instead of 500 for guardrail-blocked requests

GuardrailRaisedException and BlockedPiiEntityError both lacked a
status_code attribute.  When these exceptions reached the proxy
exception handler (getattr(e, 'status_code', 500)), the fallback
defaulted to HTTP 500 — making intentional guardrail blocks
indistinguishable from server errors and causing unnecessary client
retries.

Changes:
- Add status_code=400 (keyword-only) to GuardrailRaisedException
- Add status_code=400 (keyword-only) to BlockedPiiEntityError
- Update _is_guardrail_intervention() to recognize both exceptions
  so downstream loggers record 'guardrail_intervened' instead of
  'guardrail_failed_to_respond'
- Add 6 unit tests for default/custom status codes and getattr pattern
- Strengthen existing blocked-action test with status_code assertion
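
A simplified sketch of the keyword-only status_code and the handler's getattr pattern follows. The constructor signature is reduced for illustration (the real exception takes other arguments), and resolve_status is a hypothetical helper name.

```python
class GuardrailRaisedException(Exception):
    # Simplified stand-in for the real exception class.
    def __init__(self, message: str = "", *, status_code: int = 400) -> None:
        super().__init__(message)
        self.status_code = status_code


def resolve_status(exc: Exception) -> int:
    # Same shape as the handler's getattr(e, 'status_code', 500) fallback.
    return getattr(exc, "status_code", 500)
```

Exceptions without the attribute still fall back to 500, so only intentional guardrail blocks change status.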

Fixes #24348

---------

Co-authored-by: Michael-RZ-Berri <michael@berri.ai>
Co-authored-by: Michael Riad Zaky <michaelr@Mac.localdomain>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: Krrish Dholakia <krrish+github@berri.ai>

* fix(router/proxy): address Greptile P1+P2 review comments on PR #28161

- router: raise ServiceUnavailableError (503) instead of RouterRateLimitErrorBasic (429)
  when a specifically-addressed deployment is administratively blocked; 429 misleads
  retry-enabled clients into spinning forever against a paused model
- proxy_server: compute get_fully_blocked_model_names() once before both branches in
  model_list() instead of duplicating the call in each branch
- deepseek: upgrade silent debug log to warning when injecting placeholder
  reasoning_content so callers are clearly notified of degraded multi-turn quality
- tests: update two blocked-deployment assertions to expect ServiceUnavailableError

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: address bug detection findings (cache token order, mutable defaults)

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix: address bugs in async pass-through, anthropic cache token detection, rerank tests

- async_get_available_deployment_for_pass_through: enforce blocked check on specific deployments
- cost_calculator: detect anthropic-style usage by attribute presence (not truthiness) to avoid mixing OpenAI cached_tokens into anthropic normalization when read=0
- dashscope rerank tests: pass request to httpx.Response constructions for consistency

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix code qa

* fix(vertex_ai/gemini): strip MIME parameters from GCS contentType

GCS object metadata's contentType field can include parameters such as
'text/html; charset=utf-8'. Strip them in _apply_gemini_mime_type_aliases
so downstream get_file_extension_from_mime_type sees a bare MIME type.
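
The stripping step amounts to something like the sketch below. strip_mime_parameters is a hypothetical name; the actual change lives inside _apply_gemini_mime_type_aliases.

```python
def strip_mime_parameters(content_type: str) -> str:
    # Keep only the type/subtype that precedes the first ';'.
    return content_type.split(";", 1)[0].strip()
```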

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(vertex_ai/gemini): clarify mime-type error message string concatenation

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* feat(oci): add embeddings, fix streaming/reasoning, expand model catalog

- Add OCIEmbedConfig with full Cohere embed support (7 models, batch up to 96)
- Fix sync streaming: split SSE events on \n\n before JSON parsing
- Fix reasoning models (Gemini 2.5, xAI Grok): make completionTokens and message
  optional in OCIResponseChoice to handle responses where max_tokens is
  exhausted during reasoning
- Fix compartment_id resolution in chat transform to use resolve_oci_credentials
- Fix tool call id: make OCIToolCall.id optional, generate UUID fallback for
  providers (Google via OCI) that omit it
- Add OCI_KEY env var support for inline PEM keys
- Fix datetime.utcnow() deprecation in request signing
- Expand model catalog: 29 OCI models including Llama 4, Gemini 2.5, xAI Grok,
  Cohere Command A, and all Cohere embed variants
- Add 37 live integration tests: sync/async completions for Meta/Google/xAI/Cohere,
  sync/async embeddings, tool use across all vendors, streaming, env var auth
- Add 23 embed unit tests covering all transform and validation paths

* fix(oci): remove dead OCI elif branch in utils.py, align async split_chunks with sync version

* test(oci): add unit tests for split_chunks fix and no-duplicate-OCI-branch guard

* fix(oci): address remaining bugs from issue #25082 — streaming signed body, Cohere stop sequences, hardcoded defaults

- Bug 1: sync and async streaming paths now use signed_json_body when provided
  instead of re-serializing data with json.dumps() — the OCI RSA-SHA256 signature
  covers the exact request body bytes, so re-serializing produces an invalid sig
- Bug 3: Cohere stop sequences now map to 'stopSequences' (was incorrectly 'stop')
- Bug 4: removed hardcoded Cohere defaults (maxTokens=600, temperature=1, topK=0,
  topP=0.75, frequencyPenalty=0) that silently overrode user intent on every call
- Added 6 unit tests covering all three fixes
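
Why Bug 1 matters can be seen in a small sketch. The payload values and the separator choice are illustrative assumptions; the point is only that equal data can serialize to different bytes, and the digest covers bytes.

```python
import base64
import hashlib
import json

payload = {"compartmentId": "ocid1.example", "isStream": True}  # illustrative

# The exact bytes that were hashed and signed.
signed_json_body = json.dumps(payload, separators=(",", ":")).encode()

# Re-serializing the same data with default separators gives other bytes.
reserialized = json.dumps(payload).encode()


def body_digest(body: bytes) -> str:
    # Base64-encoded SHA-256 of the request body bytes.
    return base64.b64encode(hashlib.sha256(body).digest()).decode()
```

Both byte strings decode to the same JSON object, yet their digests differ, so a signature computed over one does not validate the other.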

* fix(oci): comprehensive code quality pass — bugs, tests, schema accuracy

- Fix Cohere tool call IDs (was always call_0; now UUID per call)
- Fix TOOL_CALL finish reason mapping in both sync and streaming paths
- Fix Cohere stop parameter mapping (stop → stopSequences)
- Remove hardcoded Cohere defaults (maxTokens/topK/topP/frequencyPenalty)
- Fix content[0] safety guard against empty content arrays
- Fix streaming signed body used consistently (not re-serialized)
- Raise OCIError (not bare Exception/ValueError) throughout
- Centralize OCI_API_VERSION constant; import uuid at module level
- Fix embed get_complete_url to strip trailing slashes from api_base
- Fix OCIEmbedResponse schema: add inputTextTokenCounts (actual OCI field)
- Fix embed usage computed from inputTextTokenCounts (sum of per-input counts)
- Fix Cohere toolCallId included in tool result messages
- Add OCIToolCall.id as Optional (absent in Google/xAI streaming chunks)
- Update tests to reflect correct behavior (no hardcoded defaults, UUID ids,
  deferred credential validation, OCIError vs ValueError, real response schema)

* test(oci): move integration tests to tests/llm_translation/

Addresses greptile P1: tests/test_litellm/ is for mock-only unit tests
(make test-unit target). Real-network OCI tests now live in the correct
location alongside other provider integration tests.

* fix(oci): align types and transformation with official OCI SDK

- Remove OCIVendors.GEMINI — apiFormat="GEMINI" is invalid; all non-Cohere
  models use apiFormat="GENERIC"
- Add toolChoice, logitBias, logProbs to OCIChatRequestPayload so params
  present in the mapping are no longer silently dropped by Pydantic
- Exclude n→numGenerations from Cohere param map (not a Cohere API field)
- Fix CohereToolResult: change callId/result to call/outputs matching
  the OCI SDK's CohereToolResult structure
- Fix CohereToolMessage: replace non-existent toolCallId with toolResults
  list; update adapt_messages_to_cohere_standard to build proper tool-result
  history entries by resolving tool call name+params from preceding assistant
  messages
- Map generic-model stream finish reasons to OpenAI convention
  (COMPLETE→stop, MAX_TOKENS→length, TOOL_CALLS→tool_calls), consistent
  with the existing Cohere streaming path
- Add optional id field to OCIEmbedResponse so valid API responses
  carrying an id are not rejected by the Pydantic model
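
The finish-reason mapping above can be sketched as a lookup table. The three pairs are taken from the commit message; the names and the pass-through behaviour for unknown values are assumptions.

```python
from typing import Optional

OCI_TO_OPENAI_FINISH_REASON = {
    "COMPLETE": "stop",
    "MAX_TOKENS": "length",
    "TOOL_CALLS": "tool_calls",
}


def map_finish_reason(oci_reason: Optional[str]) -> Optional[str]:
    if oci_reason is None:
        return None
    # Assumption: unknown values pass through unchanged.
    return OCI_TO_OPENAI_FINISH_REASON.get(oci_reason, oci_reason)
```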

* fix(oci): use 'output' key in Cohere tool result outputs (matches reference impl)

* fix(oci): port schema/type utilities from langchain-oracle reference impl

- Add resolve_oci_schema_refs: inline $ref/$defs — OCI rejects JSON Schema refs
- Add resolve_oci_schema_anyof: flatten Optional[T] anyOf (Pydantic v2 emits these)
- Add sanitize_oci_schema: strip title, normalise null types, ensure array items
- Add OCI_JSON_TO_PYTHON_TYPES: Cohere expects Python type names (str/int/float),
  not JSON Schema names (string/integer/number)
- Add enrich_cohere_param_description: embed enum/format/range/pattern constraints
  into description since CohereParameterDefinition has no dedicated fields
- Apply all of the above in adapt_tool_definitions_to_cohere_standard and
  adapt_tool_definition_to_oci_standard
- Fix toolChoice conversion: map OpenAI string ('auto','none','required') to OCI
  dict form ({"type":"AUTO"} etc.) — the API rejects plain strings
- Update unit test expectations to match correct Python type names and enriched
  descriptions
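
The anyOf flattening step might look roughly like this. It is a sketch of the idea only: flatten_optional_anyof is a hypothetical name, and the real resolve_oci_schema_anyof may handle more cases.

```python
from typing import Any, Dict


def flatten_optional_anyof(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Collapse anyOf: [T, {"type": "null"}] into T, without mutating input."""
    out = dict(schema)
    variants = out.get("anyOf")
    if isinstance(variants, list):
        non_null = [
            v for v in variants
            if not (isinstance(v, dict) and v.get("type") == "null")
        ]
        if len(non_null) == 1 and isinstance(non_null[0], dict):
            # Optional[T]: drop the wrapper and inline the single real variant.
            out.pop("anyOf")
            out.update(non_null[0])
    if isinstance(out.get("properties"), dict):
        out["properties"] = {
            key: flatten_optional_anyof(value) if isinstance(value, dict) else value
            for key, value in out["properties"].items()
        }
    if isinstance(out.get("items"), dict):
        out["items"] = flatten_optional_anyof(out["items"])
    return out
```

A field declared as Optional[str] then reaches the API as a plain string type rather than an anyOf pair.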

* refactor(oci): split transformation.py into cohere.py and generic.py

transformation.py was 1 243 lines doing too many jobs. Split along the
same boundaries as the langchain-oracle reference (providers/cohere.py,
providers/generic.py):

  chat/cohere.py   — Cohere message/tool building, response + stream parsing
  chat/generic.py  — Generic message/tool building, response + stream parsing
  transformation.py — thin OCIChatConfig orchestrator + OCIStreamWrapper

Public symbols (OCIChatConfig, OCIStreamWrapper, adapt_messages_to_*,
OCIRequestWrapper, version, …) remain importable from transformation.py
for backward compatibility. OCIStreamWrapper gains delegating shims for
_handle_cohere_stream_chunk and _handle_generic_stream_chunk so existing
test call sites keep working unchanged.

transformation.py: 1 243 → 620 lines

* refactor(oci): principal-level code quality pass

- Remove _extract_text_content duplication — single definition in cohere.py,
  imported where needed; instance method on OCIChatConfig eliminated
- Move cryptography imports to module level with _CRYPTOGRAPHY_AVAILABLE flag
  and _require_cryptography() guard; no more re-import on every signing call
- Move litellm version import to module level via litellm._version; remove
  inline import inside validate_oci_environment
- sign_with_manual_credentials now returns Tuple[dict, bytes] matching
  sign_with_oci_signer — asymmetry eliminated, Optional[bytes] guards removed
  throughout stream wrappers (signed_json_body: bytes = b"")
- Rename _openai_to_oci_cohere_param_map → openai_to_oci_cohere_param_map
  for consistency with openai_to_oci_generic_param_map
- Remove double-key bug in map_openai_params where responseFormat was stored
  under both OCI and OpenAI key names simultaneously
- Remove delegating shims (adapt_messages_to_cohere_standard,
  adapt_tool_definitions_to_cohere_standard, _handle_generic_stream_chunk)
  from OCIChatConfig/OCIStreamWrapper; tests now import directly from
  cohere.py and generic.py where symbols live
- Trim __all__ to 7 genuine public symbols; remove the 13-symbol list that
  existed only to support test imports
- Collapse per-model integration test classes into pytest.mark.parametrize;
  CHAT_MODELS list is the single source of truth for model-specific config
- Black + Ruff clean across all OCI files

* fix(oci): address PR review findings

- types/llms/oci.py: add "TOOL_CALL" to CohereChatResponse.finishReason
  Literal so Pydantic does not raise ValidationError on non-streaming
  Cohere tool-use calls (Greptile P1)
- test_oci_cohere_tool_calls.py: add test covering TOOL_CALL finish reason
- model_prices_and_context_window.json: remove 6 duplicate oci/cohere.embed-*
  keys that were silently overridden by the more complete entries already
  present in the file (Greptile P1)
- common_utils.py: move OCI_API_VERSION here from chat/transformation.py
  so embed/transformation.py does not need to import chat/transformation;
  change Protocol stub body from ... to pass (CodeQL "statement no effect");
  add comment to sha256_base64 clarifying it implements OCI HTTP signing
  spec, not password hashing (CodeQL false positive)
- chat/transformation.py: import CustomStreamWrapper from
  litellm_core_utils.streaming_handler instead of litellm.utils to reduce
  import cycle depth (CodeQL cyclic import)
- chat/cohere.py, chat/generic.py: import Usage and
  ChatCompletionMessageToolCall from litellm.types.utils instead of
  litellm.utils for the same reason
- embed/transformation.py: import OCI_API_VERSION from common_utils
  instead of chat/transformation (removes the embed→chat import edge)

* test(oci): add unit tests to improve patch coverage

- test_oci_common_utils.py (new): covers sha256_base64, build_signature_string,
  OCIRequestWrapper.path_url, resolve_oci_credentials, get_oci_base_url,
  validate_oci_environment, sign_with_oci_signer error paths, sign_oci_request
  routing, load_private_key_from_file error paths, resolve_oci_schema_refs
  (including circular ref and external $ref), resolve_oci_schema_anyof,
  sanitize_oci_schema (all branches), enrich_cohere_param_description
- test_oci_generic_chat.py (new): covers content-message error paths (non-dict
  item, unsupported type, non-string text, invalid image_url), tool-call
  validation error paths, adapt_messages_to_generic_oci_standard error paths,
  handle_generic_response (None message, text content, tool calls),
  handle_generic_stream_chunk (finish reasons, streaming tool calls),
  OCIStreamWrapper non-string chunk error
- test_oci_chat_transformation.py: add error paths for validate_environment
  (empty messages), transform_request (missing compartment_id, Cohere without
  user messages), transform_response (error key), map_openai_params
  (unsupported param with and without drop_params), tool_choice string mapping
- test_oci_cohere_tool_calls.py: add edge cases for stream chunk finish
  reasons (TOOL_CALL, MAX_TOKENS, unknown), _extract_text_content with
  non-dict list items and non-string input,
  adapt_messages_to_cohere_standard with malformed JSON tool arguments

* fix(oci): rename supports_streaming to supports_native_streaming in model prices

The JSON schema for model_prices_and_context_window.json uses
`supports_native_streaming` (not `supports_streaming`) and has
`additionalProperties: false`. Rename the field across all OCI
entries to pass the schema validation test.

* test(oci): add 67 tests targeting uncovered happy paths for coverage

Boost patch coverage on the four lowest-coverage OCI files:
- common_utils.py: sign_with_manual_credentials (oci_key / oci_key_file
  paths), sign_oci_request routing, _require_cryptography
- generic.py: adapt_messages_to_generic_oci_standard (all roles),
  adapt_tool_definition_to_oci_standard, adapt_tools_to_openai_standard,
  handle_generic_stream_chunk text/finish-reason paths
- cohere.py: _extract_text_content, adapt_messages_to_cohere_standard
  (all roles including tool results), handle_cohere_response /
  handle_cohere_stream_chunk all finish-reason branches
- transformation.py: get_vendor_from_model, OCIChatConfig._get_optional_params
  (toolChoice string→dict, responseFormat, tools for both vendors),
  transform_request for GENERIC model, get_sync/async_custom_stream_wrapper
  with mocked HTTP, OCIStreamWrapper.chunk_creator happy paths

* fix(oci): suppress CodeQL false positive on sha256_base64 (OCI HTTP signing, not password hashing)

* fix(oci): remove 6 duplicate model price entries and reconcile conflicting values

Six OCI chat model keys appeared twice in model_prices_and_context_window.json
with conflicting pricing/context data (JSON parsers silently discard the first
occurrence of a duplicate key).
Remove the first-occurrence entries and update the surviving entries:
- meta.llama-4-maverick / llama-4-scout: keep updated entries (free preview
  pricing, larger context windows, vision support)
- meta.llama-3.1-70b: keep original pricing, restore supports_native_streaming
- google.gemini-2.5-{flash,pro,flash-lite}: keep OCI pricing page values,
  restore supports_native_streaming

* fix(oci): route GPT-5 family to maxCompletionTokens

GPT-5 / GPT-5-mini / GPT-5-nano / GPT-5.5 on OCI reject "maxTokens"
with HTTP 400:

  Invalid 'maxTokens': Unsupported parameter: 'maxTokens' is not
  supported with this model. Use 'maxCompletionTokens' instead.

(Same convention as OpenAI's reasoning-API contract.)

Add a model-aware rename in OCIChatConfig._get_optional_params so the
request payload uses maxCompletionTokens when the model id starts with
openai.gpt-5. Regular Llama / Cohere / Gemini / GPT-4.x continue to use
maxTokens unchanged.

Also widen OCIChatRequestPayload to carry the new optional field so it
survives Pydantic serialization.
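
The model-aware rename can be sketched as below. route_max_tokens is a hypothetical name; only the openai.gpt-5 prefix and the two field names come from the description above.

```python
from typing import Any, Dict


def route_max_tokens(model: str, params: Dict[str, Any]) -> Dict[str, Any]:
    out = dict(params)  # leave the caller's dict untouched
    if model.startswith("openai.gpt-5") and "maxTokens" in out:
        # GPT-5 family on OCI rejects maxTokens; send the renamed field.
        out["maxCompletionTokens"] = out.pop("maxTokens")
    return out
```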

Verified live against OCI us-chicago-1:
- openai.gpt-5, gpt-5-mini, gpt-5-nano, gpt-5.5 all return 200
- Full feature sweep on gpt-5.5 (basic, system, multi-turn, streaming,
  tools, usage) all green
- meta.llama-3.3-70b-instruct still uses maxTokens (no regression)

4 new unit tests cover the helper, the routing in both pre- and
post-translation states, and Pydantic serialization.

* ci(oci): fix CI failures — black formatting + recursive_detector ignore

- Run black on litellm/llms/oci/common_utils.py + 3 OCI test files
  that drifted out of black-compliance during the rebase.
- Add the three bounded recursive functions in oci/common_utils.py
  (`_resolve`, `resolve_oci_schema_anyof`, `sanitize_oci_schema`) to
  the recursive_detector IGNORE_FUNCTIONS list. All three are bounded:
  `_resolve` uses a `resolving_stack` cycle guard; the other two are
  bounded by JSON-schema tree depth (no cycles in well-formed input),
  matching the pattern of the existing OCI/Vertex schema walkers
  already on the list.

* fix(oci): silence MyPy errors in cohere.py — typed-dict access

Two errors flagged by `lint` CI:

  llms/oci/chat/cohere.py:73:  "object" has no attribute "__iter__"
  llms/oci/chat/cohere.py:119: No overload variant of "get" of "dict"
                               matches argument types "object", "CohereToolCall"

Both stem from `msg.get("tool_calls")` / `msg.get("tool_call_id")`
returning `object` per the AllMessageValues TypedDict union. Bind to
`Any` locally for the iteration and coerce the lookup key with `str()`,
removing the now-unused `# type: ignore` on those lines.

No behaviour change — pure type-narrowing for the type checker.

* fix(oci): silence CodeQL py/weak-sensitive-data-hashing on sha256_base64

CodeQL's taint analysis traces request bodies back to environment-loaded
secrets and flags `hashlib.sha256(body).digest()` as
`py/weak-sensitive-data-hashing` — even though SHA-256 is the algorithm
mandated by the OCI HTTP request signing spec for the
`x-content-sha256` header (not a password/secret hash).

The previous suppression used legacy `# lgtm[...]` syntax which the
modern CodeQL action ignores. Switch to Python's standard
`hashlib.sha256(..., usedforsecurity=False)` (Python 3.9+) which CodeQL
honours as a non-security declaration. Behaviour unchanged.
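
A minimal sketch of the helper with that flag, assuming Python 3.9+; the exact body of the real sha256_base64 is not shown in this log.

```python
import base64
import hashlib


def sha256_base64(body: bytes) -> str:
    # usedforsecurity=False declares a non-security use (content integrity);
    # the digest itself is identical to a plain hashlib.sha256 call.
    digest = hashlib.sha256(body, usedforsecurity=False).digest()
    return base64.b64encode(digest).decode("utf-8")
```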

* feat(oci): add reasoning_effort passthrough — only true missing primitive

OCI's GenericChatRequest exposes a reasoningEffort field
(NONE/MINIMAL/LOW/MEDIUM/HIGH) that's the single biggest cost knob for
reasoning-capable models on the service:

  - GPT-5 family
  - Gemini 2.5
  - Grok reasoning variants (3-mini, 4-fast, 4.20)
  - Cohere Command-A-Reasoning

Setting reasoning_effort=LOW typically cuts reasoning-token spend 5-10×
vs the default. Without exposing this, litellm users had no way to tune
cost-vs-quality on these models.

The other GenericChatRequest fields (verbosity, parallel_tool_calls,
logit_bias, n, metadata, web_search_options, prediction) are not
exposed because they are not missing primitives: they either duplicate
prompt engineering or framework-level controls, or are too niche to
justify the maintenance surface. We only ship what users genuinely
can't accomplish another way.

Excluded from the Cohere v1 param map: CohereChatRequest has no
reasoningEffort field, and Cohere reasoning models
(cohere.command-a-reasoning) use COHEREV2 which is a separate request
type not covered by this PR.

Verified live: GPT-5.5 + reasoning_effort="HIGH" sends
{"reasoningEffort": "HIGH"} on the wire and OCI accepts the request.

* feat(oci): reasoning_effort + reasoning_tokens for OCI GenAI

Three small additions for OCI reasoning models, requested by users
testing the PR in production fork builds:

1. **reasoning_effort param mapping (GENERIC vendors).** OCI expects
   uppercase levels ("LOW"/"MEDIUM"/"HIGH"/"NONE") on `reasoningEffort`,
   but OpenAI-compatible clients send lowercase. Mapped + uppercased in
   `_get_optional_params`. Marked unsupported on Cohere V1/V2 since OCI
   Cohere has no reasoning models (avoids Pydantic validation failure
   on CohereChatRequest).

2. **"disable" → "NONE" mapping.** OpenAI uses "disable" to turn off
   reasoning; OCI uses "NONE". Without this, callers get a 400.

3. **reasoning_tokens propagated to Usage.** OCI returns
   `completionTokensDetails.reasoningTokens` but it wasn't being passed
   to LiteLLM's Usage object. Now flows through to
   `Usage.completion_tokens_details.reasoning_tokens` so callers can
   track reasoning token consumption for cost/observability.
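
Items 1 and 2 together reduce to a small normalization step, sketched here. to_oci_reasoning_effort is a hypothetical name; the real mapping sits in `_get_optional_params`.

```python
from typing import Optional


def to_oci_reasoning_effort(value: Optional[str]) -> Optional[str]:
    if value is None:
        return None
    if value.lower() == "disable":
        # "disable" has no uppercase equivalent on OCI; it maps to "NONE".
        return "NONE"
    return value.upper()
```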

Tests: 7 new unit tests in TestOCIReasoningEffort covering upper/lower
case, "disable"→"NONE", Cohere drop/raise paths, and reasoning_tokens
extraction (with and without completionTokensDetails). 5 new live
integration tests against xai.grok-3-mini in us-chicago-1 verifying the
full request/response loop end-to-end. Existing
test_transform_response_simple_text assertion that
completion_tokens_details was None has been updated to assert
reasoning_tokens flows through.

Verified live on xai.grok-3-mini: reasoning_effort=low → OCI accepts
"LOW", returns reasoningTokens=316 in usage. reasoning_effort=disable
→ OCI accepts "NONE". Full suite: 370/370 unit + 51/51 integration.

* fix(codeql): re-scope py/weak-sensitive-data-hashing exclusion to OCI signing file

CodeQL's taint analysis re-fires the `py/weak-sensitive-data-hashing`
alert at `litellm/llms/oci/common_utils.py:103` whenever upstream code
paths into the OCI signing module change (touching `transformation.py`
opens new flow paths that CodeQL re-evaluates from scratch). The
`hashlib.sha256(..., usedforsecurity=False)` declaration silences the
direct-call form of the query but not the taint-flow form.

SHA-256 here is mandated by the OCI HTTP signing specification for the
x-content-sha256 content-integrity header — not for password storage:
https://docs.oracle.com/en-us/iaas/Content/API/Concepts/signingrequests.htm

CodeQL has no per-query path filter and GitHub Code Scanning ignores
inline lgtm/codeql comments, so path-ignoring this single ~560-line
signing utility file is the narrowest available suppression. All other
files retain full coverage of py/weak-sensitive-data-hashing — including
litellm/proxy/utils.py where the rule legitimately applies.

This restores the NEUTRAL CodeQL state the PR had on prior commits
(see `2111c98af7` for the same approach on the earlier iteration of this
branch, before the cherry-pick was rebased onto a different baseline).

* fix(oci): drop duplicate text on Cohere streaming terminal chunk

OCI Cohere's terminal SSE event re-sends the full assembled response in
`text` alongside a populated `chatHistory`. Emitting that text as another
delta concatenates the entire response onto the already-streamed output
(e.g. "How can I help?How can I help?").

Use `chatHistory is not None` as the discriminator for the consolidated
terminal event — `finishReason` is a weaker signal that could in principle
appear on a non-consolidated chunk. The two coincide today; this preserves
correctness if OCI ever ships finishReason on an incremental chunk.

Adds a live-OCI integration regression test that compares streamed vs
non-streamed length and asserts the response prefix appears only once.
Verified to fail under the previous code with the exact reported
reproduction: 'Hello! How can I help you today?Hello! How can I help you today?'.

Reported by @gotsysdba on PR #25177.

* fix(oci): buffer SSE stream across HTTP read boundaries

The old split_chunks helper split each individual HTTP read on "\n\n",
which assumed SSE event boundaries always aligned with read boundaries.
In practice the OCI streaming endpoint delivers events that may:

  - straddle two reads (chunk_creator gets a truncated JSON and crashes)
  - arrive separated by a single "\n" instead of "\n\n"
  - share a read with multiple complete events

Replace the inline split with module-level helpers _iter_sse_events
(sync) / _aiter_sse_events (async) that maintain a buffer across reads,
split on any newline, and yield only complete "data:" lines.
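
The buffering approach can be sketched roughly as follows. This is an illustration only: the name `iter_sse_events` (without the leading underscore) and the use of `str` reads are assumptions, and the real helpers may operate on bytes and differ in detail.

```python
from typing import Iterable, Iterator


def iter_sse_events(reads: Iterable[str]) -> Iterator[str]:
    """Yield complete "data:" payloads from reads that may split events anywhere."""
    buffer = ""
    for read in reads:
        buffer += read
        # Everything before the last newline is a run of complete lines;
        # the remainder stays buffered until the next read arrives.
        *lines, buffer = buffer.split("\n")
        for line in lines:
            line = line.rstrip("\r")  # tolerate "\r\n" line endings
            if line.startswith("data:"):
                yield line[len("data:"):].strip()
            # Blank separators and comment/keepalive lines are skipped.
    # Flush a trailing event that was not newline-terminated at EOF.
    tail = buffer.rstrip("\r")
    if tail.startswith("data:"):
        yield tail[len("data:"):].strip()
```

Because the buffer carries over between reads, an event that straddles two reads is only yielded once its terminating newline (or EOF) has been seen.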

Add 25 regression tests covering event-split-across-reads, tiny-chunk
reads, single-newline separators, keepalive/comment lines, trailing
partial events flushed at EOF, "\r\n" line endings, and an end-to-end
smoke test that feeds an awkwardly-chopped payload through the splitter
into OCIStreamWrapper.chunk_creator.

Reported by John Lathouwers.

* test(oci): repoint TestOCIKeyNormalization to sign_with_manual_credentials

The signing helper moved from OCIChatConfig._sign_with_manual_credentials
to a module-level sign_with_manual_credentials in common_utils.py. Four
tests in TestOCIKeyNormalization still called the old method:

  - 2 failed outright with AttributeError
  - 2 passed by accident because they used pytest.raises(Exception),
    which happily caught the AttributeError instead of exercising the
    intended OCIError path

Repoint all four to the new module-level function so they exercise the
actual oci_key type-validation branch.

* fix(oci): validate oci_region before URL interpolation to prevent SSRF

Anchor oci_region to ^[a-z][a-z0-9-]{0,30}[a-z0-9]$ inside get_oci_base_url
so user-supplied regions that would redirect the signed request to an
attacker-controlled host (e.g. 'evil.com/#') fail with HTTP 400 before
the URL or signature is built. Empty string still falls back to the
us-ashburn-1 default, so existing callers are unaffected.
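
A hedged sketch of the validation, using the pattern quoted above. The function name `build_base_url` is hypothetical, and a `ValueError` stands in for the HTTP 400 error the real `get_oci_base_url` is described as raising. The host template follows the endpoint format that appears elsewhere in this log.

```python
import re

# Pattern quoted from the commit message.
_REGION_RE = re.compile(r"^[a-z][a-z0-9-]{0,30}[a-z0-9]$")
DEFAULT_REGION = "us-ashburn-1"


def build_base_url(region: str = "") -> str:
    """Hypothetical stand-in for the region handling in get_oci_base_url."""
    region = region or DEFAULT_REGION  # empty string falls back to the default
    # fullmatch (rather than match) also rejects a trailing newline,
    # which "$" alone would tolerate in Python.
    if not _REGION_RE.fullmatch(region):
        raise ValueError(f"invalid OCI region: {region!r}")
    return f"https://inference.generativeai.{region}.oci.oraclecloud.com"
```

The check runs before any interpolation, so a value such as `'evil.com/#'` never reaches the URL or the signing step.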

* test(audio): skip when gpt-4o-audio-preview is unavailable upstream

OpenAI retired `gpt-4o-audio-preview` (404 model_not_found in CI as of
2026-05-19), and the existing try/except in these tests only re-raised
on 'openai-internal' errors. Other exceptions were silently swallowed,
so the next line ran with an unbound `response`/`completion` and
failed with an unrelated UnboundLocalError that masked the real cause.

Extend the skip condition to also cover model_not_found / 'does not exist'
so the suite reports the upstream outage cleanly, matching the pattern
used in ce87c41 for the realtime and nvidia_nim rerank tests.
Re-raise unknown exceptions instead of falling through.

* fix(oci/router): catalog-driven maxCompletionTokens; generic blocked-deployment message

- Drive OCI maxCompletionTokens via supports_reasoning from the model
  catalog instead of a hardcoded openai.gpt-5 prefix. Add OCI GPT-5 family
  entries (gpt-5, gpt-5-mini, gpt-5-nano) with supports_reasoning: true.
  Gate the override to non-Cohere vendor so Cohere reasoning models keep
  maxTokens (Cohere endpoint does not accept maxCompletionTokens).
- Replace proxy-specific 'Contact your proxy admin' phrasing in the four
  Router blocked-deployment ServiceUnavailableError messages with neutral
  SDK-appropriate text.

* fix(oci/cohere): guard handle_cohere_response against missing usage

* fix(oci): address bug review findings in chat transformation

- Cohere param map: keep tool_choice/n as False (not omitted) so unsupported
  params are dropped or rejected rather than silently passed through.
- get_complete_url: when an explicit api_base/litellm.api_base is provided,
  use it as-is instead of unconditionally appending /20231130/actions/chat
  (mirrors the embed config behavior).
- Cohere stream: require both chatHistory and finishReason to be present to
  identify a terminal consolidation chunk, avoiding silent text suppression
  if chatHistory ever appears on a non-terminal chunk.
- Generic usage: use 'is not None' for reasoningTokens so a legitimate value
  of 0 is preserved instead of being treated as absent.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/cohere): emit tool calls in streaming and null content when text empty

handle_cohere_response now sets message.content to None when the Cohere
response text is empty, matching the OpenAI convention for tool-call-only
responses.

handle_cohere_stream_chunk now extracts toolCalls — both directly from
the chunk and from the terminal chunk's chatHistory CHATBOT message —
and emits them in the delta. Previously, CohereStreamChunk lacked a
toolCalls field, so any tool calls in the stream were silently dropped.

* fix(oci): preserve tool results, embed URL path, and generic finish reason

- Use SerializeAsAny on CohereChatRequest.chatHistory so subclass-specific
  fields like CohereToolMessage.toolResults are not dropped during Pydantic
  v2 serialization.
- Make OCIEmbedConfig.get_complete_url append the /20231130/actions/embedText
  action path consistently with chat, so setting litellm.api_base to the
  region inference base URL no longer posts to the bare hostname.
- Map OCI finishReason (COMPLETE / MAX_TOKENS / TOOL_CALLS) to OpenAI
  finish_reason values in handle_generic_response, mirroring the streaming
  handler and the Cohere non-streaming handler.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/generic): silence mypy assignment error on dynamic finish_reason

* fix(oci/embed): always set usage on embedding response

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/chat): append /20231130/actions/chat to explicit api_base

Restore the embed-style behavior so OCIChatConfig.get_complete_url always
appends the OCI GenAI chat path. Routing through get_oci_base_url ensures the
optional explicit api_base has its trailing slash stripped before the suffix is
joined, matching the embed config and the test_respects_explicit_api_base
expectation.

* fix(oci/cohere): mark logprobs/logit_bias unsupported and normalize unknown stream finish reasons

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/cohere): preserve trailing tool result in chatHistory

When the last message in the OpenAI-format input is a tool result (the
standard agentic continuation pattern), the prior messages[:-1] slice
silently dropped that tool result from chatHistory and the model never
saw it. Excluding the last user message by index instead keeps tool
results that trail the last user turn intact.
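
To illustrate the difference between slicing off the final message and excluding the last user message by index, here is a small sketch. The function name `split_last_user_message` and the plain-dict message shape are assumptions for illustration, not the actual Cohere request builder.

```python
from typing import Any, Dict, List, Optional, Tuple

Message = Dict[str, Any]


def split_last_user_message(
    messages: List[Message],
) -> Tuple[Optional[Message], List[Message]]:
    """Return (last user message, history without that message)."""
    user_indexes = [i for i, m in enumerate(messages) if m.get("role") == "user"]
    if not user_indexes:
        return None, list(messages)
    last_user_idx = user_indexes[-1]
    # Exclude only the last user turn; anything after it (such as a
    # trailing tool result) stays in the history.
    history = [m for i, m in enumerate(messages) if i != last_user_idx]
    return messages[last_user_idx], history
```

With a conversation ending in a tool result, `messages[:-1]` would discard that result, whereas the index-based split keeps it in the history.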

* fix(main): remove dead OCI embedding elif block

The earlier elif at line 5119 already routes OCI embeddings through the
base HTTP handler with the headers None-guard, so the later identical
block was unreachable dead code.

* test(oci): move integration tests out of llm_translation mock-only folder

Greptile flags tests/llm_translation/ as mock-only via a project-specific
rule; relocate the live-network OCI integration suite to tests/integration/
and adjust the in-file sys.path / run instructions accordingly.

* fix(oci/cohere): suppress tool calls on stream terminal consolidation chunk

The terminal SSE event re-sends the full assembled response in both
`text` and `chatHistory`. The existing logic already suppresses
`text` to avoid double-emit, but tool calls extracted from the
terminal chunk (via `typed_chunk.toolCalls` or the `chatHistory`
CHATBOT fallback) would still be re-emitted with fresh uuid4 IDs.
If OCI Cohere ever streams tool calls progressively in intermediate
chunks (now possible since CohereStreamChunk has a toolCalls field),
this would cause downstream agentic frameworks to execute each tool
call twice.

Suppress tool calls on the terminal consolidation chunk for the same
reason `text` is suppressed.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci,httpx): normalize finish_reason, preserve response_format, fix sync embed JSON content-type

- cohere.py / generic.py: normalize unknown OCI finishReason values (ERROR,
  ERROR_TOXIC, CONTENT_FILTERED, USER_CANCEL, ...) to 'stop' in non-streaming
  and streaming generic handlers, matching the streaming Cohere handler so
  downstream consumers switching on finish_reason aren't broken by raw OCI
  values.
- transformation.py: restore the dual-key alias so optional_params still
  carries the original 'response_format' key alongside the OCI-mapped
  'responseFormat'. Downstream litellm framework code (json_mode detection,
  logging) inspects 'response_format' after map_openai_params runs.
- llm_http_handler.py: make the sync embedding path mirror the async path —
  when sign_request returns no signed_body, send via json=data (which sets
  Content-Type: application/json) instead of data=json.dumps(data) which
  doesn't. Removes a sync/async behavioural asymmetry for non-OCI providers
  that adopt the sign_request pattern.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): clean up OCIChatConfig init, normalize generic stream finish reasons, correct embed sign_request return type

- Replace fragile setattr(self.__class__, ...) pattern in OCIChatConfig.__init__ with a @property for has_custom_stream_wrapper, matching the pattern used by other providers.
- Normalize unknown OCI finish reasons (e.g. ERROR, ERROR_TOXIC, USER_CANCEL) to 'stop' in handle_generic_stream_chunk, matching the existing Cohere stream handler behaviour.
- Tighten OCIEmbedConfig.sign_request return type from Tuple[dict, Optional[bytes]] to Tuple[dict, bytes] — sign_oci_request never returns None for the body, and this matches OCIChatConfig.sign_request.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): strip trailing action path in get_oci_base_url to avoid URL doubling

A fully-formed OCI endpoint URL (e.g. https://inference.generativeai.us-chicago-1.oci.oraclecloud.com/20231130/actions/chat) passed via api_base previously had the action path appended a second time by get_complete_url in both chat and embed configs, yielding a 404. get_oci_base_url now strips a trailing /20231130/actions/<name> so callers can always append the action path safely.
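
A rough sketch of the stripping step, assuming a hypothetical helper named `strip_action_path`; the real `get_oci_base_url` may implement it differently.

```python
import re

# Matches a trailing "/20231130/actions/<name>" with an optional final slash.
_ACTION_PATH_RE = re.compile(r"/20231130/actions/[^/]+/?$")


def strip_action_path(api_base: str) -> str:
    """Hypothetical helper: reduce an api_base to the bare inference host URL."""
    stripped = _ACTION_PATH_RE.sub("", api_base)
    # Also drop a trailing slash so a suffix can be appended safely.
    return stripped.rstrip("/")
```

After this normalization, appending `/20231130/actions/chat` yields the same URL whether the caller passed the bare host or the fully-formed endpoint.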

* fix(httpx): preserve sync embed data= kwarg to avoid breaking mock-based tests

The earlier sync_httpx_client.post() call passed data=json.dumps(data),
which downstream embedding tests assert on (e.g. tests for hosted_vllm,
jina_ai, watsonx). Switching to json=data changed the kwarg name and broke
those tests. The OCI signed_body path keeps using data=signed_body and is
unaffected.

* fix(oci): stable tool-call ids across stream chunks; lenient Cohere finishReason

- Replace random uuid4 per chunk with a deterministic content-derived
  digest for synthetic tool-call ids in both Cohere and Generic OCI
  handlers. Previously, when OCI omitted 'id' (always for Cohere, often
  for Generic streaming deltas), every chunk for the same logical tool
  call received a new uuid, causing downstream stream-mergers (which key
  off id) to treat each fragment as a distinct call.

- Relax CohereChatResponse.finishReason from a strict Literal[...] to
  Optional[str], matching CohereStreamChunk.finishReason. The
  handle_cohere_response 'elif oci_finish_reason is not None' fallback
  was previously unreachable because Pydantic raised ValidationError on
  any unknown value before the fallback executed. Now non-streaming
  responses degrade unknown reasons to 'stop' just like the streaming
  path.
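
The first item above can be illustrated with a small sketch. Which fields feed the digest is not stated in the commit, so hashing the tool name and its position is purely an assumption here, as are the helper name `synthetic_tool_call_id` and the `call_` prefix.

```python
import hashlib


def synthetic_tool_call_id(name: str, index: int) -> str:
    """Hypothetical: derive a stable id from the tool call's name and position."""
    # usedforsecurity=False (Python 3.9+) declares the hash is not a security control.
    digest = hashlib.sha256(
        f"{index}:{name}".encode("utf-8"), usedforsecurity=False
    ).hexdigest()
    return f"call_{digest[:24]}"
```

Unlike `uuid4`, the same inputs always produce the same id, so fragments of one logical tool call can be merged by a consumer that keys off `id`.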

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/embed): validate OCI credentials in validate_environment

Mirror OCIChatConfig.validate_environment so embedding requests fail
fast with a clear error when oci_user/oci_fingerprint/oci_tenancy/
oci_compartment_id or an oci_key/oci_key_file is missing, instead of
deferring the failure until sign_request.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(oci/embed): expect OCIError from validate_environment when credentials are missing

OCIEmbedConfig.validate_environment now raises eagerly (mirroring OCIChatConfig)
when oci_user/oci_fingerprint/oci_tenancy/oci_compartment_id or oci_key/oci_key_file
is missing. Update the test to match.

* fix(oci): polish stream chunk handling and signed body default

- cohere stream terminal consolidation now emits content=None instead of ""
- drop redundant index truthiness check (None is already replaced with 0)
- accept both "TOOL_CALL" and "TOOL_CALLS" finish reasons in cohere
- signed_json_body defaults to None and uses explicit None check, so an
  explicitly empty bytes body wouldn't be silently re-serialized

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/chat): catch pydantic ValidationError when parsing OCI responses

Pydantic v2 raises ValidationError (not TypeError) when field validation
fails, so malformed OCI completion responses or stream chunks would
propagate unhandled out of handle_generic_response,
handle_generic_stream_chunk, and handle_cohere_stream_chunk. Widen the
except clauses to also catch ValidationError so callers get a clean
OCIError.

* fix(oci/catalog): real prices for Llama 4, drop zero-cost OCI OpenAI entries

Zero-cost catalog entries (input_cost_per_token=0, output_cost_per_token=0)
make proxy spend tracking silently report $0 for these paid OCI models, so
any caller can drive them without decrementing a budget.

For Llama 4 Maverick and Scout, OCI charges the same character-based rate
as Llama 3.3 70B ($0.0018 per 10,000 characters), so use the same per-token
price as the existing oci/meta.llama-3.3-70b-instruct entry (7.2e-07 in/out).
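
The two figures are consistent if one assumes roughly four characters per token. That ratio is an assumption of this sketch (it is not stated in the commit); it is simply the value that reconciles the character rate with 7.2e-07.

```python
def per_token_price(price_per_10k_chars: float, chars_per_token: float = 4.0) -> float:
    """Convert a per-10,000-character price into an approximate per-token price."""
    price_per_char = price_per_10k_chars / 10_000
    # chars_per_token=4 is an assumed average, not an OCI-published figure.
    return price_per_char * chars_per_token


llama_price = per_token_price(0.0018)  # approximately 7.2e-07
```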

For oci/openai.gpt-5, gpt-5-mini, gpt-5-nano, gpt-oss-120b, and gpt-oss-20b,
no public per-token pricing is available; drop the entries so operators must
register them with explicit custom pricing. The existing GPT-5 reasoning test
fixture already injects synthetic entries when the catalog omits them, so the
chat transformation's supports_reasoning lookup keeps working in tests.

* fix(oci/chat): wrap CohereChatResult construction in try/except

Match the handle_generic_response pattern: surface OCIError with the
upstream status code instead of letting a raw pydantic.ValidationError
propagate when the Cohere response payload is malformed.

* fix(oci): harden Cohere stream/finish-reason and dedupe maxTokens param mapping

- Cohere stream: track per-stream tool-call emission and only suppress the
  terminal consolidation chunk's tool calls once they've been seen earlier.
  Prevents silent drop if tool calls are delivered exclusively on the
  terminal chunk.
- Cohere stream: emit content=None (not "") on non-terminal text-free
  chunks (e.g. tool-call-only / keep-alive) so downstream consumers that
  distinguish missing vs explicitly-empty deltas behave correctly.
- Generic handlers: accept singular TOOL_CALL finish reason in addition to
  TOOL_CALLS, matching the Cohere handlers.
- _get_optional_params: when both max_tokens and max_completion_tokens are
  provided, explicitly prefer max_completion_tokens instead of relying on
  dict iteration order.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): emit content=None instead of empty string for text-free generic stream chunks

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(oci): expect content=None for text-free generic stream chunks

handle_generic_stream_chunk now emits content=None instead of empty
string when a chunk carries no text parts. Update the corresponding
no-message test to match.

* codeql: narrow OCI sha256 suppression to query-filter, not whole file

paths-ignore was suppressing every CodeQL query on
litellm/llms/oci/common_utils.py, hiding all future findings in a
security-critical file (private key loading, credential resolution,
URL construction, RSA signing). Move the suppression for
py/weak-sensitive-data-hashing into query-filters so common_utils.py
remains fully analyzed by every other query.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): use locale-independent RFC 7231 date for manual signing

email.utils.formatdate(usegmt=True) emits canonical English weekday/
month abbreviations regardless of system locale, so signature
verification doesn't break on non-en_US deployments.
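
For reference, a short example of the standard-library call named above. The wrapper name `http_date` is hypothetical.

```python
from email.utils import formatdate
from typing import Optional


def http_date(timestamp: Optional[float] = None) -> str:
    """Return an RFC 7231 style date string; defaults to the current time."""
    # usegmt=True emits the literal "GMT" zone instead of "-0000".
    return formatdate(timeval=timestamp, usegmt=True)
```

For the Unix epoch, `http_date(0)` returns `Thu, 01 Jan 1970 00:00:00 GMT` regardless of the process locale, whereas `strftime("%a, %d %b %Y ...")` would follow the locale's day and month names.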

* fix(oci): strip 'oci/' prefix in get_vendor_from_model

Previously, get_vendor_from_model split on '.' without stripping the
optional 'oci/' provider prefix, so 'oci/cohere.command-a-03-2025' was
routed through the GENERIC pipeline instead of COHERE.
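
A sketch of the fix, assuming a simplified string return value; the real `get_vendor_from_model` likely returns an enum and may recognize more vendors. The name `vendor_from_model` is hypothetical.

```python
def vendor_from_model(model: str) -> str:
    """Hypothetical simplification: classify an OCI model id as COHERE or GENERIC."""
    # Strip the optional provider prefix before looking at the vendor segment.
    if model.startswith("oci/"):
        model = model[len("oci/"):]
    vendor = model.split(".", 1)[0].lower()
    return "COHERE" if vendor == "cohere" else "GENERIC"
```

Without the prefix strip, the vendor segment of `oci/cohere.command-a-03-2025` would be `oci/cohere`, which does not equal `cohere`.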

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* codeql: scope OCI sha256 suppression to common_utils.py via filter-sarif

Replace the global query-filters exclude for py/weak-sensitive-data-hashing
with a SARIF post-filter that only drops the alert when it originates from
litellm/llms/oci/common_utils.py, keeping the rule active on every other
SHA-256 callsite in the repository.

* Fix OCI chat bugs: tool_calls None key, dead max_tokens dedup, single-event stream text suppression

- handle_cohere_response: omit tool_calls key from message dict when None,
  matching the generic handler's behaviour and avoiding tripping consumers
  that key off 'tool_calls' in message.
- _get_optional_params: remove dead prefer_max_completion branch. By the
  time this helper runs, map_openai_params has already collapsed
  max_tokens/max_completion_tokens onto the OCI alias, so the OpenAI-key
  membership check is unreachable.
- handle_cohere_stream_chunk: add prior_text_emitted parameter mirroring
  prior_tool_calls_emitted. The terminal consolidation chunk's text is
  only suppressed when prior deltas already emitted text — otherwise
  (degenerate single-event stream) the text passes through so the
  response content isn't silently lost. OCIStreamWrapper now tracks
  emitted text alongside emitted tool calls.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): preserve all text parts in generic response and emit SYSTEM role for Cohere

- handle_generic_response: iterate all content parts and concatenate text
  (matches the streaming handler) so non-leading text parts are not lost
  and a leading non-text part does not suppress trailing text.
- adapt_messages_to_cohere_standard: emit CohereSystemMessage for system
  messages so direct callers do not silently drop them. The Cohere
  request builder filters system messages before calling this helper to
  avoid duplicating preambleOverride content into chatHistory.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): normalise dict-format tool_choice to OCI flat uppercase shape

The OCI Generative AI API only accepts toolChoice values of the form
{"type": "AUTO"|"NONE"|"REQUIRED"} or {"type": "FUNCTION",
"name": "<fn>"}. The previous conversion only handled string
tool_choice values, so OpenAI's standard dict shape
{"type": "function", "function": {"name": "<fn>"}} passed
through unchanged and was rejected by OCI with a 400.

Normalise the dict shape by uppercasing the discriminator and hoisting
the function name to the top level. Also accept dict variants of the
non-function selectors (e.g. {"type": "auto"}).
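
The normalization can be sketched as below. The function name `normalize_tool_choice` and the choice to raise `ValueError` on unrecognized shapes are assumptions of this sketch, not a description of the code at this commit.

```python
from typing import Any, Dict, Union

_SIMPLE_SELECTORS = {"AUTO", "NONE", "REQUIRED"}


def normalize_tool_choice(tool_choice: Union[str, Dict[str, Any]]) -> Dict[str, str]:
    """Convert an OpenAI-style tool_choice into OCI's flat uppercase shape."""
    if isinstance(tool_choice, str):
        kind = tool_choice.upper()
        if kind in _SIMPLE_SELECTORS:
            return {"type": kind}
        raise ValueError(f"unsupported tool_choice: {tool_choice!r}")
    kind = str(tool_choice.get("type", "")).upper()
    if kind == "FUNCTION":
        # Hoist the nested function name to the top level.
        name = (tool_choice.get("function") or {}).get("name")
        if not name:
            raise ValueError("tool_choice of type 'function' requires a name")
        return {"type": "FUNCTION", "name": name}
    if kind in _SIMPLE_SELECTORS:
        return {"type": kind}
    raise ValueError(f"unsupported tool_choice: {tool_choice!r}")
```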

* test(oci): exercise system-message filtering at transform_request boundary

adapt_messages_to_cohere_standard now emits SYSTEM-role entries by design
so direct callers don't silently drop system content. The Cohere request
builder filters system messages before calling the helper and routes them
into preambleOverride, so the user-visible 'no SYSTEM in chatHistory'
guarantee holds at the transform_request boundary, where the test should
live.

* fix(oci/chat): extract tool_choice/response_format helpers to satisfy PLR0915

_get_optional_params exceeded ruff's 50-statement cap. The toolChoice and
responseFormat normalisation blocks are self-contained mutations, so move
them to module-level helpers.

* fix(oci): normalize None finishReason in generic non-streaming handler; drop dead Cohere system-role branch

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/generic): silence mypy assignment error on cleared finish_reason

* fix(docker): install libatomic in builder for prisma nodeenv binary

The prebuilt node binary that prisma-python's nodeenv downloads links
against libatomic.so.1, which Wolfi does not pull in via gcc/nodejs.
Without this, fresh Docker builds (no GHA cache hit) fail at
`prisma generate` with:
  node: error while loading shared libraries: libatomic.so.1

* fix(oci): raise on invalid tool_choice instead of silently passing OpenAI shape

_normalize_tool_choice previously left an OpenAI-format dict in selected_params['toolChoice'] when the type was unrecognized or when 'FUNCTION' was given with a missing/empty name. OCI would then reject the request with a non-obvious error. Raise ValueError with a clear message in these cases.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): raise OCIError instead of ValueError in _normalize_tool_choice

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/generic): declare non-security intent on sha256 for synthetic tool-call id

* fix(oci): simplify _get_optional_params and reject invalid tool_choice types

- Collapse the two-loop _get_optional_params into a single pass with
  clear precedence (OpenAI key wins over OCI alias; first OpenAI key
  reaching a given OCI target wins). Removes the redundant maxTokens
  special-case in the second loop and makes the map_openai_params /
  transform_request handoff easier to reason about.
- Raise OCIError when _normalize_tool_choice sees an unexpected type
  (list, bool, int, ...) instead of silently letting it through to the
  OCI API where it would produce an opaque server-side error.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* Remove no-op data['stream'] deletion in OCI stream wrappers

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): always send Cohere isStream field explicitly

Match OCIChatRequestPayload by defaulting CohereChatRequest.isStream to
False instead of None so model_dump(exclude_none=True) does not silently
omit the field on non-streaming requests.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): revert Cohere isStream to Optional[bool]=None to preserve omission semantics

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/generic): raise OCIError on empty choices instead of IndexError

Pydantic accepts an empty choices list when validating OCICompletionResponse, so accessing chatResponse.choices[0] could raise an unhandled IndexError. Surface it as OCIError so the response error path is consistent with the existing (TypeError, ValidationError) guard.

* fix(oci/cohere): map top_k -> topK so Cohere topK param is settable

The Cohere param map (derived from the GENERIC map) had no entry for
topK. Since the simplified _get_optional_params only iterates over
param_map entries, callers had no way to pass topK to CohereChatRequest
(neither via an OpenAI-style key nor via the OCI alias).

Add 'top_k': 'topK' to the Cohere map only — OCIChatRequestPayload
(GENERIC) has no topK field. _get_optional_params accepts both the
OpenAI key (top_k) and the OCI alias (topK) in optional_params, so this
covers both calling conventions.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): tighten cohere stream dedup flags and forward stream args in embed signing

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/chat): reorder dict guard and wrap stream chunk json.loads

- Move isinstance(response_json, dict) check before .get("error") so
  the guard runs before the attribute access it is supposed to protect.
- Wrap json.loads in OCIStreamWrapper.chunk_creator with try/except so
  malformed SSE payloads surface as OCIError instead of a raw
  JSONDecodeError propagating out of the stream loop.

* fix(oci/cohere stream): only flag text emitted on non-empty content

An intermediate Cohere SSE chunk carrying text="" was flipping
_cohere_text_emitted via the "is not None" check, which then caused
the terminal consolidation chunk to drop its real text as a duplicate.
Use a truthy check so only actual content marks the stream as having
emitted text.

* test(oci): end-to-end proxy integration test against real OCI GenAI

Spins up the litellm proxy via the console-script entrypoint with a
minimal OCI-only config and drives real OpenAI-shaped HTTP requests
through it against OCI GenAI. Covers non-streaming chat, streaming
chat, embeddings, and /v1/models for Cohere, Llama, Gemini, and Grok.

Skips automatically when ~/.oci/config is absent or when the active
profile uses session-token auth (the OCI provider currently only
consumes OCI_* env vars; session tokens would need an in-process
signer). API-key profiles work out of the box.

* test(oci): move proxy integration test to tests/integration/

tests/llm_translation/ is mock-only; the OCI proxy integration test
spawns a real proxy subprocess and makes live HTTP calls, so move
it (and the companion config) to tests/integration/ alongside the
existing test_oci_integration.py.

* fix(oci): dedupe finish-reason mapping and batch Cohere tool results

- Extract _normalize_oci_finish_reason helper so the four chat handlers
  (Cohere/GENERIC, sync/stream) share one OCI->OpenAI mapping instead of
  four near-identical if/elif chains.
- Merge consecutive OpenAI tool-role messages into a single
  CohereToolMessage with multiple toolResults entries, matching the OCI
  Cohere API's expectation for parallel tool calls in one assistant turn.
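
A shared finish-reason helper of the kind described in the first item might look like this sketch. The specific OpenAI targets for COMPLETE and MAX_TOKENS (`stop` and `length`) and the pass-through of `None` are assumptions; the commits in this log only confirm that TOOL_CALL/TOOL_CALLS are both accepted and that unknown values degrade to `stop`.

```python
from typing import Optional

# Assumed OCI -> OpenAI mapping; see the caveats in the paragraph above.
_FINISH_REASON_MAP = {
    "COMPLETE": "stop",
    "MAX_TOKENS": "length",
    "TOOL_CALL": "tool_calls",
    "TOOL_CALLS": "tool_calls",
}


def normalize_finish_reason(oci_reason: Optional[str]) -> Optional[str]:
    """Hypothetical shared helper for the four chat handlers."""
    if oci_reason is None:
        # Assumption: an absent reason (e.g. a mid-stream chunk) stays absent.
        return None
    # Unknown values such as ERROR or USER_CANCEL degrade to "stop".
    return _FINISH_REASON_MAP.get(oci_reason.upper(), "stop")
```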

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): drop dead Cohere toolChoice field and emit GENERIC tool-call dicts inline

- Remove the unreachable toolChoice field from CohereChatRequest. The
  Cohere param map explicitly marks tool_choice as unsupported, so the
  field can never be populated through the normal optional_params flow
  and only confused the public model surface.
- Build GENERIC stream tool-call dicts inline (id/type/function shape)
  instead of round-tripping through ChatCompletionMessageToolCall and
  model_dump(). Matches handle_cohere_stream_chunk so downstream
  stream-mergers see the same minimal payload regardless of which
  vendor produced the chunk.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(docker): drop redundant libatomic from non_root builder

litellm_internal_staging already fixes the prisma `nodeenv` build
failure at the root cause by restoring `npm` to the builder (#28519):
with npm on PATH, prisma-python uses the system Node and never downloads
the nodeenv binary that links against libatomic.so.1. After merging
internal_staging the libatomic line is dead weight, so remove it.

https://claude.ai/code/session_01SwKzxRxgUhLFyyEf4UV812

* fix(oci/catalog): add openai.gpt-5{,-mini,-nano} entries with supports_reasoning

Without these catalog entries, supports_reasoning(model='openai.gpt-5*',
custom_llm_provider='oci') returned False, so _model_uses_max_completion_tokens
fell back to the default and OCI rejected the request with HTTP 400
('Use maxCompletionTokens instead.'). Add the three entries so the catalog-driven
maxCompletionTokens routing works against a stock LiteLLM install.

Also reword the test fixture docstring — the bundled backup now actually ships
these entries, so the fixture is only a fallback for environments that loaded
their cost map from a stale remote source.

---------

Co-authored-by: Tai An <antai12232931@outlook.com>
Co-authored-by: Vincent <yimao1231@gmail.com>
Co-authored-by: Kris Xia <xiajiayi0506@gmail.com>
Co-authored-by: d 🔹 <liusway405@gmail.com>
Co-authored-by: Fabrizio Cafolla <developer@fabriziocafolla.com>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>
Co-authored-by: Tom Denham <tom@tomdee.co.uk>
Co-authored-by: escon1004 <70471150+escon1004@users.noreply.github.com>
Co-authored-by: Divyansh Singhal <97736786+Divyansh8321@users.noreply.github.com>
Co-authored-by: robin-fiddler <robin@fiddler.ai>
Co-authored-by: Michael-RZ-Berri <michael@berri.ai>
Co-authored-by: Michael Riad Zaky <michaelr@Mac.localdomain>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: Krrish Dholakia <krrish+github@berri.ai>
Co-authored-by: Sameer Kankute <sameer@berri.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Federico Kamelhar <federico.kamelhar@oracle.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-05-23 12:15:41 -07:00
milan-berri
7270f723de
fix(mcp): forward upstream initialize instructions on cold gateway init (#28231)
Prefetch upstream InitializeResult.instructions before merging gateway
initialize options when YAML/DB do not set instructions, so clients receive
upstream server text on the first MCP initialize without list_tools.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 16:00:47 -07:00
Sameer Kankute
f35e7eb2f6
feat(guardrails): add Microsoft Purview DLP guardrail (#24966)
* feat(guardrails): add Microsoft Purview DLP guardrail

* fix(guardrails/purview): raise_for_status on HTTP errors, cap scope cache, reuse executor

* fix(guardrails/purview): propagate litellm_call_id as correlation_id to Purview

* chore: fixes

* refactor(guardrails): delegate get_user_prompt to get_last_user_message

PurviewGuardrailBase duplicated AzureGuardrailBase (and OpenAIGuardrailBase)
user-prompt extraction. The same logic already lived in
common_utils.get_last_user_message; wire guardrail bases to that helper,
fix the helper docstring, and drop its redundant self-import of
convert_content_list_to_str.

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(purview): make protection scope cache true LRU on hits

OrderedDict.get() does not update insertion order; call move_to_end on
TTL-valid cache hits so popitem(last=False) evicts least-recently-used
users instead of FIFO by first insert.

Add a regression test with a small max cache size.
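
The behavior can be reproduced with a small stand-alone cache. This sketch omits the TTL handling mentioned above, and the class name `LRUCache` is hypothetical; it only demonstrates why `move_to_end` is needed on hits.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUCache:
    """Minimal LRU cache built on OrderedDict (no TTL, for illustration only)."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._data:
            return None
        # Reads do not reorder an OrderedDict; mark the key as recently used.
        self._data.move_to_end(key)
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        # Assigning to an existing key keeps its old position, so move it too.
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry
```

With `max_size=2`, inserting `a` and `b`, reading `a`, then inserting `c` evicts `b`; without the `move_to_end` call in `get`, it would evict `a` instead.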

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* Fix mypy

* fix(guardrails/purview): harden user-id resolution and broaden DLP text

Prefer API key and proxy-injected metadata over client metadata for Entra
identity. Scan full message transcript pre-call and all completion choices
post-call. Align logging-only hook with the same user-id rules.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(guardrails/purview): scan /v1/completions prompt and TextChoices

Normalize text-completion prompts (string or list of strings); skip token-id-only
prompts. Run post-call DLP on TextCompletionResponse choices. Extend logging_only
hook for text_completion. Add tests and completion_prompt_to_str helper.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(purview-dlp): return data after DLP pass; per-call executor; dedupe text extraction

async_pre_call_hook now returns the request dict after a successful check so
callers match skip-path behavior. logging_hook uses a fresh ThreadPoolExecutor
per invocation like Presidio to avoid single-worker starvation. Response text
extraction is centralized in _completion_response_text_parts.

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(purview): fix LRU cache refresh position and add Responses API scanning

Two fixes to the Microsoft Purview DLP guardrail:

1. LRU cache bug (base.py): When a stale scope cache entry was re-fetched,
   the assignment to the existing key updated the value, but
   Python's OrderedDict.__setitem__ preserves the original insertion order for
   existing keys. This left the refreshed entry near the front of the dict,
   making it the first candidate for LRU eviction via popitem(last=False).
   Fix: call move_to_end(user_id) after every write to an existing key.

2. Responses API coverage gap (purview_dlp.py): Requests to /v1/responses use
   an 'input' field instead of 'messages' or 'prompt', so the pre-call hook
   returned without scanning the content. Similarly, post-call hook did not
   handle ResponsesAPIResponse.output. Fix: add _responses_api_input_to_str()
   helper and handle 'responses'/'aresponses' call types in async_pre_call_hook,
   async_post_call_success_hook (via _completion_response_text_parts), and
   async_logging_hook.
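
The OrderedDict behavior behind item 1 can be checked in isolation. This is a standalone demonstration with made-up keys, not code from base.py.

```python
from collections import OrderedDict

entries: "OrderedDict[str, str]" = OrderedDict()
entries["u1"] = "old"
entries["u2"] = "x"

entries["u1"] = "refreshed"   # overwriting keeps u1 in its original slot
order_before = list(entries)  # u1 is still first, so it would be evicted first

entries.move_to_end("u1")     # the fix: push the refreshed key to the back
order_after = list(entries)
```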

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(purview): message separator, non-blocking logging_hook, TextChoices type error

Three bugs fixed in the Microsoft Purview DLP guardrail:

1. get_prompt_text_for_dlp message separator (base.py)
   - Previously called get_str_from_messages() which concatenated all message
     texts with NO separator, so 'end of msg1' + 'start of msg2' became
     'end of msg1start of msg2'.
   - Now joins per-message text with '\n\n' via convert_content_list_to_str(),
     preserving DLP pattern detection accuracy across message boundaries.

2. logging_hook blocking the event loop thread (purview_dlp.py)
   - Previously called future.result() which blocked the calling thread
     (often the event loop thread) for the entire round-trip of two sequential
     Microsoft Graph API calls (_compute_protection_scopes + _process_content).
   - Now fires and forgets: when called inside a running loop, schedules the
     coroutine with loop.create_task(); otherwise spawns a daemon thread.
     Returns (kwargs, result) immediately in both cases.
   - Removes unused concurrent.futures.ThreadPoolExecutor import; adds threading.

3. Incompatible assignment type error (purview_dlp.py:180)
   - mypy inferred 'choice' as TextChoices from the first loop body, then
     flagged the assignment in the second loop as incompatible with Choices.
   - Fixed by using distinct loop variable names: text_choice (TextChoices) and
     chat_choice (Choices).
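
To illustrate why the separator in item 1 matters, here is a small sketch with a hypothetical nine-digit pattern standing in for a DLP rule. Purview evaluates content server-side, so the regex is only a local stand-in to show how fused text changes what a pattern sees.

```python
import re
from typing import List

# Hypothetical DLP-style pattern: a standalone nine-digit identifier.
NINE_DIGITS = re.compile(r"\b\d{9}\b")


def join_without_separator(texts: List[str]) -> str:
    return "".join(texts)


def join_with_separator(texts: List[str]) -> str:
    return "\n\n".join(texts)


messages = ["order 12345", "6789 items shipped"]
fused = join_without_separator(messages)  # "order 123456789 items shipped"
separated = join_with_separator(messages)

fused_match = NINE_DIGITS.search(fused)          # spurious match across messages
separated_match = NINE_DIGITS.search(separated)  # no match
```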

Tests: 7 new tests added covering the separator fix (TestGetPromptTextForDlp)
and the non-blocking logging_hook (TestLoggingHookNonBlocking).
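
The fire-and-forget scheduling from item 2 could look roughly like the sketch below. The helper name fire_and_forget and the _background_tasks set are assumptions for illustration; only asyncio.get_running_loop, loop.create_task, and daemon threads come from the description above.

```python
import asyncio
import threading
from typing import Any, Callable, Coroutine, Set

# Keep strong references so scheduled tasks are not garbage-collected early.
_background_tasks: Set["asyncio.Task[None]"] = set()


def fire_and_forget(make_coro: Callable[[], Coroutine[Any, Any, None]]) -> None:
    """Start audit work and return immediately (hypothetical helper)."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No running loop in this thread: use a daemon thread with its own loop.
        threading.Thread(
            target=lambda: asyncio.run(make_coro()), daemon=True
        ).start()
        return
    task = loop.create_task(make_coro())
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
```

Passing a coroutine factory rather than a coroutine object avoids creating a coroutine in one thread and awaiting it in another.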

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(purview): suppress API errors in logging-only mode and scan tool-call arguments

Three issues fixed:

1. _check_content except block re-raised unconditionally even when
   block_on_violation=False. The docstring promised 'log only - do not
   raise' but network/API errors always propagated. Fixed by checking
   block_on_violation before re-raising; when False, log a warning and
   continue.

2. async_logging_hook used a single try/except wrapping both the prompt
   and response audit calls. When the first _check_content (uploadText)
   raised due to an API error the second call (downloadText) was silently
   skipped. Fixed by giving each audit call its own try/except so both
   always run independently.

3. convert_content_list_to_str() only reads message.content, so
   tool_calls[].function.arguments and function_call.arguments were
   invisible to the Purview pre-call and post-call scans. An authenticated
   caller could embed sensitive text in tool-call arguments and bypass DLP.
   Fixed by:
   - Adding PurviewGuardrailBase._extract_tool_call_args_from_message()
     which handles both dict and object-style messages, covering both
     tool_calls[] arrays and the legacy function_call field.
   - Updating get_prompt_text_for_dlp() to include those arguments
     alongside message content (request/prompt path).
   - Changing _completion_response_text_parts() from @staticmethod to an
     instance method and adding tool-call argument extraction for
     ModelResponse choices (response path).
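
A sketch of the extraction described in item 3, assuming messages may be plain dicts or attribute-style objects. The function names here are hypothetical and simplified relative to _extract_tool_call_args_from_message.

```python
from typing import Any, List


def _field(obj: Any, name: str) -> Any:
    """Read a field from either a dict or an attribute-style object."""
    if isinstance(obj, dict):
        return obj.get(name)
    return getattr(obj, name, None)


def extract_tool_call_args(message: Any) -> List[str]:
    """Collect argument strings from tool_calls[] and legacy function_call."""
    parts: List[str] = []
    for call in _field(message, "tool_calls") or []:
        args = _field(_field(call, "function"), "arguments")
        if isinstance(args, str) and args:
            parts.append(args)
    legacy_args = _field(_field(message, "function_call"), "arguments")
    if isinstance(legacy_args, str) and legacy_args:
        parts.append(legacy_args)
    return parts
```

The returned strings would then be appended to the message content before it is sent for scanning, so text hidden in tool-call arguments is no longer invisible.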

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* chore(ui): restructure pre-built Next.js output to directory-based routing

Flat page files (e.g. guardrails.html) replaced by directory-based
index.html equivalents (e.g. guardrails/index.html) matching the
Next.js App Router output format.

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(purview): comprehensive security hardening — identity spoofing, streaming bypass, token-id gap

Four security issues addressed:

1. end_user_id kwargs fallback missing in _resolve_user_id_from_logging_kwargs
   user_id already fell back to kwargs.get("user_api_key_user_id") when absent
   from metadata, but end_user_id only checked md.get("user_api_key_end_user_id")
   with no kwargs-level fallback. Added or kwargs.get("user_api_key_end_user_id").

2. Streaming responses bypassed post_call blocking
   async_post_call_success_hook only runs on assembled non-streaming responses.
   For streaming requests the proxy already delivered all content before the
   hook ran, so raising HTTPException there had no effect. Added
   async_post_call_streaming_iterator_hook which buffers the entire stream,
   assembles it via stream_chunk_builder, runs the Purview DLP check, and only
   then re-yields chunks via MockResponseIterator. If a violation is detected the
   exception is raised before any bytes reach the client. The proxy automatically
   skips async_post_call_success_hook for guardrails that define this method,
   preventing duplicate scans.

3. Caller-controlled Purview user identity in blocking modes
   When a LiteLLM API key has no bound user_id the guardrail fell back to
   metadata[user_id_field], which is supplied by the caller. A caller could set
   this to any Entra object ID whose Purview policies are more permissive and
   bypass DLP. Added _resolve_trusted_user_id() that only returns identities
   from the proxy auth system (user_api_key_dict.user_id, end_user_id, or
   proxy-injected metadata["user_api_key_user_id"]). Added
   _resolve_user_id_for_blocking() used by all blocking-mode hooks: tries
   trusted sources first; if only caller-supplied is available, logs a
   SECURITY WARNING and still proceeds (backward compat); if nothing resolves,
   skips with a warning.

4. Token-id prompt DLP bypass
   When /v1/completions received a pure token-id array prompt,
   completion_prompt_to_str() returned None and the pre_call hook silently
   skipped the Purview scan. An authenticated caller could tokenize blocked
   text and send it without DLP evaluation. The hook now detects this case
   (raw_prompt present but prompt_text None) and logs a WARNING while letting
   the request pass through — token-id payloads are opaque at the text layer
   and cannot be scanned. This makes the gap explicit rather than silent.
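
The buffer-then-replay approach from item 2 can be sketched as an async generator. Everything here is a simplified stand-in: chunks are plain strings, PolicyViolation replaces HTTPException, and the check callable replaces the Purview call.

```python
import asyncio
from typing import AsyncIterator, Callable, Iterable, List


class PolicyViolation(Exception):
    """Stand-in for the HTTPException the guardrail raises."""


async def scan_then_replay(
    stream: AsyncIterator[str], check: Callable[[str], None]
) -> AsyncIterator[str]:
    # Drain the upstream stream completely before releasing anything.
    buffered: List[str] = [chunk async for chunk in stream]
    check("".join(buffered))  # raises before any chunk is yielded
    for chunk in buffered:
        yield chunk


async def fake_stream(parts: Iterable[str]) -> AsyncIterator[str]:
    for part in parts:
        yield part


def deny_secret(text: str) -> None:
    if "secret" in text:
        raise PolicyViolation("blocked")


async def collect(parts: Iterable[str]) -> List[str]:
    return [c async for c in scan_then_replay(fake_stream(parts), deny_secret)]
```

The trade-off is latency: the client sees no bytes until the whole response has been generated and scanned.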

Tests: 94 total, all passing.

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* Revert "chore(ui): restructure pre-built Next.js output to directory-based routing"

This reverts commit c70c4303b735bb3885732bd4a0e01997e9571f56.

* fix(purview): fail closed on identity spoofing, token prompts, and path encoding

Encode Entra user IDs in Graph paths, guard caches with asyncio.Lock, scan
Responses API instructions with string input, reject caller-only metadata and
token-id completion prompts in blocking mode, and revert unrelated UI HTML
restructure from the PR branch.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(purview): use threading.Lock and getattr for LitellmParams

- Replace asyncio.Lock with threading.Lock in PurviewGuardrailBase.
  The cache lock is acquired both from the proxy's main event loop and
  from short-lived event loops created by the logging_hook thread
  fallback. In Python 3.10+ an asyncio.Lock is bound to the first event
  loop that acquires it, so the second loop would silently break audit
  logging with RuntimeError. All critical sections are in-memory dict
  ops with no awaits, so a synchronous lock is safe.

- Use getattr() on LitellmParams in initialize_guardrail() instead of
  .get(), which does not exist on Pydantic BaseModel instances and
  would raise AttributeError at runtime. Tests updated to construct
  Mock objects with spec= so they reflect the real interface.
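
A minimal sketch of the first point, assuming a cache touched from the main event loop and from a short-lived loop in a helper thread. SharedCache is a hypothetical class; the point is only that a threading.Lock has no event-loop affinity.

```python
import asyncio
import threading
from typing import Any, Dict


class SharedCache:
    """In-memory cache guarded by a synchronous lock (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, Any] = {}

    async def store(self, key: str, value: Any) -> None:
        # No await inside the critical section, so holding a blocking
        # lock here cannot stall the event loop on other coroutines.
        with self._lock:
            self._data[key] = value

    def snapshot(self) -> Dict[str, Any]:
        with self._lock:
            return dict(self._data)


shared = SharedCache()
asyncio.run(shared.store("main-loop", 1))

# A second, short-lived event loop in another thread uses the same lock.
worker = threading.Thread(target=lambda: asyncio.run(shared.store("worker-loop", 2)))
worker.start()
worker.join()
```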

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* refactor(purview): dedupe trust-level user resolution and drop dead code

- _resolve_user_id now delegates levels 1-3 to _resolve_trusted_user_id
  so blocking and non-blocking paths share a single source of truth.
- Drop redundant event_hook override in MicrosoftPurviewDLPGuardrail.__init__
  (initialize_guardrail already forwards event_hook=litellm_params.mode).
- Drop unused self._logging_only attribute; blocking is controlled by the
  block_on_violation argument passed to _check_content.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview): fail-closed on responses API transform error; avoid duplicate audit calls

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview): fail-closed blocking DLP; revert directory-based UI HTML

Blocking hooks now require UserAPIKeyAuth user_id/end_user_id only (no
spoofable metadata), re-raise Responses API transform errors, scan streamed
text completions, and reject requests with no bound identity. Reverts the
accidental directory-based Next.js output from cc47081 (c70c4303b7).

Co-authored-by: Cursor <cursoragent@cursor.com>

* Remove dead code in purview_dlp: _resolve_user_id_for_blocking never returns falsy

The method either returns a non-empty trusted user id or raises HTTPException,
so the 'if not user_id' guards in async_pre_call_hook and async_post_call_success_hook
were unreachable. Tighten the return type to str and drop the dead checks to
make the fail-closed behavior explicit.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview): exclude caller-controlled end_user_id from blocking DLP

Blocking Purview checks now use only API-key/JWT-bound user_id, not
end_user_id populated from request user/metadata/safety_identifier.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style(purview): apply Black formatting to base.py

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(purview): use post-await timestamp for cache TTL

Capture the timestamp after the network call completes when storing it
as the cache freshness marker, so the effective TTL reflects when the
response was actually received rather than when the request started.
Under high network latency the previous behavior shortened the
effective cache lifetime.
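
In sketch form, the change amounts to reading the clock after the await instead of before it. The function name refresh_entry and the injectable clock parameter are assumptions added for illustration and testability.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, Tuple

Cache = Dict[str, Tuple[float, Any]]


async def refresh_entry(
    cache: Cache,
    key: str,
    fetch: Callable[[], Awaitable[Any]],
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    value = await fetch()
    # Timestamp taken after the response arrives, so the TTL window
    # starts when the data was received, not when the request began.
    cache[key] = (clock(), value)
    return value
```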

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview_dlp): fail closed when stream_chunk_builder returns None

stream_chunk_builder can return None (e.g., when ChunkProcessor filters
all chunks), causing both isinstance checks to fail and the buffered
chunks to be released without DLP scanning. Explicitly fail closed in
that case by raising an HTTPException so the streaming DLP guardrail
does not bypass policy enforcement.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview_dlp): resolve user_id before buffering stream

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* merge main (#28629)

* test(vcr): classify cache verdicts, detect live calls, surface cost leaks

Convert the per-test VCR verdict line from a single 'NOOP / HIT / MISS /
PARTIAL' tag into a classified outcome that distinguishes the cases that
silently bill the live API on every CI run from the ones that don't:

  HIT                         pure replay
  PARTIAL                     mixed replay + new recordings
  MISS:RECORDED               new cassette saved to Redis (cached next run)
  MISS:OVERFLOW               cassette > MAX_EPISODES_PER_CASSETTE; persister
                              refused to save; re-bills every run
  MISS:NOT_PERSISTED          test failed; save_cassette skipped; re-bills
  NOOP                        VCR-marked but no HTTP traffic (mocked elsewhere)
  UNMARKED:LIVE_CALL          test bypassed VCR AND opened a TCP connection
                              to a known LLM provider host -> wasted spend
  UNMARKED:NO_TRAFFIC         test bypassed VCR but didn't call out

The UNMARKED:LIVE_CALL signal is what converts 'this test probably hits
live' into 'this test connected to api.openai.com'. We install a
socket.connect / socket.create_connection wrapper for the duration of
each non-VCR-marked test and record any outbound TCP to a known LLM
provider hostname. The probe sits below the httpx layer so vcrpy and
respx (which both patch above the socket) are unaffected.

Replace the file-level _RESPX_CONFLICTING_FILES blacklists in the
llm_translation and local_testing conftests with per-item respx
detection in apply_vcr_auto_marker_to_items. A test now skips VCR when
it actually carries @pytest.mark.respx or has respx_mock in its fixture
chain - not just because some other test in the same file imports
MockRouter. Items skipped by skip_files are split into respx_conflict
(real conflict, the module wires up respx) vs file_opt_out (dead skip-
list entry whose module never touches respx) so the session summary
makes pruning obvious.

Stabilize the AWS SigV4 fingerprint: the Authorization header on
Bedrock requests rotates its Credential date and Signature on every
call, which previously pushed every Bedrock test past the 50-episode
overflow threshold. Extract the access-key id only
('aws-sigv4:AKIA...') so two requests with the same identity match.
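
A rough sketch of that fingerprint, assuming the standard SigV4 Authorization header layout (Credential=<access-key-id>/<date>/<region>/<service>/aws4_request). The function name and the sample key are made up.

```python
import re
from typing import Optional

_CREDENTIAL_RE = re.compile(r"Credential=([^/\s,]+)/")


def sigv4_fingerprint(authorization: str) -> Optional[str]:
    """Reduce a SigV4 Authorization header to its access-key id."""
    if not authorization.startswith("AWS4-HMAC-SHA256"):
        return None
    match = _CREDENTIAL_RE.search(authorization)
    return f"aws-sigv4:{match.group(1)}" if match else None


# Two requests from the same identity on different days, different signatures.
first = (
    "AWS4-HMAC-SHA256 Credential=AKIAEXAMPLEKEY/20260521/us-east-1/bedrock/"
    "aws4_request, SignedHeaders=host;x-amz-date, Signature=aaaa1111"
)
second = (
    "AWS4-HMAC-SHA256 Credential=AKIAEXAMPLEKEY/20260522/us-east-1/bedrock/"
    "aws4_request, SignedHeaders=host;x-amz-date, Signature=bbbb2222"
)
```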

Always emit verdict logging when VCR is active (set
LITELLM_VCR_VERBOSE=0 to opt back into the legacy quiet mode). Add a
session-end classification summary that lists overflow tests, unmarked
live-call tests, and the skip-reason breakdown.

Wire the live-call probe + summary hook into every test directory that
already uses the Redis-backed VCR cache (audio_tests, guardrails_tests,
image_gen_tests, litellm_utils_tests, llm_responses_api_testing,
llm_translation, local_testing, logging_callback_tests, ocr_tests,
pass_through_unit_tests, router_unit_tests, search_tests,
unified_google_tests).

Add tests/llm_translation/test_vcr_classification.py covering the
verdict classifier, skip-reason tagging, AWS SigV4 fingerprint stability,
live-host classification, and session summary rendering.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): drop dead 'from respx import MockRouter' imports

These seven test files were on _RESPX_CONFLICTING_FILES, which made the
auto-marker skip them entirely. Inspecting the source shows the only
respx artifact is a top-level 'from respx import MockRouter' that no
test ever uses - no @pytest.mark.respx, no respx_mock fixture, no
respx.mock context manager. The import is dead code left over from a
previous mocking pattern.

Now that apply_vcr_auto_marker_to_items detects respx per-item via the
marker / fixture chain (b637d9f64a), the file-level skip is no longer
needed for these files - they were the reason the OpenAI tests
(test_o3_reasoning_effort, test_streaming_response[o1/o3-mini],
TestOpenAIO1::test_streaming, TestOpenAIChatCompletion::test_web_search,
TestOpenAIO3::test_web_search, etc.) ran live every CI build despite
the cassette cache being healthy.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(image_edits): regenerate fixtures per call instead of holding open module-level file handles

Module-level

    TEST_IMAGES = [
        open(os.path.join(pwd, 'ishaan_github.png'), 'rb'),
        open(os.path.join(pwd, 'litellm_site.png'), 'rb'),
    ]
    SINGLE_TEST_IMAGE = open(...)

opens the file once at import. After the first multipart upload, the
file pointer is at EOF, so every subsequent test in the same xdist
worker sends an empty multipart body. That non-determinism (a) blows
the recorded cassette past MAX_EPISODES_PER_CASSETTE (50) so
_RedisPersister.save_cassette refuses to save it, and (b) re-bills the
live image edit endpoint on every CI run.

Recent CI runs confirm the leak: tests/image_gen_tests/test_image_edits.py
shows six tests parking at 51-52 cassette entries
(TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False],
TestOpenAIImageEditDallE2::..., test_openai_image_edit_with_bytesio,
test_openai_image_edit_litellm_router, test_multiple_vs_single_image_edit[False],
test_multiple_image_edit_with_different_formats).

Replace the module-level file handles with _make_test_images() /
_make_single_test_image() factories that return fresh _RewindableImage
(BytesIO subclass) objects whose pointer always starts at 0. The image
bytes are read once at import into module-level constants
(_ISHAAN_GITHUB_BYTES, _LITELLM_SITE_BYTES), so disk I/O cost is
unchanged.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(vcr): match real Bedrock hostnames in live-call probe

The suffix '.bedrock-runtime.amazonaws.com' never matched real Bedrock
endpoints, which use the format 'bedrock-runtime[-fips].{region}.amazonaws.com'
(region between 'bedrock-runtime' and 'amazonaws.com'). Add an explicit
host check for that pattern so Bedrock live calls are visible to the
probe, and update the unit test accordingly. Also drop the unused
'_LIVE_CALL_PROBE_INSTALLED' module variable.
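
One way to express that host check is a single anchored regex. This is a sketch; the actual probe may implement the match differently, and is_bedrock_runtime_host is a hypothetical name.

```python
import re

# bedrock-runtime[-fips].{region}.amazonaws.com
_BEDROCK_RUNTIME_RE = re.compile(r"^bedrock-runtime(-fips)?\.[a-z0-9-]+\.amazonaws\.com$")


def is_bedrock_runtime_host(host: str) -> bool:
    return _BEDROCK_RUNTIME_RE.match(host.lower()) is not None


real_host = "bedrock-runtime.us-east-1.amazonaws.com"
# The old suffix check never matched a real endpoint:
old_check = real_host.endswith(".bedrock-runtime.amazonaws.com")
new_check = is_bedrock_runtime_host(real_host)
```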

* fix(vcr): cover full RFC1918 172.16.0.0/12 range in local prefixes
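
For reference, 172.16.0.0/12 spans 172.16.x.x through 172.31.x.x. The sketch below shows two equivalent checks; whether the probe uses string prefixes or the ipaddress module is an assumption here, and both helper names are hypothetical.

```python
import ipaddress

# String-prefix form: one prefix per second octet in the /12 block.
PRIVATE_172_PREFIXES = tuple(f"172.{octet}." for octet in range(16, 32))

_RFC1918_172 = ipaddress.ip_network("172.16.0.0/12")


def in_172_private_block(host: str) -> bool:
    """True when host is an IP literal inside 172.16.0.0/12."""
    try:
        return ipaddress.ip_address(host) in _RFC1918_172
    except ValueError:
        return False  # not an IP literal (e.g. a DNS name)
```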

* fix(image_edits): drop _RewindableImage to prevent infinite multipart upload

The _RewindableImage(BytesIO) wrapper auto-rewound on every read after
EOF, which made the OpenAI SDK's multipart upload writer read the same
bytes forever instead of seeing EOF. Workers OOM'd / SIGKILL'd:

    [gw0] node down: Not properly terminated
    replacing crashed worker gw0
    ...
    worker 'gw1' crashed while running
        'tests/image_gen_tests/test_image_edits.py::TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False]'

The auto-rewind was added defensively for parametrized + flaky-retried
tests, but BaseLLMImageEditTest::test_openai_image_edit_litellm_sdk
already calls get_base_image_edit_call_args() once per invocation and
that helper now constructs fresh streams via _make_test_images(), so
rewinding inside the stream is unnecessary. Replace with plain BytesIO
seeded with the cached image bytes.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): mark Bedrock prompt-caching cross-call tests VCR-incompatible

The pass_through prompt-caching tests
(test_prompt_caching_returns_cache_read_tokens_on_second_call,
test_prompt_caching_streaming_second_call_returns_cache_read) make a
warm-up call and then assert the *second* call sees a non-zero
cache_read_input_tokens count from the upstream's prompt-cache. VCR
replay can't model cross-call provider state — both calls match the
same cassette episode, so the second call returns the first call's
pre-warmup response and the assertion fails:

    AssertionError: Expected cache_read_input_tokens > 0 on second call,
    but got 0. Full usage: {'input_tokens': 4986,
    'cache_creation_input_tokens': 4974, 'cache_read_input_tokens': 0}

This started biting after the AWS SigV4 fingerprint stabilization
(b637d9f64a): Bedrock requests now produce a stable per-access-key
fingerprint instead of a per-request signature, so cassettes
successfully replay where they previously always missed and re-recorded
live. Opt these tests out via skip_nodeid_suffixes so they run live and
match the existing pattern in tests/llm_translation/conftest.py
(::test_prompt_caching).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): tighten OVERFLOW classification and switch respx detection to AST

Address two greptile P2 review concerns on PR #27795:

1. MISS:OVERFLOW was firing whenever total > MAX_EPISODES_PER_CASSETTE
   regardless of cassette state. A cassette that grew past the cap
   historically but this run only *replayed* (dirty=False) is
   healthy — the persister never tries to save, so the cache state is
   stable and the next run will replay too. Only flag OVERFLOW when
   dirty=True (new episodes were recorded that the persister would
   refuse to save). Add a regression test covering the
   dirty=False + large-total case.

2. _module_uses_respx did substring matching on the module source,
   which false-positives on comments / docstrings / string literals.
   A comment like '# Previously tried respx.mock but switched to
   vcrpy' would keep a file pinned on the opt-out list, defeating the
   dead-import pruning goal of this PR. Replace the substring scan
   with an ast.NodeVisitor (_RespxUsageVisitor) that only
   counts:

     - @pytest.mark.respx / @respx.mock decorators
     - with respx.mock(): ... (sync + async) context managers
     - respx.mock(...) calls outside a with/decorator
     - function parameters / fixture names equal to respx_mock

   Add tests for the comment / docstring / string-literal cases plus
   each real-usage pattern.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(vcr): aggregate worker stats on the controller so the session summary actually renders under xdist

`_session_stats` is a module-level dict mutated inside `_vcr_outcome_gate`
— which runs in each xdist worker process. The controller's
`pytest_terminal_summary` then reads its own empty `_session_stats` and
bails on `if not counts: return`, so the OVERFLOW / LIVE_CALL sections
the rest of this PR adds never make it into CI logs in the dist mode CI
actually uses.

Ship a structured `vcr_outcome` payload via `user_properties` (which
xdist round-trips) and add `aggregate_report_outcome` on the controller
to fold worker outcomes into `_session_stats`. The recording process
tags `vcr_recorded_by` with `PYTEST_XDIST_WORKER` so the controller can
tell "single-process — already counted locally" apart from "produced by
a worker — needs aggregation here", and not double-count when there's
no xdist.

Covered by 9 new unit tests in test_vcr_classification.py including the
end-to-end summary render path.

* fix(guardrails): improve CrowdStrike AIDR input handling (#26658)

* feat(lasso): add tool-calling support to LassoGuardrail (#27648)

* feat(lasso): extend LassoGuardrail to support tool calling (RND-5748)

* fix(lasso): PR review followups for tool-calling guardrail (RND-5748)

* fix(lasso): handle object-style tool_calls in _update_tool_calls_from_masked (RND-5748)

* fix(lasso): use model role for tool_use blocks (RND-5748)

* test(lasso): add round-trip tests for message transformation (RND-5748)

* fix(lasso): remove unused imports, handle Responses-API input masking, flatten multimodal content (RND-5748)

* fix(lasso): inspect Responses-API input field (RND-5748)

* fix(lasso): guard text-cursor remap against Lasso count mismatch (RND-5748)

* fix(lasso): flatten list content in tool_result.content (RND-5748)

* fix(lasso): remap multimodal list content during masking (RND-5748)

Bug: _map_masked_messages_back counted list-content messages in
original_text_count but the remap loop only handled isinstance(str).
The positional text_cursor never advanced for list messages, causing
all subsequent masked texts to be written onto the wrong messages.

Fix: added elif isinstance(content, list) branch that replaces the
list with the masked text string and advances the cursor — mirrors
the existing string-content branch. Also handles the assistant +
tool_calls combo for list-content messages.

Test: test_map_masked_messages_back_list_content verifies a user
message with [text + image_url] followed by an assistant message
gets correct masked content on both (cursor stays aligned).

* refactor(lasso): extract _get_field and _extract_tool_call_fields helpers (RND-5748)

The dict-vs-object access pattern (x.get('y') if isinstance(x, dict)
else getattr(x, 'y', None)) was duplicated 14 times across 5 methods.

_get_field(obj, field) — single-point dict/Pydantic field access.
_extract_tool_call_fields(call) — returns (call_id, name, parsed_input)
with JSON argument parsing, replacing ~30 duplicate lines in both
async_post_call_success_hook and _expand_messages_for_classification.

Also simplified _update_tool_calls_from_masked, _prepare_payload tool
mapping, and _apply_masking_to_model_response call_id extraction.

Net ~60 lines removed. No behavior change — all 32 tests pass.

* fix(lasso): add count guard to _apply_masking_to_model_response (RND-5748)

_apply_masking_to_model_response used a bare text_cursor without
verifying 1:1 correspondence between text-bearing choices and masked
text entries. If Lasso returned a different number of text messages
than choices with content, masked text would be applied to the wrong
choice or silently skip choices.

Added the same count-mismatch guard pattern already used in
_map_masked_messages_back: count original text-bearing choices,
compare to masked_text length, skip text remap on mismatch with a
warning log. Tool_call masking via id-based lookup is unaffected.

Tests:
- test_apply_masking_to_model_response_multiple_choices: verifies
  correct per-choice masked text with 2 choices
- test_apply_masking_to_model_response_count_mismatch: verifies
  content is left unchanged when counts disagree
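
The count-guard pattern, reduced to plain lists. This is a sketch with hypothetical names; the real method operates on response choices rather than strings.

```python
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)


def apply_masked_texts(
    originals: List[Optional[str]], masked: List[str]
) -> List[Optional[str]]:
    """Write masked texts back positionally, but only on a 1:1 count match."""
    text_slots = [i for i, text in enumerate(originals) if text]
    if len(text_slots) != len(masked):
        logger.warning(
            "masked text count %d != original text count %d; skipping remap",
            len(masked),
            len(text_slots),
        )
        return list(originals)  # leave content unchanged on mismatch
    result = list(originals)
    for slot, text in zip(text_slots, masked):
        result[slot] = text
    return result
```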

* fix(lasso): close two guardrail-bypass paths flagged in review (RND-5748)

* tool-call args: when function.arguments is malformed JSON or parses
  to a non-object, preserve the raw string as {"arguments": <raw>} so
  Lasso still inspects it instead of receiving input=None. Covers both
  pre-call and post-call extraction (shared helper). Also resolves the
  CodeQL empty-except warning since the except body now assigns parsed=None.
* Responses-API input: when a request carries both "messages" and
  "input", inspect both. Previously a benign messages array let the
  guardrail skip data["input"] entirely. The masking write-back is
  split via a count boundary so masked messages flow back to
  data["messages"] and masked input flows back to data["input"]
  without cross-contamination.

Tests: malformed/non-object args round-trip, dual-field classification,
dual-field masking write-back split.
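
The first bullet's fallback can be sketched as follows. The helper names mirror the ones mentioned earlier in this log but the bodies are an approximation, not the code in lasso.py.

```python
import json
from typing import Any, Dict, Optional, Tuple


def get_field(obj: Any, field: str) -> Any:
    """Single-point dict/object field access."""
    if isinstance(obj, dict):
        return obj.get(field)
    return getattr(obj, field, None)


def extract_tool_call_fields(
    call: Any,
) -> Tuple[Any, Any, Optional[Dict[str, Any]]]:
    """Return (call_id, name, parsed_input) without ever dropping raw arguments."""
    function = get_field(call, "function")
    raw = get_field(function, "arguments")
    parsed: Any = None
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = None
    if not isinstance(parsed, dict):
        # Malformed JSON or a non-object value: keep the raw string inspectable.
        parsed = {"arguments": raw} if raw else None
    return get_field(call, "id"), get_field(function, "name"), parsed
```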

* chore(lasso): black formatting + comment on expand skip branch (RND-5748)

* black: wrap two long expressions in lasso.py and reformat dict
  literals in test_lasso.py to satisfy CI lint.
* add a short comment in _expand_messages_for_classification
  explaining why empty string and None content are intentionally
  skipped (None is the OpenAI shape for a pure tool-call turn).

* fix(lasso): satisfy mypy in _handle_masking, _update_tool_calls_from_masked, _apply_masking_to_model_response (RND-5748)

* Narrow `response.get("messages")` into a local before slicing so
  mypy doesn't see `Optional[List[Dict[str, str]]]` as non-indexable.
* Rename the two write-side `func` bindings in
  `_update_tool_calls_from_masked` to `func_dict` / `func_obj` so
  mypy doesn't unify the dict and Any|None branches.
* Rename the inner loop variable in `_apply_masking_to_model_response`
  from `msg` to `masked_msg` to avoid clashing with the
  `msg = choice.message` rebinding below.

No behavior change; resolves the 7 mypy errors from the CI lint job.

* perf: eliminate per-request callback scanning on proxy hot path (#27858)

- Introduce `_CallbackCapabilities` dataclass and `ProxyLogging._callback_capabilities()` static method that inspects `litellm.callbacks` once and caches capability flags keyed on (list length, member ids); the cache invalidates automatically when the callback list mutates, without per-request iteration overhead
- Replace O(n) `litellm.callbacks` walks in `async_pre_call_hook`, `during_call_hook`, `async_post_call_streaming_iterator_hook`, `async_post_call_streaming_hook`, and `post_call_response_headers_hook` with fast-path exits when no relevant callbacks are registered
- Add `needs_iterator_wrap()` and `needs_per_chunk_streaming_hook()` instance methods to decouple iterator-level wrapping from per-chunk hook execution; avoids `get_response_string` materialization per chunk when no guardrail or chunk-hook callback is active
- Introduce `_fast_serialize_simple_model_response_stream()` using `orjson` for common single-choice text streaming chunks, bypassing the full Pydantic serializer; falls back to `model_dump_json` for tool calls, logprobs, usage, and provider-specific fields
- Add early-return in `_restamp_streaming_chunk_model` when downstream model already matches the requested model, avoiding unnecessary string comparisons on every chunk
- Fix stale zero-cost cache bug in `_is_model_cost_zero`: move the per-router `_zero_cost_cache` dict onto the `Router` instance and clear it in `_invalidate_model_group_info_cache` so in-place pricing updates via `upsert_deployment` immediately resume budget enforcement
- Add `scripts/benchmark_chat_completions_perf.py`: standalone async benchmarking tool with a mock OpenAI provider, LiteLLM proxy process management, non-streaming RPS, streaming TTFT, and full-stream latency measurements with repeat/median run support
- Add comprehensive unit tests covering capability detection, cache invalidation, fast-path correctness, zero-cost cache regression, and the no-callback streaming fast path
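
A rough sketch of the capability cache from the first bullet, assuming the key is exactly (list length, tuple of member ids). The class names, the single flag, and the scans counter are illustrative and do not reflect the real `_CallbackCapabilities` fields.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass(frozen=True)
class Capabilities:
    has_pre_call_hook: bool


class CapabilityCache:
    def __init__(self) -> None:
        self._key: Optional[Tuple[int, Tuple[int, ...]]] = None
        self._value: Optional[Capabilities] = None
        self.scans = 0  # counts full inspections, for demonstration only

    def get(self, callbacks: List[Any]) -> Capabilities:
        key = (len(callbacks), tuple(id(cb) for cb in callbacks))
        if key != self._key or self._value is None:
            self.scans += 1
            self._value = Capabilities(
                has_pre_call_hook=any(
                    hasattr(cb, "async_pre_call_hook") for cb in callbacks
                )
            )
            self._key = key
        return self._value
```

Hooks can then exit early when the relevant flag is false instead of walking every callback on each request.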

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>

* ci(mutmut): enable mutate_only_covered_lines to fit in CI budget (#27910)

The mutation-test workflow timed out at the 350-minute job cap when
running whole-folder mutation against litellm/proxy/management_endpoints/
(~30 files, ~1.5 MB of source). Every mutant was running the full
test suite, and mutants were generated for lines no test covers — which
would survive regardless, just wasting compute.

mutmut 3.x's mutate_only_covered_lines setting runs the suite once up
front to compute coverage, then skips mutating uncovered lines. This
cuts the mutant count dramatically and is the right semantic for the
score (no test → no kill possible → uncountable). Per-mutant test
filtering by function name is already automatic in mutmut 3.x; no
external coverage step is needed.

* fix(rate-limit): stop v3 limiter from leaking internal stash to provider body (#27913)

* fix(rate-limit): stop v3 limiter from leaking internal stash to provider body

PR #27001 (atomic TPM rate limit) introduced a reservation flow that
writes five LiteLLM-internal keys onto the request data dict:

  _litellm_rate_limit_descriptors
  _litellm_tpm_reserved_tokens
  _litellm_tpm_reserved_model
  _litellm_tpm_reserved_scopes
  _litellm_tpm_reservation_released

These keys are forwarded as request body params to the upstream provider,
which rejects them as unknown fields:

  OpenAI    -> 400 'Unknown parameter: _litellm_rate_limit_descriptors'
              (mapped by litellm to RateLimitError / 429, hiding the bug
               behind a misleading 'throttling_error' code)
  Anthropic -> 400 '_litellm_rate_limit_descriptors: Extra inputs are
               not permitted'

Net effect: every chat completion against any real provider fails the
moment a virtual key has any tpm_limit / rpm_limit set — i.e. v3-enforced
key-level TPM/RPM limits are broken end-to-end. The v3 RPM/TPM check
itself still runs (raises 429 on over-limit), but the success path
poisons the upstream body.

Reproduced on litellm_internal_staging HEAD (410ce761dc) against
gpt-4o-mini and claude-haiku-4-5 with a 1-RPM/1-TPM key — first request
fails with the provider's unknown-field error.

Fix: the stash is metadata only.

  - Add RATE_LIMIT_DESCRIPTORS_KEY constant and a _LITELLM_STASH_KEYS
    registry so we have a single source of truth for stash keys.
  - New helper _stash_value_in_metadata_channels writes to
    data['metadata'] / data['litellm_metadata'] without touching the
    top level.
  - _stash_reservation_in_data and the descriptor stash now route
    through that helper. _mark_reservation_released stops writing
    top-level.
  - _lookup_stashed_value also checks kwargs['metadata'] /
    kwargs['litellm_metadata'] (raw request_data shape) in addition to
    kwargs['litellm_params']['metadata'] (completion kwargs shape).
  - async_post_call_failure_hook now reads descriptors via the unified
    metadata lookup instead of request_data.get(top-level).
  - Defense in depth: async_pre_call_hook strips any stash key that
    somehow surfaced at the top level (stale cache, future refactor,
    test fixture) before returning.
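
A minimal sketch of the metadata-only stash, assuming a plain-dict request body. The function names mirror the helpers named above, but the bodies are illustrative guesses; in particular, creating `data['metadata']` when neither channel exists is an assumption of this sketch.

```python
from typing import Any, Dict, Optional

RATE_LIMIT_DESCRIPTORS_KEY = "_litellm_rate_limit_descriptors"
_METADATA_CHANNELS = ("metadata", "litellm_metadata")


def stash_value_in_metadata_channels(data: Dict[str, Any], key: str, value: Any) -> None:
    """Write key/value into the metadata channels, never at the top level."""
    channels = [c for c in _METADATA_CHANNELS if isinstance(data.get(c), dict)]
    if not channels:
        # Assumption: fall back to creating data["metadata"].
        data["metadata"] = {}
        channels = ["metadata"]
    for channel in channels:
        data[channel][key] = value


def lookup_stashed_value(kwargs: Dict[str, Any], key: str) -> Optional[Any]:
    """Check the raw request_data shape and the completion-kwargs shape."""
    litellm_params = kwargs.get("litellm_params")
    nested = litellm_params.get("metadata") if isinstance(litellm_params, dict) else None
    for candidate in (kwargs.get("metadata"), kwargs.get("litellm_metadata"), nested):
        if isinstance(candidate, dict) and key in candidate:
            return candidate[key]
    return None
```

Because nothing is written at the top level, the provider request body built from `data` carries no `_litellm_*` fields, while post-call hooks can still resolve the stash through the lookup.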

Tests:
  - New regression test asserts no _litellm_* stash key is present at
    the top level of data after async_pre_call_hook, and that the
    metadata channel still carries the reservation + descriptors so
    success / failure reconciliation works.
  - Existing test_tpm_concurrent.py tests that asserted top-level
    presence are updated to read from data['metadata'] — the location
    is an implementation detail; the spec is that post-call callbacks
    can resolve the stash.

Verified end-to-end against OpenAI gpt-4o-mini and Anthropic
claude-haiku-4-5 via /v1/chat/completions on a low-rpm key:

  - With limits not exceeded: HTTP 200, valid completion response,
    no leaked fields in body.
  - With RPM exceeded: HTTP 429 from v3 enforcement
    ('Rate limit exceeded ... Limit type: requests').
  - With TPM exceeded: HTTP 429 from v3 enforcement
    ('Rate limit exceeded ... Limit type: tokens').

Full v3 hook test suite passes (171 tests).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* chore(rate-limit): use RATE_LIMIT_DESCRIPTORS_KEY constant in test, trim noisy comments

Address greptile P2: test fixture now uses the imported constant.
Drop comments that re-explain what well-named identifiers already convey.

* fix(rate-limit): reject caller-supplied stash values to prevent TPM-refund abuse

Strip _LITELLM_STASH_KEYS from data top-level and both metadata channels at
the start of async_pre_call_hook. Without this, an authenticated caller can
inject _litellm_rate_limit_descriptors plus _litellm_tpm_reserved_tokens in
body metadata, trigger a proxy-side rejection, and cause
async_post_call_failure_hook to refund TPM counters against attacker-named
scopes (e.g. another tenant's api_key).
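
A minimal sketch of the scrub step, assuming the request body is a plain dict. `scrub_caller_supplied_stash` is a hypothetical name; the key names are the ones listed in the earlier commit message.

```python
from typing import Any, Dict

_LITELLM_STASH_KEYS = frozenset(
    {
        "_litellm_rate_limit_descriptors",
        "_litellm_tpm_reserved_tokens",
        "_litellm_tpm_reserved_model",
        "_litellm_tpm_reserved_scopes",
        "_litellm_tpm_reservation_released",
    }
)
_METADATA_CHANNELS = ("metadata", "litellm_metadata")


def scrub_caller_supplied_stash(data: Dict[str, Any]) -> Dict[str, Any]:
    """Drop stash keys a caller may have injected, before the proxy writes its own."""
    for key in _LITELLM_STASH_KEYS:
        data.pop(key, None)
    for channel in _METADATA_CHANNELS:
        metadata = data.get(channel)
        if isinstance(metadata, dict):
            for key in _LITELLM_STASH_KEYS:
                metadata.pop(key, None)
    return data
```

Running this first means any stash value read later by the failure hook was written by the proxy itself, not by the caller.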

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix: allow for allowlisted redirect URIs (#27761)

* fix: allow for allowlisted redirect URIs

* address GitHub review comments

* Update litellm/proxy/_experimental/mcp_server/oauth_utils.py

Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com>

* harden oauth wildcard further

* test: cover wildcard entry with dot-leading suffix rejection

---------

Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com>

* Emit native web_search_tool_result blocks for Anthropic clients (Claude Desktop / Cowork citations) (#27886)

* feat(custom_logger): add async_post_agentic_loop_response_hook

Lets a CustomLogger shape the response returned by the agentic-loop
follow-up call without bypassing the loop's safety / observability
machinery (depth tracking, fingerprinting, etc.). Default returns the
response unchanged.

Used by websearch_interception to inject Anthropic-native
web_search_tool_result blocks when the originating client requested a
native web_search_* tool.

* feat(llm_http_handler): call post-agentic-loop hook on the originating callback

In _execute_anthropic_agentic_plan, after anthropic_messages.acreate
returns, call the originating callback's
async_post_agentic_loop_response_hook so it can mutate the final
response (e.g. inject native tool_result blocks). Pass the callback
through from _call_agentic_completion_hooks.

Exceptions in the post-hook are caught and logged so a buggy callback
can't kill the request.

* feat(websearch_interception): add is_anthropic_native_web_search_tool

Identifies tools the Anthropic-native clients (Claude Desktop, the
Anthropic SDK, the Anthropic Console) use to request native search:
type starts with "web_search_" (e.g. web_search_20250305). Rejects the
LiteLLM standard tool, the OpenAI-function variant, the bare
"WebSearch" legacy name, and the bare "web_search" Claude Code shape.

This lets us decide per-request whether the client expects
web_search_tool_result content blocks in the response, without
renaming any existing constants or touching native-provider skip
logic.
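
A minimal sketch of the detector rule described above (type starts with "web_search_"). The function name is shortened and the tool dicts in the usage lines are illustrative shapes, not fixtures from the test suite.

```python
from typing import Any, Dict


def is_native_web_search_tool(tool: Dict[str, Any]) -> bool:
    """True only for Anthropic-native tools such as type='web_search_20250305'."""
    tool_type = tool.get("type")
    return isinstance(tool_type, str) and tool_type.startswith("web_search_")


# Native tool: matched.
native = {"type": "web_search_20250305", "name": "web_search"}
# Bare names without a native type: rejected.
legacy = {"name": "WebSearch"}
bare = {"name": "web_search"}
# OpenAI-function variant: rejected because type is "function".
openai_fn = {"type": "function", "function": {"name": "litellm_web_search"}}
```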

* feat(websearch_interception): add build_web_search_tool_result_block

Produces the Anthropic-native web_search_tool_result content block
from a structured SearchResponse. Anthropic-native clients use this
block to populate citations / source links — the existing text-blob
flatten path only feeds readable evidence to the model and discards
the structure, so this builder gives us the missing piece.

Shape matches https://docs.anthropic.com/en/api/web-search-tool —
web_search_result items carry url, title, page_age, encrypted_content
(empty string when the search provider doesn't supply one).
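
A minimal sketch of the block shape, assuming search results arrive as plain dicts. The real builder takes a structured SearchResponse; the dict input and the signature here are assumptions for illustration.

```python
from typing import Any, Dict, List, Optional


def build_web_search_tool_result_block(
    tool_use_id: str, results: Optional[List[Dict[str, Any]]]
) -> Dict[str, Any]:
    """Build a web_search_tool_result content block from simple result dicts."""
    items = []
    for result in results or []:
        items.append(
            {
                "type": "web_search_result",
                "url": result.get("url", ""),
                "title": result.get("title", ""),
                "page_age": result.get("page_age"),
                # Empty string when the search provider supplies none.
                "encrypted_content": result.get("encrypted_content") or "",
            }
        )
    return {
        "type": "web_search_tool_result",
        "tool_use_id": tool_use_id,
        "content": items,
    }
```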

* feat(websearch_interception): emit native web_search_tool_result blocks

When the originating client request carried a native Anthropic
web_search_* tool, the final response now also carries
web_search_tool_result content blocks alongside the model's text
answer — so Claude Desktop / Anthropic SDK clients can populate the
citations panel and replay conversation history with structured search
evidence.

Wiring:
- Pre-request hooks (both deployment + Anthropic path) set a flag on
  kwargs when they see a native web_search_* tool, so the signal
  survives the conversion-to-litellm_web_search step regardless of
  which hook fires first.
- _execute_search now returns (text, SearchResponse) so the structured
  results aren't lost when the text is flattened for the follow-up
  model call.
- _build_anthropic_request_patch returns the parallel list of
  SearchResponse objects.
- async_build_agentic_loop_plan pre-builds the web_search_tool_result
  blocks (one per tool_use_id) and stashes them on plan.metadata when
  the flag is set.
- async_post_agentic_loop_response_hook reads the metadata and
  prepends the blocks to response.content.
- _execute_agentic_loop mirrors the injection for the legacy path so
  both paths behave identically.

Clients that send the LiteLLM standard tool keep the existing
text-only behavior — no regression.

* test(websearch_interception): cover native web_search_tool_result emission

18 tests across:
- detector branches (native vs litellm-standard, OpenAI-function shape,
  Claude Desktop builtin WebSearch, bare web_search, missing type)
- block-builder shape (results, none, empty)
- pre-request hook flag-setting (native sets, standard does not)
- async_build_agentic_loop_plan attaches blocks to plan.metadata when
  the flag is present, leaves metadata untouched when absent
- post-hook injection into dict and object responses
- legacy _execute_agentic_loop mirrors the injection so both paths
  return the same shape

* test(websearch_short_circuit): keep _execute_search mocks in sync with new tuple return

* test(websearch_thinking_constraint): keep _execute_search mocks in sync with new tuple return

* feat(websearch_interception): emit native blocks from try_short_circuit_search

The agentic-loop post-hook only fires when the model returns a tool_use
block. Cowork / Claude Desktop on Bedrock actually make TWO requests
per user turn: the main /v1/messages with their builtin tool, and a
separate standalone /v1/messages whose only tool is
web_search_20250305. That second request hits try_short_circuit_search
— no agentic loop, no post-hook — and was returning text-only, leaving
the citations panel empty.

When the short-circuit input carries a native web_search_* tool, build
a synthetic server_tool_use + web_search_tool_result pair (using the
structured SearchResponse already returned by _execute_search) so the
client gets the native shape it expects. The legacy text block is
preserved so non-native short-circuit callers (Claude Code,
github_copilot, etc.) see the same payload as before.

Failure path still emits the native block pair (with empty results)
plus the text-error block, so the client gets a well-formed response
rather than a malformed half-shape.

* test(websearch_native_blocks): cover short-circuit native-block emission

Three new cases on top of the existing 18:
- native web_search_20250305 short-circuit → [server_tool_use,
  web_search_tool_result, text], ids paired, urls/titles carried.
- litellm_web_search short-circuit → text-only (no regression).
- native short-circuit on search failure → still emits the native
  block pair (empty results) plus the text-error block, so the client
  never sees a malformed half-shape.

* test(websearch_short_circuit): index assertions by block type, not by position

Native short-circuit responses now have [server_tool_use,
web_search_tool_result, text] when the input carries
web_search_20250305 — find the text block by type rather than relying
on content[0].
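
A minimal sketch of the type-based lookup, assuming content blocks are plain dicts. `first_block_of_type` is a hypothetical helper name, not one taken from the test file.

```python
from typing import Any, Dict, List, Optional


def first_block_of_type(
    content: List[Dict[str, Any]], block_type: str
) -> Optional[Dict[str, Any]]:
    """Return the first content block with the given type, or None."""
    return next((b for b in content if b.get("type") == block_type), None)


# Native short-circuit shape: the text block is no longer at index 0.
native_content = [
    {"type": "server_tool_use", "id": "srvtoolu_1", "name": "web_search"},
    {"type": "web_search_tool_result", "tool_use_id": "srvtoolu_1", "content": []},
    {"type": "text", "text": "answer"},
]
# Text-only shape returned to non-native callers.
text_only_content = [{"type": "text", "text": "answer"}]
```

The same assertion then holds for both shapes, which is why the tests no longer depend on position.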

* fix(websearch_interception): gate legacy WebSearch name on schema absence

Clients like Cowork / Claude Desktop ship a client-side tool named
"WebSearch" with a full input_schema — they handle it themselves and
expect to make a separate native web_search_20250305 sub-request for
the actual search.

Today is_web_search_tool matches the bare name regardless of other
fields, which hijacks the client's tool server-side. The agentic loop
fires on the main request, the model never gets to emit the
client-side tool_use, and the separate native sub-request (where
citation data flows) is never made. Net: citations panel empty.

Real Anthropic client tools always carry input_schema (the API rejects
them otherwise), so a bare {name: "WebSearch"} with no schema is the
only thing that could be a legacy interception marker. Gate the match
on schema absence: legacy callers (if any) keep working, real
client-side WebSearch tools pass through untouched.
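
A minimal sketch of the schema gate, assuming tools are plain dicts. `is_legacy_web_search_marker` is a hypothetical name, and treating `input_schema: None` the same as a missing key is an assumption of this sketch.

```python
from typing import Any, Dict


def is_legacy_web_search_marker(tool: Dict[str, Any]) -> bool:
    """Match the bare 'WebSearch' name only when no input_schema is present."""
    return tool.get("name") == "WebSearch" and tool.get("input_schema") is None


# Bare interception marker: still matched.
marker = {"name": "WebSearch"}
# Real client-side tool with a schema: passes through untouched.
client_tool = {
    "name": "WebSearch",
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
}
```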

* fix(websearch_interception): drop "WebSearch" from response-detection lists

Post-conversion the model always sees ``litellm_web_search``, so the
"WebSearch" entry in the response-side tool_use detection lists was
dead at best. If a model ever did return ``tool_use(name="WebSearch")``
it would now (incorrectly) hijack the client's own ``WebSearch`` tool
again — same Cowork problem we just fixed on the input side. Drop it.

* test(websearch_native_blocks): cover the WebSearch legacy-name schema gate

Three new cases:
- {name: "WebSearch"} (bare interception marker) → still matched
- {name: "WebSearch", input_schema: {...}} (Cowork client tool) →
  passes through untouched
- {name: "WebSearch", description: "..."} (no schema) → still matched
  on the assumption it's a legacy marker rather than a malformed real
  client tool.

---------

Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>

* ci(codecov): restore litellm/ prefix on uploaded coverage paths

pytest-cov runs with --cov=litellm, which makes coverage.xml store paths
relative to the package root (e.g. `proxy/proxy_server.py` instead of
`litellm/proxy/proxy_server.py`). Codecov auto-resolves these only when
the basename is unique in the repo. Files like proxy_server.py, router.py,
utils.py, main.py, and constants.py — which have duplicates under
enterprise/ or other subpackages — get silently dropped during ingest.

The `fixes: ["::litellm/"]` rule prepends `litellm/` to every uploaded
path so they resolve unambiguously. Confirmed against multiple recent
coverage.xml artifacts that no uploader currently emits paths already
prefixed with `litellm/`, so the rule is safe to apply universally.

This restores Codecov visibility for the highest-fix-rate hotspots:
proxy_server.py, router.py, proxy/utils.py, litellm_logging.py,
constants.py, key_management_endpoints.py, utils.py, main.py,
user_api_key_auth.py, team_endpoints.py, and litellm_pre_call_utils.py.

* chore(ci): remove unused GitHub Actions workflows and orphan files

Audit of .github/workflows/ via gh run history shows the following have
either never run or have been dormant for 10+ weeks. CI coverage that
still matters is preserved on CircleCI (e.g. llm_translation_testing).

Removed workflows:
- test-litellm.yml — workflow_dispatch only, last run 2026-02-12 (cancelled);
  CCI local_testing_part1/2 covers the same tests
- llm-translation-testing.yml — last run 2025-07-10; replaced by CCI
  llm_translation_testing job (run_llm_translation_tests.py kept for the
  make test-llm-translation target)
- run_observatory_tests.yml — last run 2026-03-03 (cancelled)
- scan_duplicate_issues.yml — last run 2026-03-02 (failure)
- publish_to_pypi.yml — never run
- read_pyproject_version.yml — fires on every push to main but its echoed
  version output is not consumed by any downstream step

Removed orphan files (no callers in workflows, CCI, or Makefile):
- .github/workflows/README.md — documented only publish_to_pypi.yml
- .github/workflows/update_release.py + results_stats.csv
- .github/actions/helm-oci-chart-releaser/

* Revert "ci(codecov): restore litellm/ prefix on uploaded coverage paths"

This reverts commit e25a988a3f.

The `fixes: ["::litellm/"]` rule turned out to be applied *after* Codecov's
auto-resolution, not before. Files with unique basenames (which were
auto-resolving correctly to `litellm/<path>`) got an extra `litellm/`
prepended, producing `litellm/litellm/<path>` storage. Files with
ambiguous basenames (the actual target of the fix) continued to be
dropped because the auto-resolution still failed for them.

Net result on the verification run: 1375 files now stored under
unresolvable `litellm/litellm/...` paths, and the 11 originally-missing
hotspots are still missing. Reverting before piling on further changes.

* test(ui): preserve global Button/Tooltip mocks in per-file @tremor/react vi.mock

Per-file `vi.mock("@tremor/react", ...)` factories fully replace the
setup-level mock from `tests/setupTests.ts`, so the global Button/Tooltip
overrides are lost in any file that re-mocks `@tremor/react`. Without
them, the real Tremor `<Button>` leaks through and its internal
`useTooltip(300)` schedules a native 300ms `setTimeout` on pointer
events. When the test environment is torn down before the timer fires,
the trailing `setState` calls `getCurrentEventPriority`, which reads
`window.event` against a destroyed jsdom -> "window is not defined"
flake observed on CI.

Patches the 7 leaky test files to re-supply `Button` (bare `<button>`)
and `Tooltip` (Fragment) overrides matching `setupTests.ts`. Also drops
a dead `afterEach` workaround in `user_edit_view.test.tsx` (the
fake-timer dance it ran could not drain a real timer scheduled before
the swap) and corrects a misleading comment in `MakeMCPPublicForm.test.tsx`.

* ci: use --cov=./litellm so coverage paths resolve unambiguously in Codecov

pytest-cov treats --cov=<module-name> as a Python package and emits XML
paths relative to the package root, stripping the litellm/ prefix
(`proxy/proxy_server.py` instead of `litellm/proxy/proxy_server.py`).
Codecov's auto-prefix heuristic then drops every file whose basename is
ambiguous in the repo — `proxy_server.py` (3 copies under enterprise/),
`router.py` (2 copies), `utils.py` (20+), `main.py` (20+), `constants.py`
(2). The 11 highest-fix-rate hotspots have never appeared in Codecov.

Switching to --cov=./litellm treats the argument as a path, which makes
coverage.xml emit repo-relative paths (`litellm/proxy/proxy_server.py`).
Each path is unambiguous, so Codecov resolves all files correctly.

Verified locally: rerunning a single proxy_unit_tests test with
--cov=./litellm produced `filename="litellm/proxy/proxy_server.py"`,
`filename="litellm/router.py"`, and `filename="litellm/types/router.py"`
as distinct entries — exactly the disambiguation Codecov needs.

Touches every workflow that uploads coverage: the two reusable GHA
workflows (_test-unit-base.yml, _test-unit-services-base.yml),
test-mcp.yml, and all 14 invocations in .circleci/config.yml.

* fix(mcp): allow delegate PKCE bypass for internal MCP servers

Remove available_on_public_internet gating from delegate-auth-to-upstream
paths so oauth2 + delegate_auth_to_upstream interactive servers behave
the same when marked internal. Keeps M2M exclusion. Updates tests.

* chore(mcp): warn on internal + upstream PKCE delegate

Log verbose_logger.warning when loading oauth2 interactive servers with
available_on_public_internet=false and delegate_auth_to_upstream=true
(config + DB). Show a dashboard Alert for the same combination, add a
CLAUDE note for operators, and add tests for the warning log and the
M2M skip.

* fix(mcp): dedupe load_servers_from_config alias block

Removes accidental duplicate alias/mcp_aliases and get_server_prefix
logic (fixes PLR0915 and avoids resetting alias after mapping).

* fix(mcp): expose delegate_auth_to_upstream in MCP server list rows (#27936)

_build_mcp_server_table omitted delegate_auth_to_upstream, so GET /v1/mcp/server always returned the default false while the registry kept the DB value.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(proxy): fix vector store retrieve/list/update/delete without model (#27929)

* feat(proxy): fix vector store retrieve/list/update/delete routing without model

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(proxy): remove unchecked query-param injection in vector store management endpoints

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(proxy): use subset assertion for vector store route test to allow extra kwargs like shared_session

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(managed_batches): convert raw output_file_id to managed ID in CheckBatchCost poller (#27984)

* fix(managed_batches): convert raw output_file_id to managed ID in CheckBatchCost poller

CheckBatchCost bypasses async_post_call_success_hook, causing raw provider
output_file_ids to be persisted in LiteLLM_ManagedObjectTable. This fix converts
output_file_id and error_file_id to managed base64 IDs before the DB write.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(check_batch_cost): persist managed file before mutating response and propagate team_id

- Move setattr after store_unified_file_id so the response only receives the
  managed ID once the DB record is successfully written. Avoids serializing
  an orphaned managed ID into file_object when the store call fails.
- Populate team_id on the minimal UserAPIKeyAuth from job.team_id so the
  managed file record is created with the correct team ownership, allowing
  other team members to access the batch output file via /files/{id}/content.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(managed_batches): extend test to cover error_file_id conversion

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix managed file test

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(vertex-ai): fix zero cost/usage on completed Vertex AI batch jobs (#27912)

* fix(vertex-ai): fix zero cost/usage on completed Vertex AI batch jobs

Vertex batch jobs recorded 0 spend and 0 tokens after PR #25627 added
automatic transformation of GCS predictions.jsonl to OpenAI format.

Two bugs fixed:

1. batch_utils.py: the Vertex-specific cost/usage reader
   (calculate_vertex_ai_batch_cost_and_usage) was always invoked and
   reads raw usageMetadata fields that no longer exist in the
   OpenAI-shaped output. Now the reader is only used when
   disable_vertex_batch_output_transformation=True; otherwise the
   generic path handles the already-transformed OpenAI-shaped content.

2. cost_calculator.py: batch_cost_calculator skipped the global
   litellm.get_model_info() lookup when a model_info dict was passed
   in, even when that dict had no pricing fields (e.g. deployment
   metadata with only id/db_model). It now falls back to the global
   pricing table when the provided model_info has no pricing data.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Update litellm/cost_calculator.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix(cost-calculator): use not-any guard for pricing fallback in batch_cost_calculator

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(cost-calculator): treat explicit zero batch pricing as set in model_info

The fallback to litellm.get_model_info() used truthy checks on pricing
fields, so 0.0 was treated as missing and replaced by global rates.
Use `is not None` like elsewhere in cost calculation. Add regression test.
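
A minimal sketch of the difference between the truthy check and the `is not None` check. `PRICING_FIELDS` and `resolve_pricing` are illustrative names, and the two field names are a small subset chosen for the example.

```python
from typing import Any, Dict, Optional

# Illustrative subset of pricing fields; the real check may cover more.
PRICING_FIELDS = ("input_cost_per_token", "output_cost_per_token")


def has_pricing_truthy(model_info: Dict[str, Any]) -> bool:
    # Buggy variant: an explicit 0.0 is falsy, so it looks "missing".
    return any(model_info.get(field) for field in PRICING_FIELDS)


def has_pricing(model_info: Dict[str, Any]) -> bool:
    # Fixed variant: only None (or an absent key) counts as missing.
    return any(model_info.get(field) is not None for field in PRICING_FIELDS)


def resolve_pricing(
    model_info: Optional[Dict[str, Any]], global_info: Dict[str, Any]
) -> Dict[str, Any]:
    """Use the provided model_info when it carries pricing, else the global table."""
    if model_info is not None and has_pricing(model_info):
        return model_info
    return global_info
```

With this guard, deployment metadata that only holds `id`/`db_model` still falls back to the global table, while a deliberately free model keeps its zero rates.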

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* feat: add weighted-routing failover (#27980)

* feat: add weighted-routing failover

* test(router): cover weighted failover helper functions

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): align weighted failover deployment list type with mypy

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): address greptile review on weighted failover

- Narrow exception swallowing in `_maybe_run_weighted_failover` to
  `openai.APIError` so model failures defer to the regular fallback
  while programming bugs (AttributeError/KeyError/TypeError) surface.
- Note async-only limitation of `enable_weighted_failover` in the
  Router constructor docstring.
- Make the weighted distribution test less flaky (1000 iterations,
  looser bound) and make the non-simple-shuffle test deterministic by
  failing both deployments instead of relying on the latency strategy's
  first pick.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): ensure weighted failover metadata persists in kwargs

The previous `kwargs.setdefault(metadata_variable_name, {}) or {}` returned
a brand-new dict whenever the existing metadata was falsy (empty dict or
None), so writes to `_failover_excluded_ids` never made it back into
`kwargs`. Multi-hop weighted failover then re-selected previously failed
deployments and exhausted `max_fallbacks` prematurely.

Explicitly assign a fresh dict into kwargs when metadata is missing so
mutations are visible to subsequent failover hops.
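
A minimal reproduction of the bug and the fix. The function names are hypothetical; only the `setdefault(..., {}) or {}` expression comes from the description above.

```python
from typing import Any, Dict


def get_metadata_buggy(kwargs: Dict[str, Any], name: str = "metadata") -> Dict[str, Any]:
    # setdefault returns the stored value; when that value is {} or None it
    # is falsy, so `or {}` hands back a brand-new dict not held by kwargs.
    return kwargs.setdefault(name, {}) or {}


def get_metadata_fixed(kwargs: Dict[str, Any], name: str = "metadata") -> Dict[str, Any]:
    metadata = kwargs.get(name)
    if metadata is None:
        # Explicitly store the fresh dict so later writes stay visible.
        metadata = {}
        kwargs[name] = metadata
    return metadata


buggy_kwargs: Dict[str, Any] = {"metadata": {}}
get_metadata_buggy(buggy_kwargs)["_failover_excluded_ids"] = ["dep-1"]

fixed_kwargs: Dict[str, Any] = {"metadata": None}
get_metadata_fixed(fixed_kwargs)["_failover_excluded_ids"] = ["dep-1"]
```

After these calls `buggy_kwargs["metadata"]` is still empty, while `fixed_kwargs["metadata"]` carries the excluded IDs into the next hop.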

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(router): regression for weighted failover metadata persistence

Asserts kwargs["metadata"]["_failover_excluded_ids"] is populated after
_maybe_run_weighted_failover, proving the metadata dict written by the
helper is the same object that lives in kwargs (no disconnected copy).
Pairs with the prior fix that replaced `setdefault(..., {}) or {}` with
an explicit get/assign so writes survive across hops.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): harden weighted failover error/state handling

- Catch RouterRateLimitError (ValueError) alongside openai.APIError in
  _maybe_run_weighted_failover so an exhausted intra-group retry falls
  through to the regular cross-group fallback path instead of bubbling
  out and bypassing configured fallbacks.
- Stop mutating the shared input_kwargs dict; build a local copy with
  the weighted-failover keys so the entry (with _excluded_deployment_ids)
  cannot leak into later fallback paths reading the same dict.
- _get_excluded_filtered_deployments now returns an empty list when the
  exclusion filter removes every healthy deployment, instead of falling
  back to the original list. The original-list behavior risked re-picking
  the just-failed deployment; callers already handle the empty case by
  raising their no-deployments error, which weighted failover now catches
  and converts into a normal cross-group fallback.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(router): fall through to rpm/tpm when total weight is zero

When the weight metric's total is zero (e.g. after weighted-failover
exclusion leaves only zero-weight backups), continue to the next metric
(rpm/tpm) instead of returning a uniform random pick immediately. This
lets rpm/tpm still drive routing when present, and only falls back to
the uniform random pick at the end if no metric provides a positive
total weight.
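
A minimal sketch of the fall-through, assuming deployments are plain dicts carrying optional `weight`, `rpm`, and `tpm` values. `pick_deployment` and the metric order are assumptions of this sketch, not the router's actual code.

```python
import random
from typing import Any, Dict, List


def pick_deployment(deployments: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Weighted pick that skips any metric whose total is zero."""
    for metric in ("weight", "rpm", "tpm"):
        weights = [d.get(metric) or 0 for d in deployments]
        if sum(weights) > 0:
            return random.choices(deployments, weights=weights, k=1)[0]
        # Total is zero: try the next metric instead of picking uniformly now.
    # No metric gave a positive total, so fall back to a uniform pick.
    return random.choice(deployments)


# Only zero-weight backups remain, but rpm still separates them.
backups = [
    {"id": "a", "weight": 0, "rpm": 0},
    {"id": "b", "weight": 0, "rpm": 10},
]
```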

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(router): skip weighted failover when remaining deployments are all in cooldown

_maybe_run_weighted_failover was computing 'remaining' from all_deployments
(every deployment in the model group, including those in cooldown). This meant
that when all non-excluded deployments were in cooldown the method still invoked
run_async_fallback unnecessarily, which propagated into async_get_healthy_deployments,
found no eligible deployments, and raised RouterRateLimitError — only safely
caught thanks to the earlier exception-broadening fix.

The fix: before computing 'remaining', fetch the current cooldown set via
_async_get_cooldown_deployments and subtract it from all_ids. This allows
_maybe_run_weighted_failover to return None immediately (skipping the
run_async_fallback call entirely) when every non-failed deployment is in cooldown,
letting the caller fall through to the correct cross-group fallback path without
the wasteful extra round-trip.
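
A minimal sketch of the cooldown subtraction, using plain sets of deployment IDs. `remaining_candidates` and `should_attempt_weighted_failover` are hypothetical names; the real method fetches the cooldown set asynchronously.

```python
from typing import Iterable, Set


def remaining_candidates(
    all_ids: Iterable[str], excluded_ids: Iterable[str], cooldown_ids: Iterable[str]
) -> Set[str]:
    """Deployments that are neither already failed nor currently in cooldown."""
    return set(all_ids) - set(excluded_ids) - set(cooldown_ids)


def should_attempt_weighted_failover(
    all_ids: Iterable[str], excluded_ids: Iterable[str], cooldown_ids: Iterable[str]
) -> bool:
    # An empty remainder means the caller should skip the intra-group retry
    # and fall through to the cross-group fallback path.
    return bool(remaining_candidates(all_ids, excluded_ids, cooldown_ids))
```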

Tests added:
- unit: _maybe_run_weighted_failover returns None without calling run_async_fallback
  when all remaining deployments are in cooldown
- unit: _maybe_run_weighted_failover still calls run_async_fallback when at least
  one healthy (non-cooldown) deployment is available
- integration: end-to-end fallthrough to cross-group fallback when remaining
  deployments are in cooldown

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpo… (#27976)

* fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpoint (#27943)

* docs: add one-line docstring to _disable_debugging (#27894)

Squash-merged by litellm-agent from oss-agent-shin's PR.

* Add jp. Bedrock cross-region inference profile for claude-sonnet-4-6 (#27831)

Squash-merged by litellm-agent from Cyberfilo's PR.

* Sanitize empty text content blocks on /v1/messages (#27832)

Squash-merged by litellm-agent from Cyberfilo's PR.

* fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpoint

The bedrock-mantle gateway (Claude Mythos Preview) serves the Anthropic
Messages API at /anthropic/v1/messages; /v1/messages returns 404 Not
Found. Both AmazonMantleConfig (chat/completions caller route) and
AmazonMantleMessagesConfig (anthropic-messages caller route) hardcoded
the wrong path, so every Mantle request 404'd before reaching the model.

Per the Anthropic docs: "[Claude in Amazon Bedrock] uses the Messages
API at /anthropic/v1/messages with SSE streaming."
https://platform.claude.com/docs/en/api/claude-on-amazon-bedrock

Confirmed independently against the live endpoint:
  /v1/chat/completions      -> 200 OK
  /v1/messages              -> 404 Not Found  (what litellm used)
  /anthropic/v1/messages    -> 200 OK         (Claude only)

Adds a regression test asserting both Mantle configs build the
/anthropic/v1/messages path, and updates the existing assertions that
encoded the wrong path.

---------

Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>

* fix: sanitize empty text blocks in sync anthropic_messages_handler path

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: João Costa <13508071+jpv-costa@users.noreply.github.com>
Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(utils): import get_secret at runtime (#28014)

* fix(proxy): make /config/update env-var encryption idempotent

A single decrypt-then-encrypt chokepoint (_encrypt_env_variables_for_db)
now backs both update_config and save_config. Re-submitting a value the
Admin UI read back from /get/config/callbacks as ciphertext no longer
stacks a second encryption layer, which previously decrypted to garbage
and silently broke the callback. The chokepoint decrypts with the pure
_decrypt_db_variables (no os.environ mutation on the write path) and
encrypts exactly once; update_config merges only the sent keys so
untouched env vars keep their stored ciphertext byte-for-byte.
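
A minimal sketch of the decrypt-then-encrypt chokepoint. `toy_encrypt` and `toy_decrypt` are a base64 stand-in invented for this example, not the proxy's real encryption; the only property borrowed from the description is that decrypting a non-ciphertext value fails cleanly.

```python
import base64
from typing import Dict, Optional

_PREFIX = "enc:"  # toy marker so the stand-in can recognise its own output


def toy_encrypt(plaintext: str) -> str:
    return _PREFIX + base64.b64encode(plaintext.encode()).decode()


def toy_decrypt(value: str) -> Optional[str]:
    """Return the plaintext, or None when value is not toy ciphertext."""
    if not value.startswith(_PREFIX):
        return None
    return base64.b64decode(value[len(_PREFIX):]).decode()


def encrypt_env_variables_for_db(env: Dict[str, str]) -> Dict[str, str]:
    """Decrypt first, then encrypt exactly once, so re-submits do not stack."""
    encrypted = {}
    for key, value in env.items():
        plaintext = toy_decrypt(value)
        if plaintext is None:
            plaintext = value  # value was plaintext to begin with
        encrypted[key] = toy_encrypt(plaintext)
    return encrypted
```

Feeding the output back in returns the same single-layer value. A real cipher with a random nonce would produce different bytes on each call, so byte-for-byte stability there depends on leaving untouched keys alone, as the commit describes.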

* test(proxy): add endpoint-level regression for /config/update double-encryption

Adds test_update_config_env_var_round_trip_not_double_encrypted, which
drives the real /config/update handler: first write plaintext, then
re-POST the stored ciphertext (the Admin UI round-trip) and assert the
value is not stacked with a second encryption layer and untouched keys
stay byte-identical. Verified to fail against the pre-fix handler and
pass after. Also tightens the unit test to exactly three ciphertext
re-feeds.

* chore(ci): modernize model references in tests and configs (#27856)

* test: modernize models used in CircleCI e2e test suites

Replaces obsolete models (gpt-4o, gpt-4o-mini, gpt-3.5-turbo,
claude-3-5-sonnet-20240620, claude-sonnet-4-20250514) with current
equivalents across the e2e_openai_endpoints and
proxy_e2e_anthropic_messages_tests CircleCI jobs.

- gpt-4o -> gpt-5.5 (responses API e2e tests)
- gpt-4o-mini -> gpt-5-mini (websocket responses, oai_misc_config)
- gpt-4o-mini-2024-07-18 -> gpt-4.1-mini-2025-04-14 (fine-tuning,
  still actively fine-tunable)
- gpt-4 / gpt-3.5-turbo target_model_names example -> gpt-5.5 /
  gpt-5-mini
- bedrock claude-3-5-sonnet-20240620 batch entry -> haiku-4-5-20251001
  (also aligning oai_misc_config model_name with what
  test_bedrock_batches_api.py actually requests)
- bedrock claude-sonnet-4-20250514 (deprecated, retires 2026-06-15)
  -> claude-sonnet-4-5-20250929

* test: point bedrock-claude-sonnet-4 alias at Sonnet 4.6, not 4.5

Greptile/Cursor flagged that after the previous commit, the
bedrock-claude-sonnet-4 alias collided with bedrock-claude-sonnet-4.5
(both pointed to claude-sonnet-4-5-20250929). Rename to
bedrock-claude-sonnet-4.6 and point it at the Sonnet 4.6 Bedrock ID
(us.anthropic.claude-sonnet-4-6, already in the litellm model
registry) so the alias name matches the underlying model version.

* test: modernize models across remaining CI-mounted configs & tests

Expands the modernization sweep to all CircleCI-mounted proxy configs
and to test directories where the model literal is a fixture/route key
(not the test's subject).

Config changes:
- proxy_server_config.yaml: bump gpt-3.5-turbo / gpt-3.5-turbo-1106 /
  gpt-4o / gemini-1.5-flash / dall-e-3 underlying models; rename
  gpt-3.5-turbo-end-user-test alias to gpt-5-mini-end-user-test; bump
  text-embedding-ada-002 underlying to text-embedding-3-small. User-
  facing aliases (gpt-3.5-turbo, gpt-4, text-embedding-ada-002, etc.)
  preserved for backward compatibility with tests.
- simple_config.yaml, otel_test_config.yaml, spend_tracking_config.yaml:
  bump gpt-3.5-turbo underlying to gpt-5-mini.
- pass_through_config.yaml: claude-3-5-sonnet / claude-3-7-sonnet /
  claude-3-haiku entries replaced with claude-sonnet-4-5 /
  claude-haiku-4-5 / claude-opus-4-7.
- oai_misc_config.yaml: align alias name with the gpt-5-mini rename.

Test changes (proactive: claude-sonnet-4-20250514 /
claude-opus-4-20250514 retire 2026-06-15):
- tests/llm_translation/test_anthropic_completion.py: bump 3 references
  + paired Vertex AI ID to claude-sonnet-4-5.
- tests/llm_translation/test_optional_params.py: bump 2 references.
- tests/pass_through_unit_tests/test_anthropic_messages_passthrough.py
  and test_bedrock_anthropic_messages_test.py: bump router fixtures
  using the deprecated model IDs.
- tests/pass_through_unit_tests/base_anthropic_messages_tool_search_test.py:
  modernize docstring examples.
- tests/test_end_users.py: update references to renamed alias.

* test: modernize placeholder model literals in router_unit_tests

Mass replace_all on fixture/placeholder model literals across the
router_unit_tests/ suite (model name is a routing key / label, not the
test subject). Sub-agent sweep so far — additional commits will follow
for logging_callback_tests/, enterprise/, top-level tests/test_*.py,
and other CI-mounted dirs.

Mappings applied:
- gpt-3.5-turbo -> gpt-5-mini
- gpt-4 (bare) -> gpt-5.5
- gpt-4o (bare) -> gpt-5
- text-embedding-ada-002 -> text-embedding-3-small
- claude-3-sonnet-20240229 / claude-3-opus-20240229 /
  claude-3-haiku-20240307 / claude-3-5-sonnet-20240620 ->
  claude-sonnet-4-5-20250929 / claude-opus-4-7 /
  claude-haiku-4-5-20251001 as appropriate

Explicitly preserved:
- gpt-4o-mini-* variants (transcribe, tts, etc.) where they're current
- gpt-4-turbo / gpt-4-vision-preview / gpt-4-0613 (subject literals)
- JSONL batch body literals
- Mock LLM response model fields (must match upstream)
- Fake/mock identifiers

* test: modernize placeholder model literals across remaining CI suites

Sub-agent sweep across logging_callback_tests/, guardrails_tests/,
enterprise/, pass_through_unit_tests/, otel_tests/,
llm_responses_api_testing/, batches_tests/, spend_tracking_tests/,
litellm_utils_tests/, unified_google_tests/, and a few top-level
tests/test_*.py files where the model literal is a fixture or
placeholder (router model_list, mock standard logging payload, mock
callback data) rather than the test's subject.

Mappings applied (see scope notes below):
- gpt-3.5-turbo -> gpt-5-mini
- gpt-4 (bare) -> gpt-5.5
- gpt-4o (bare) -> gpt-5.5 (corrected from initial gpt-5 — bare gpt-5
  is not a valid OpenAI alias; only gpt-5.5 / gpt-5.4 / gpt-5.2-codex
  / gpt-5-mini exist)
- gpt-4o-mini (bare) -> gpt-5-mini
- text-embedding-ada-002 -> text-embedding-3-small
- claude-3-sonnet-20240229 -> claude-sonnet-4-5-20250929
- claude-3-opus-20240229 -> claude-opus-4-7
- claude-3-haiku-20240307 -> claude-haiku-4-5-20251001
- claude-3-5-sonnet-20240620/20241022 -> claude-sonnet-4-5-20250929
- claude-3-7-sonnet-20250219 -> claude-sonnet-4-6
- gemini-1.5-flash -> gemini-2.5-flash
- gemini-1.5-pro -> gemini-2.5-pro

Explicitly preserved (not modernized):
- llm_translation/ tests where model is the SUBJECT (provider-specific
  translation/transformation logic). Only the deprecated 20250514
  references were already bumped in a prior commit.
- Cost-calc / tokenizer subject tests in test_utils.py (skip-ranges
  documented by the sub-agent).
- Bedrock model IDs in test_health_check.py path-stripping tests.
- JSONL batch request bodies and mock LLM response bodies (must match
  upstream literal).
- Langfuse expected-request-body JSON fixtures (cost values are exact-
  match-asserted; changing the model would shift response_cost).
- gpt-3.5-turbo-instruct (text-completion endpoint; no modern OpenAI
  equivalent).
- Top-level tests calling the proxy through user-facing aliases
  (gpt-3.5-turbo, gpt-4, text-embedding-ada-002, dall-e-3) — aliases
  in proxy_server_config.yaml stay; only the underlying model was
  bumped.
- tests/test_gpt5_azure_temperature_support.py (the test's whole point
  is model-name handling).
- Fake / mock / openai/fake identifiers.

Notable side fixes:
- test_spend_accuracy_tests.py: UPSTREAM_MODEL now matches what
  spend_tracking_config.yaml's proxy actually routes to (gpt-5-mini),
  resolving a latent inconsistency.
- proxy_server_config.yaml: bare `gpt-5` alias renamed to `gpt-5.5`
  (bare gpt-5 is not a valid OpenAI alias).
- test_batches_logging_unit_tests.py: explicit_models list entries
  kept distinct (gpt-5-mini + gpt-5.5) after bulk rename.

* test: fix CI failures from model modernization sweep

CI surfaced 5 categories of regression from the bulk modernization:

1. Azure deployment names are customer-specific. Reverted:
   - tests/litellm_utils_tests/test_health_check.py:
     azure/text-embedding-3-small -> azure/text-embedding-ada-002 (the
     CI Azure account does not have a text-embedding-3-small deployment).
   - tests/logging_callback_tests/test_custom_callback_router.py:
     same revert for two router fixtures driving aembedding.

2. gpt-5 family does not accept temperature != 1. Tests that pass a
   custom temperature swapped from gpt-5-mini to gpt-4.1-mini (modern
   non-reasoning OpenAI mini that still accepts temperature/logprobs):
   - tests/logging_callback_tests/test_datadog.py
   - tests/logging_callback_tests/test_langsmith_unit_test.py
   - tests/logging_callback_tests/test_otel_logging.py

3. proxy_server_config.yaml's gpt-3.5-turbo-large alias was routing to
   gpt-5.5 (a reasoning model that rejects logprobs). The proxy test
   tests/test_openai_endpoints.py::test_chat_completion_streaming
   exercises logprobs/top_logprobs through that alias. Bumped the
   underlying model to gpt-4.1 (non-reasoning, still modern).

4. tests/logging_callback_tests/test_gcs_pub_sub.py asserts against a
   pinned JSON fixture (gcs_pub_sub_body/spend_logs_payload.json) with
   hardcoded model="gpt-4o" and a model-specific spend value. Reverted
   the litellm.acompletion calls in the test to model="gpt-4o" so the
   fixture's exact-match assertions still hold.

5. tests/pass_through_unit_tests/test_anthropic_messages_passthrough.py:
   anthropic.messages.create routing to openai/gpt-5-mini returned an
   empty content[0] with max_tokens=100 (reasoning-token consumption).
   Swapped to openai/gpt-4.1-mini.

* test: fix Assistants API model + 2 cursor[bot] review nits

1. pass_through_unit_tests/test_custom_logger_passthrough.py: gpt-5.5
   isn't accepted by the /v1/assistants endpoint
   ("unsupported_model"). Switch to gpt-4.1-mini (modern, Assistants-
   API-supported, non-reasoning).

2. example_config_yaml/pass_through_config.yaml: the previous sweep
   bumped the claude-3-7-sonnet alias to claude-opus-4-7, which is a
   tier change (Sonnet -> Opus). Map to claude-sonnet-4-6 to keep the
   Sonnet tier intact. (Cursor bugbot review.)

3. example_config_yaml/simple_config.yaml: model_name was left as
   gpt-3.5-turbo while the underlying was bumped to gpt-5-mini, which
   muddles the "simple" example. Make both sides gpt-5-mini so the
   most basic example is a straight 1:1 mapping again. (Cursor bugbot
   review.)

* fix: revert gpt-4/gpt-3.5-turbo alias underlying to non-reasoning models

tests/test_openai_endpoints.py::test_completion calls the proxy alias
"gpt-4" with temperature=0, and other tests call gpt-3.5-turbo with
custom temperature / logprobs / the legacy /v1/completions endpoint.
The earlier modernization mapped both aliases to gpt-5.5 / gpt-5-mini,
which are reasoning models that reject temperature != 1 and don't
expose /v1/completions. Map the aliases to gpt-4.1 / gpt-4.1-mini
(modern non-reasoning OpenAI models) instead — keeps user-facing
aliases preserved while picking a current underlying that still
supports the parameters/endpoints the tests exercise.

* test(proxy): isolate run_server CLI tests from prisma DB-setup path

test_keepalive_timeout_flag and test_timeout_worker_healthcheck_flag
were the only run_server tests in test_proxy_cli.py that neither
stripped DATABASE_URL/DIRECT_URL nor mocked the prisma DB path. When a
DATABASE_URL is present (CI/env leak), run_server --local enters the DB
block and stalls in subprocess.run(["prisma"]), which has no timeout
(proxy_cli.py:987), and then in the ProxyExtrasDBManager migrate-deploy
retry loops, costing ~370s per test on the CI runner. --dist=loadscope pins both to
one xdist worker, so the proxy-infra job appears stuck at 99% and hits
the 20-min timeout.

Apply the same isolation every other run_server test in this file
already uses: mock PrismaManager.setup_database +
should_update_prisma_schema and strip DATABASE_URL/DIRECT_URL. Full
module drops from 31.7s to 2.9s locally; both tests fall off the slow
list.

* feat: add OTEL GenAI latest-experimental semantic convention support (#27418)

- Introduce `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental` opt-in that switches OTEL traces to conform with the OpenTelemetry GenAI semantic conventions specification
- Extract all semconv behavior into a new `OTELGenAISemconvMixin` class in `gen_ai_semconv.py`, mixed into `OpenTelemetry` to keep concerns separated
- In semconv mode, span name follows `{operation} {model}` pattern (e.g. `chat gpt-4`) and span kind is set to `CLIENT` instead of legacy `litellm_request`
- Replace `gen_ai.system` with `gen_ai.provider.name` and drop `llm.is_streaming` in semconv mode; add `gen_ai.request.{frequency_penalty,presence_penalty,top_k,seed,stop_sequences,stream,choice.count}` and `gen_ai.usage.cache_{creation,read}.input_tokens` attributes
- Replace per-message `gen_ai.content.prompt` / per-choice `gen_ai.content.completion` log events with a single consolidated `gen_ai.client.inference.operation.details` event; omit `gen_ai.input/output.messages` when content capture is disabled
- Suppress the non-standard `raw_gen_ai_request` child span entirely in semconv mode
- Support both programmatic (`OpenTelemetryConfig.semconv_stability_opt_in` field) and environment variable activation; the two sources are unioned so either or both can enable the opt-in
- Extract OTEL SDK `LogRecord` / `SeverityNumber` version-compatibility shim into a reusable `_otel_log_types()` static method to deduplicate the `< 1.39.0` / `>= 1.39.0` import branching
- Add 30+ unit tests covering opt-in gating, span naming, attribute emission/omission rules, stop sequence normalization, cache token attributes, and the consolidated event lifecycle
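
A minimal sketch of the opt-in gating and span naming described above, not the actual mixin code. It assumes the opt-in value is a comma-separated string in both sources and that the legacy span name is `litellm_request`; the helper names `resolve_opt_ins` and `span_name` are hypothetical.

```python
from typing import Mapping, Optional, Set

GEN_AI_OPT_IN = "gen_ai_latest_experimental"


def _split(value: Optional[str]) -> Set[str]:
    """Split a comma-separated opt-in string into a set of trimmed tokens."""
    if not value:
        return set()
    return {token.strip() for token in value.split(",") if token.strip()}


def resolve_opt_ins(config_value: Optional[str], env: Mapping[str, str]) -> Set[str]:
    """Union the programmatic value with the environment variable (hypothetical helper)."""
    return _split(config_value) | _split(env.get("OTEL_SEMCONV_STABILITY_OPT_IN"))


def span_name(operation: str, model: str, opt_ins: Set[str]) -> str:
    """Return '{operation} {model}' in semconv mode, else the assumed legacy name."""
    if GEN_AI_OPT_IN in opt_ins:
        return f"{operation} {model}"
    return "litellm_request"  # assumption: legacy span name
```

Because the two sources are unioned, setting the opt-in in either place is enough to switch modes; neither source can switch it off once the other enables it.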

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>

* chore: retrigger CI

* test(ci): add reasoning_effort grid v4 e2e regression suite

Encode the 231-cell QA sweep (21 provider x model combos x 11 effort
values) from #27039 / #27074 as an automated CircleCI-gated regression
suite. Each cell hits the real provider endpoint, captures the outgoing
wire body via a pre-call CustomLogger, and asserts:

- thinking.type, output_config.effort, thinking.budget_tokens, max_tokens
  in the captured request body (regression signal for silent drops/strips
  in any provider transformation)
- HTTP status (200 vs BadRequestError -> 400) returned by litellm
  (regression signal for clean-error vs leaked-500 mappings)

The matrix is encoded as a small rule set keyed by (model_mode, effort)
plus per-model xhigh/max capability overrides, then expanded across the
five chat-completion routes (Anthropic direct, Azure AI Foundry, Vertex
AI, Bedrock Converse, Bedrock Invoke /chat) and the Bedrock Invoke
/v1/messages route. Cells skip at runtime when the route's provider env
vars are absent, so PR builds without credentials no-op gracefully.

Wired into CircleCI as the reasoning_effort_grid_v4_e2e job behind the
existing main / litellm_* branch filter.

* fix(reasoning_effort_grid_v4): cleanup unused fixture, parse converse body, guard budget tokens

- Remove unused vertex_credentials_path fixture (and now-unused os import)
  from conftest.py.
- Parse Bedrock Converse complete_input_dict (logged as a JSON string by
  converse_handler.py) before passing to _assert_cell, so dict accessors
  work uniformly across routes.
- Extend _BUDGET_TOKENS with xhigh and max entries so the budget-mode
  branch in expected() cannot KeyError if a future budget model gains
  the matching cap.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(reasoning_effort_grid_v4): grant sonnet-4-6 entries the max-effort cap

The runtime _validate_effort_for_model allows effort='max' for any
Claude 4.6 model (opus or sonnet), and model_prices_and_context_window
sets supports_max_reasoning_effort: true for claude-sonnet-4-6. The
grid spec previously gave sonnet-4-6 entries _CAPS_NONE, so expected()
returned status=400 for effort='max', which mismatched the runtime's
status=200 and caused 6 cells (one per route) to fail.

Rename _CAPS_OPUS_4_6 to _CAPS_4_6 (since the cap set is shared by
opus and sonnet 4.6) and assign it to all sonnet-4-6 entries.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* refactor(tests): move reasoning_effort grid suite under llm_translation, drop v4 naming

- Drop the "v4" suffix throughout: it referred to the QA sweep iteration,
  not this test suite. There's only one regression suite, so just call it
  reasoning_effort_grid.
- Move tests/test…

* Revert "merge main (#28629)"

This reverts commit e4870f7bb68cf1dbc43b4bae9e5ee09b86aea481.

* refactor(purview): remove unused logging_only constructor param

The logging_only kwarg was passed in but only echoed in a log message;
behavioral mode is driven by event_hook. Drop the dead parameter and
the no-op derivation in the initializer.

* fix(purview): log post-call success hook via @log_guardrail_information

* fix(purview): raise HTTPException 400 on Responses API transform failure

When the Responses API input cannot be transformed in blocking mode,
raise an HTTPException with a clear 400 detail instead of re-raising the
raw transformation exception. The latter would surface as a 500 and be
indistinguishable from a backend provider failure; the former matches
the docstring and the existing pre-call HTTPException patterns.

* Fix Purview audit prompt extraction for Responses API logging hook

The async_logging_hook receives kwargs == litellm_logging_obj.model_call_details,
where function_setup mirrors the raw responses input under the 'messages' key
(string or list of input items, not chat-format messages). The previous logic
took the generic 'messages' branch first, calling get_prompt_text_for_dlp on
data that is not in the chat-message format, and the responses-specific
extractor was never reached. As a result, the prompt half of the Purview
audit was silently skipped (or produced garbage text) for Responses API calls
in logging_only mode.

Check call_type first and route Responses API calls to
_responses_api_input_to_str, which reads the original 'input' and
'instructions' keys that pre_call and update_environment_variables persist on
model_call_details.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview_dlp): check call_type for responses before messages in pre-call hook

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview_dlp): don't reject empty-string prompts as token-id prompts

completion_prompt_to_str returns None for both token-id lists *and*
empty/whitespace-only strings (stripped). The previous check 'raw_prompt
is not None and prompt_text is None' conflated these cases, raising the
misleading 'Token-id completion prompts cannot be scanned' error for
harmless empty-string prompts like {"prompt": ""}.

Tighten the check to only reject true token-id prompts (non-empty list
of ints). Empty/whitespace string prompts now fall through to the
'no prompt text → skip scan' path.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview_dlp): scan ResponsesAPI streaming completed event for DLP

The streaming iterator hook previously routed all assembled streams through
stream_chunk_builder, which only knows chat/text-completion deltas. Responses
API streams emit typed events (response.created, response.completed, ...)
whose final event carries the full ResponsesAPIResponse, so stream_chunk_builder
would raise APIError or pass the assembled response through unchanged.

Detect Responses API streaming chunks before the chat/text fallthrough and
extract the assembled ResponsesAPIResponse from the latest response.completed
(or response.failed / response.incomplete) event, then scan its output_text
via the same _completion_response_text_parts path used by non-streaming.

* fix(purview_dlp): fail closed on incomplete Responses API streams

Previously, _assemble_responses_api_from_chunks returned None both when the
stream was not a Responses API stream and when it was a Responses API stream
but no final ResponsesAPIResponse-bearing event was received. The caller
treated both cases identically and fell through to stream_chunk_builder,
which does not understand Responses API events.

Return a (is_responses_api_stream, assembled) tuple so the caller can fail
closed with an accurate error when Responses API events were seen but no
final response event arrived, instead of misrouting events to the chat
chunk builder.
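
The tuple return can be sketched as follows. This is an illustration only: chunks are modeled as plain dicts with `type` and `response` keys, and `assemble_from_events` / `scan_target` are hypothetical stand-ins for the real helpers.

```python
from typing import Any, Dict, List, Optional, Tuple

FINAL_EVENTS = ("response.completed", "response.failed", "response.incomplete")


def assemble_from_events(
    chunks: List[Dict[str, Any]],
) -> Tuple[bool, Optional[Dict[str, Any]]]:
    """Return (is_responses_api_stream, final_response_or_None)."""
    is_responses_stream = False
    final = None
    for chunk in chunks:
        event_type = chunk.get("type")
        if isinstance(event_type, str) and event_type.startswith("response."):
            is_responses_stream = True
            if event_type in FINAL_EVENTS and chunk.get("response") is not None:
                final = chunk["response"]  # keep the latest final event
    return is_responses_stream, final


def scan_target(chunks: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Fail closed when Responses API events were seen but no final event arrived."""
    is_stream, assembled = assemble_from_events(chunks)
    if is_stream and assembled is None:
        raise ValueError("Responses API stream ended without a final response event")
    # None here means "not a Responses API stream": the caller may fall back
    # to the chat/text chunk builder.
    return assembled
```

Separating the two `None` cases is the point of the tuple: only a non-Responses stream is allowed to reach the chat chunk builder.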

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview_dlp): wrap upstream DLP errors as HTTPException(400) in blocking mode

Previously a bare `raise` in `_check_content` re-propagated raw network /
HTTP errors (e.g. httpx.HTTPStatusError, ConnectionError) to the client,
which would surface as a 500. Now blocking-mode failures from the Graph
`processContent` call (and OAuth token / protection-scopes calls) are
converted to HTTPException(400) with a structured detail payload, while
HTTPException instances raised by upstream layers continue to propagate
unchanged. Logging-only mode is unaffected.

* test(purview_dlp): cover HTTP error paths for token + Graph POST

* fix(purview): include Responses API function_call arguments in DLP scan

ResponsesAPIResponse.output_text only aggregates output_text content blocks
and ignores function_call items, so sensitive data in model-generated
tool-call arguments would bypass the DLP scan. Mirror the ModelResponse
path by extracting function_call arguments explicitly from the output list.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview_dlp): scan ResponsesAPIResponse in stream_chunk_builder fallback

If a Responses API stream slips past _assemble_responses_api_from_chunks
(no chunks with type starting with 'response.') and stream_chunk_builder
somehow returns a ResponsesAPIResponse, route it through
_completion_response_text_parts instead of the 'not a ModelResponse'
pass-through that would leak content unscanned.

* fix(purview_dlp): preserve upstream Graph status on `_check_content` errors

httpx.HTTPStatusError from Graph API (429, 503, etc.) was always wrapped
as HTTPException(400), making rate-limits and infrastructure errors
indistinguishable from a DLP policy block and stripping retry-after info.

Now:
- 429 and 5xx pass through with their original status code; the upstream
  Retry-After header is forwarded.
- 401/403 (proxy-side credential/consent issue, not actionable by the
  client) map to 502 Bad Gateway.
- A debug log makes the logging_hook -> async_logging_hook deferral
  observable so audit failures don't silently disappear if the framework
  stops dispatching async_logging_hook for some code path.
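
The status mapping in the list above can be sketched as a small pure function. The name `map_graph_error` is hypothetical, and the default of 400 for all other statuses is an assumption carried over from the previous behavior.

```python
from typing import Dict, Mapping, Tuple


def map_graph_error(status: int, headers: Mapping[str, str]) -> Tuple[int, Dict[str, str]]:
    """Map an upstream Graph status to the (status, headers) sent to the client."""
    if status == 429 or 500 <= status <= 599:
        # Rate limits and infrastructure errors keep their original status.
        forwarded: Dict[str, str] = {}
        # Note: a plain dict lookup is case-sensitive; real HTTP header
        # containers are usually case-insensitive.
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            forwarded["Retry-After"] = retry_after
        return status, forwarded
    if status in (401, 403):
        # Proxy-side credential/consent problem, not actionable by the client.
        return 502, {}
    return 400, {}  # assumption: everything else stays a 400
```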

* fix(purview_dlp): reject nested token-id completion prompts

OpenAI /v1/completions accepts prompt: [[token, ids]] (multi-prompt
token-id batches). The previous blocking-mode check only fired on a
flat list[int], so nested or mixed token-id prompts skipped the
Purview scan while the model still received the data.

Extract the token-id detection into PurviewGuardrailBase.is_token_id_prompt
and use it from the pre-call hook so every list shape Purview cannot
decode fails closed.
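
A rough sketch of token-id detection covering flat, nested, and mixed list shapes. It is not the real `PurviewGuardrailBase.is_token_id_prompt`; the function name `looks_like_token_id_prompt` and the exact rules are assumptions for illustration.

```python
from typing import Any


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def looks_like_token_id_prompt(prompt: Any) -> bool:
    """True if a non-empty list prompt carries token ids at any position."""
    if not isinstance(prompt, list) or not prompt:
        return False
    for item in prompt:
        if _is_int(item):
            return True  # flat list[int], or a mixed list
        if isinstance(item, list) and any(_is_int(tok) for tok in item):
            return True  # nested multi-prompt token-id batch
    return False
```

Strings, including empty ones, and lists of strings return `False`, so they continue down the normal text-scan or skip-scan path.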

* fix(purview): drop caller-influenceable identity fallbacks in audit resolver

Logging-only hook now resolves the Purview user from only the
proxy-injected user_api_key_user_id (which mirrors UserAPIKeyAuth.user_id
after the proxy strips caller-supplied user_api_key_* keys).  Skipping
the audit when no trusted identity is available prevents a caller from
submitting metadata.user_id pointing at a victim's Entra object id and
having their prompt/response sent to Purview under that user's
identity.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Yuneng Jiang <yuneng@berri.ai>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Kenan Yildirim <kenan@kenany.me>
Co-authored-by: vladpolevoi <vladp@lasso.security>
Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>
Co-authored-by: Dennis Henry <dennis.henry@okta.com>
Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com>
Co-authored-by: ishaan-berri <155045088+ishaan-berri@users.noreply.github.com>
Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: João Costa <13508071+jpv-costa@users.noreply.github.com>
Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>
Co-authored-by: Michael-RZ-Berri <michael@berri.ai>
Co-authored-by: Shivam Rawat <shivam@berri.ai>
Co-authored-by: Tai An <antai12232931@outlook.com>
Co-authored-by: Vincent <yimao1231@gmail.com>
Co-authored-by: Kris Xia <xiajiayi0506@gmail.com>
Co-authored-by: d 🔹 <liusway405@gmail.com>
Co-authored-by: Fabrizio Cafolla <developer@fabriziocafolla.com>
Co-authored-by: Tom Denham <tom@tomdee.co.uk>
Co-authored-by: escon1004 <70471150+escon1004@users.noreply.github.com>
Co-authored-by: Divyansh Singhal <97736786+Divyansh8321@users.noreply.github.com>
Co-authored-by: robin-fiddler <robin@fiddler.ai>
Co-authored-by: Michael Riad Zaky <michaelr@Mac.localdomain>
Co-authored-by: Krrish Dholakia <krrish+github@berri.ai>
Co-authored-by: Noah Nistler <60981020+noahnistler@users.noreply.github.com>
Co-authored-by: Felipe Rodrigues Gare Carnielli <felipe.gare@hotmail.com>
Co-authored-by: harish-berri <harish@berri.ai>
Co-authored-by: milan-berri <milan@berri.ai>
Co-authored-by: Ryan <ryan@Ryans-MBP.localdomain>
Co-authored-by: Claude (greptile subagent) <claude-greptile-bot@anthropic.com>
Co-authored-by: TorvaldUtne <78661304+TorvaldUtne@users.noreply.github.com>
Co-authored-by: mubashir1osmani <mubashir.osmani777@gmail.com>
Co-authored-by: Isha <72744901+IshaMeera@users.noreply.github.com>
Co-authored-by: cwang-otto <chengxuan.wang@ottotheagent.com>
Co-authored-by: Roman Pushkin <roman.pushkin@gmail.com>
Co-authored-by: boarder7395 <37314943+boarder7395@users.noreply.github.com>
Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>
Co-authored-by: Dibyo Mukherjee <dibyo@adobe.com>
Co-authored-by: Kevin Zhao <zkm8093@gmail.com>
Co-authored-by: Matthew Lapointe <lapointe683@gmail.com>
Co-authored-by: Elon Azoulay <elon.azoulay@gmail.com>
Co-authored-by: afoninsky <andrey.afoninsky@gmail.com>
Co-authored-by: Joseph Barker <156112794+seph-barker@users.noreply.github.com>
Co-authored-by: Maruti Agarwal <88403147+marutilai@users.noreply.github.com>
Co-authored-by: Cursor Bugbot <bugbot@cursor.com>
Co-authored-by: Greptile <greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Greptile Reviewer <greptile-apps@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Felipe Garé <90070734+FelipeRodriguesGare@users.noreply.github.com>
Co-authored-by: withomasmicrosoft <withomas@microsoft.com>
Co-authored-by: Aditya Singh <60082699+adityasingh2400@users.noreply.github.com>
2026-05-22 15:59:04 -07:00
yuneng-jiang
574ee7526d
test(streaming): tolerate Vertex 429 wrapped in MidStreamFallbackError (#28669)
Streaming 429s are wrapped in MidStreamFallbackError so the Router can
fall back; the existing 'except litellm.RateLimitError: pass' in
test_vertex_ai_stream no longer matches, causing the generic
pytest.fail branch to fire when upstream Vertex returns 429.

Add a sibling except for MidStreamFallbackError that only swallows it
when e.original_exception is a RateLimitError, so unrelated streaming
failures still fail the test.
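
A self-contained sketch of the sibling except clause. The two exception classes are local stand-ins for the litellm ones, and `run_tolerating_rate_limits` is a hypothetical wrapper; the real test calls pytest.fail for other errors, whereas this sketch simply re-raises.

```python
class RateLimitError(Exception):
    """Stand-in for litellm.RateLimitError."""


class MidStreamFallbackError(Exception):
    """Stand-in wrapper that carries the original exception."""

    def __init__(self, message, original_exception=None):
        super().__init__(message)
        self.original_exception = original_exception


def run_tolerating_rate_limits(stream_call):
    """Return True if the call succeeded, False if it was rate limited."""
    try:
        stream_call()
        return True
    except RateLimitError:
        return False
    except MidStreamFallbackError as e:
        # Only swallow the wrapper when it hides a rate limit.
        if isinstance(e.original_exception, RateLimitError):
            return False
        raise
```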
2026-05-22 15:57:29 -07:00
milan-berri
1b141bc588
fix(bedrock): decouple STS region from Bedrock aws_region_name (#28245)
* fix(bedrock): decouple STS region from Bedrock aws_region_name

STS AssumeRole now resolves signing region from aws_sts_endpoint (parsed
host) or AWS_REGION/AWS_DEFAULT_REGION instead of aws_region_name, fixing
air-gapped cross-region Bedrock setups and endpoint/signature mismatches.
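
A minimal sketch of the resolution order described above. The regex, the helper names, and the preference of AWS_REGION over AWS_DEFAULT_REGION are assumptions; the actual _resolve_sts_region may differ.

```python
import re
from typing import Mapping, Optional
from urllib.parse import urlparse

# Assumed host shape: sts[-fips].<region>.amazonaws.com[.cn]
_STS_HOST = re.compile(r"^sts(?:-fips)?\.([a-z0-9-]+)\.amazonaws\.com(?:\.cn)?$")


def region_from_sts_endpoint(endpoint: str) -> Optional[str]:
    """Parse the signing region out of a regional STS endpoint, if present."""
    # urlparse only fills in hostname when a '//' authority marker exists.
    parsed = urlparse(endpoint if "//" in endpoint else f"//{endpoint}")
    host = (parsed.hostname or "").lower()
    match = _STS_HOST.match(host)
    return match.group(1) if match else None


def resolve_sts_region(endpoint: Optional[str], env: Mapping[str, str]) -> Optional[str]:
    """Prefer the endpoint's region, then fall back to the environment."""
    if endpoint:
        region = region_from_sts_endpoint(endpoint)
        if region:
            return region
    return env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")
```

The global endpoint `sts.amazonaws.com` carries no region, so in this sketch it falls through to the environment variables.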

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(bedrock): add regression coverage for _build_sts_client_kwargs

Parametrize _resolve_sts_region and _build_sts_client_kwargs matrix cases,
and assert IRSA/web-identity paths use aligned STS endpoint and region_name.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(bedrock): tighten STS region helpers and drop redundant web-identity endpoint synthesis

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(bedrock): cover FIPS, GovCloud, and China STS endpoints

Addresses greptile P2: regex sts(?:-fips)? supported sts-fips hosts but
was not exercised by the parametrized parse test.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-23 00:39:24 +03:00
Krrish Dholakia
a3c953ed4e
style: apply black formatting to fix lint CI (LIT-3274) (#28639) (#28641)
* fix(bedrock): strip bedrock/ prefix and URL-encode ARNs in get_bedrock_model_id for invoke path

The invoke path (used by /v1/messages → Anthropic SDK / Claude Code) called
get_bedrock_model_id() which, when falling back to the raw model string, did
not strip the 'bedrock/' routing prefix and did not URL-encode ARNs.

For a model like:
  bedrock/arn:aws:bedrock:us-east-1:<ACCOUNT>:inference-profile/global.anthropic...

the URL built was:
  /model/bedrock/arn:aws:bedrock:…/invoke-with-response-stream  

Bedrock returned a JSON error body.  LiteLLM's AWSEventStreamDecoder passed
those bytes into botocore's EventStreamBuffer which expects binary event-stream
framing.  Checksum validation failed on the JSON prelude (0x223a7b22 == '":{"')
producing a misleading botocore.eventstream.ChecksumMismatch instead of the
actual Bedrock error.

Fix: strip 'bedrock/' (and 'invoke/') routing prefix from model string, then
URL-encode if the result is an ARN — matching what the converse path already
does in converse_handler.py.

Fixes: LIT-3274

* fix(bedrock): use strip_bedrock_routing_prefix to handle compound prefixes

Address greptile review: the original fix used a loop with break, so
bedrock/invoke/arn:... only stripped bedrock/ leaving invoke/arn:...
which is not an ARN → fell through to .replace('invoke/','',1) →
bare unencoded ARN → same malformed-URL bug.

strip_bedrock_routing_prefix() iterates without break, correctly
stripping bedrock/ then invoke/ in sequence. Also adds test case
for the compound-prefix scenario.
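
A sketch of the strip-then-encode behavior, assuming the two routing prefixes named above. The function names here are hypothetical and are not the real strip_bedrock_routing_prefix; the ARN in the usage note is made up.

```python
from urllib.parse import quote

ROUTING_PREFIXES = ("bedrock/", "invoke/")  # stripped in this order


def strip_routing_prefixes(model: str) -> str:
    """Strip each routing prefix in sequence, without breaking after the first hit."""
    for prefix in ROUTING_PREFIXES:
        if model.startswith(prefix):
            model = model[len(prefix):]
    return model


def model_id_for_url(model: str) -> str:
    """Return a model id safe to place in the /model/<id>/invoke path."""
    model_id = strip_routing_prefixes(model)
    if model_id.startswith("arn:"):
        # safe="" also encodes '/' and ':', which would otherwise split the path.
        return quote(model_id, safe="")
    return model_id
```

For example, `bedrock/invoke/arn:aws:bedrock:us-east-1:123456789012:inference-profile/example` loses both prefixes before encoding, which is the compound-prefix case an early break would miss.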

* style: apply black formatting to fix lint CI (LIT-3274)

---------

Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai>
Co-authored-by: LiteLLM Bot <bot@berri.ai>
2026-05-22 12:10:37 -07:00
milan-berri
9600fda2cc
fix(sagemaker): send native Cohere embed payload to Cohere SageMaker endpoints (#28613)
* fix(sagemaker): use Cohere embed payload for Marketplace endpoints

SageMaker embedding only special-cased Voyage; every other endpoint received
HuggingFace TGI `{"inputs": [...]}`. AWS Marketplace Cohere containers expect
the native Cohere embed payload (`texts`, `input_type`) and reject the HF
shape with `422 EmbedReqV2.inputs is of type string but should be of type
Object`.

Add `SagemakerCohereEmbeddingConfig` that reuses Bedrock/Cohere request and
response transforms, and route SageMaker endpoint names containing `cohere`
or a Cohere embed model fragment (`embed-multilingual`, `embed-english`,
`embed-v3`, `embed-v4`) to it. Supports `input_type`, `dimensions`, and
`encoding_format`. Voyage and HuggingFace SageMaker endpoints are unchanged.
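
The two payload shapes can be sketched like this. It is illustrative only: `build_embedding_payload` is a hypothetical helper, the marker list mirrors the fragments named above, and the Voyage branch is omitted.

```python
from typing import Any, Dict, List, Optional

COHERE_MARKERS = ("cohere", "embed-multilingual", "embed-english", "embed-v3", "embed-v4")


def is_cohere_endpoint(endpoint_name: str) -> bool:
    """Detect a Cohere embed endpoint from its name (assumed heuristic)."""
    lowered = endpoint_name.lower()
    return any(marker in lowered for marker in COHERE_MARKERS)


def build_embedding_payload(
    endpoint_name: str, texts: List[str], input_type: Optional[str] = None
) -> Dict[str, Any]:
    """Return the native Cohere shape for Cohere endpoints, else the HF TGI shape."""
    if is_cohere_endpoint(endpoint_name):
        payload: Dict[str, Any] = {"texts": texts}
        if input_type is not None:
            payload["input_type"] = input_type
        return payload
    return {"inputs": texts}
```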

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(sagemaker): simplify cohere detection and align with file conventions

- Detect Cohere SageMaker endpoints with a single `"cohere" in model.lower()`
  check, mirroring the existing Voyage branch instead of a separate helper
  function and marker constant.
- Drop instance caches of sub-configs; instantiate `BedrockCohereEmbeddingConfig`
  / `CohereEmbeddingConfig` per call to match the existing pattern in
  `BedrockCohereEmbeddingConfig._transform_request`.
- Match `SagemakerEmbeddingConfig`'s signatures, defaults, and `Any` typing for
  `logging_obj`; collapse the input-normalization helper inline.
- Inline `transform_embedding_response` input lookup; no behavior change.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(sagemaker): restore provider-supported embedding params after map

Cohere input_type is advertised in get_supported_openai_params but was
filtered out of non_default_params by OPENAI_EMBEDDING_PARAMS before
map_openai_params ran. Merge supported params from passed_params after
map (same path Greptile flagged). Handle input_type explicitly in
SagemakerCohereEmbeddingConfig.map_openai_params and add an integration
test through get_optional_params_embeddings.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(embeddings): only restore non-OpenAI supported params after map

The post-map restore loop must skip OPENAI_EMBEDDING_PARAMS so mapped
fields (e.g. dimensions -> output_dimension) are not duplicated under
their OpenAI names. Align SageMaker embedding import order with sibling
files and add a regression test for dimensions mapping.
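
A sketch of the post-map restore rule, under the assumption that the OpenAI parameter set and the mapping step look roughly like this. `restore_supported_params` is a hypothetical helper and the parameter tuple is an illustrative subset, not the real constant.

```python
from typing import Any, Dict, Iterable

# Illustrative subset; the real OPENAI_EMBEDDING_PARAMS may contain more names.
OPENAI_EMBEDDING_PARAMS = ("dimensions", "encoding_format", "user")


def restore_supported_params(
    mapped: Dict[str, Any],
    passed_params: Dict[str, Any],
    supported: Iterable[str],
) -> Dict[str, Any]:
    """Re-add provider-supported params that the OpenAI filter dropped before mapping."""
    result = dict(mapped)
    for name in supported:
        if name in OPENAI_EMBEDDING_PARAMS:
            # Already handled by the mapper (possibly under a new name).
            continue
        if name in passed_params and name not in result:
            result[name] = passed_params[name]
    return result
```

With `dimensions` already mapped to `output_dimension`, only the non-OpenAI `input_type` is restored, so the mapped field is not duplicated under its OpenAI name.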

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(sagemaker): avoid double post_call on Cohere embedding response

Greptile review on #28613 caught that `CohereEmbeddingConfig._transform_response`
calls `logging_obj.post_call` internally. The SageMaker embedding handler
already calls `post_call` once before invoking the transform, so the Cohere
SageMaker path fired callbacks, cost calculators, and log handlers twice
per request.

Extract the parsing body of `_transform_response` into
`_populate_embedding_response` (pure extract-method, no behavior change
for existing Cohere direct or Bedrock Cohere paths, which keep calling
`_transform_response`). Have `SagemakerCohereEmbeddingConfig` call the
new helper directly so it parses the response without re-logging.

Add a regression test asserting `logging_obj.post_call` is not invoked
by the SageMaker Cohere transform.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 12:00:42 -07:00
ryan-crabbe-berri
643989989f
chore(test): remove dead old Playwright e2e suite (#28632)
The Playwright suite under tests/proxy_admin_ui_tests/e2e_ui_tests/ is no
longer wired into CI (only test_*.py is globbed) and every active spec is
duplicated by ui/litellm-dashboard/e2e_tests/tests/ (login, auth redirect,
search users, internal user list). team_admin.spec.ts was entirely
commented out. Removing the directory plus its only-used-here playwright
config, package.json/lock, and utils/login.ts keeps the canonical suite
under ui/litellm-dashboard/e2e_tests/ as the single source of truth.
2026-05-22 11:29:17 -07:00
yuneng-jiang
f62ae93e13
test(proxy): behavior-pinning matrix for tier-2/3 key + team management endpoints (#28620)
* test(proxy): add create_scratch_actor harness helper

Adds create_scratch_actor() to the management behavior-suite conftest and
extends create_scratch_team() with team_member_permissions / models kwargs,
needed by the PR3 team-key-permission and team-model matrices. The new
helper mints a scratch-prefixed user + verification token (+ org
memberships), all reclaimed by the existing scratch-prefix teardown.

* test(proxy): pin /key block, unblock, health, aliases behavior

Adds behavior-pinning matrices for POST /key/block, POST /key/unblock,
POST /key/health, and GET /key/aliases. Pins that the management-route gate
401s ORG_ADMIN-role callers before _check_key_admin_access runs, the
block/unblock round-trip on the blocked column, missing-key 404, and the
_apply_non_admin_alias_scope visibility rules for /key/aliases.

* test(proxy): pin /key/bulk_update + /team/key/bulk_update behavior

Adds behavior-pinning matrices for POST /key/bulk_update (PROXY_ADMIN-only;
ORG_ADMIN stopped 401 at the route gate, INTERNAL_USER-role 403 at the
handler) and POST /team/key/bulk_update (team-member-permission gate keyed
on KEY_UPDATE). Pins batch semantics: empty/over-cap 400, per-key failure
isolation into failed_updates, all_keys_in_team broadcast, and no-keys 404.
Adds an optional key_alias arg to create_scratch_key for multi-key scenarios.

* test(proxy): pin /key SA-generate, v2-info, reset-spend behavior

Adds behavior-pinning matrices for POST /key/service-account/generate
(team-membership + team-member-permission gating; SA keys carry no user_id),
POST /v2/key/info (per-key _can_user_query_key_info silently drops invisible
keys), and POST /key/{key}/reset_spend (PROXY_ADMIN or team admin only;
missing key 404, reset-value 400). Pins that ORG_ADMIN-role callers are
stopped 401 at the management-route gate on the two non-info routes.

* test(proxy): close PR1/PR2 key-side deferred coverage gaps

Closes the four key-side gaps deferred from PR1/PR2:
- 404 on missing key for /key/update and /key/delete (not 401/403)
- denied /key/update leaves max_budget/tpm_limit/rpm_limit untouched
- /key/regenerate enforces litellm.upperbound_key_generate_params (#26340)
- /key/list key_alias substring vs exact (admin-only) + team_id filter,
  and a non-admin filtering a foreign team is 403

* test(proxy): pin /team block, unblock, available, filter/ui, members/me

Adds behavior-pinning matrices for POST /team/block + /team/unblock
(management-route gate fronts _verify_team_access; reachable only by
PROXY_ADMIN and an org admin of the team's own org), GET /team/available
(default empty path), GET /team/filter/ui (route-gated PROXY-ADMIN-only
despite the handler having no gate), and GET /team/{team_id}/members/me
(caller resolves its own membership; non-member 404, no-user_id key 400).

* test(proxy): pin /team model add/delete + permissions endpoints

Adds behavior-pinning matrices for POST /team/model/add + /team/model/delete
(route-gated PROXY-ADMIN-only; missing team 404), GET /team/permissions_list +
POST /team/permissions_update (self-managed; proxy/team/org admin pass), and
POST /team/permissions_bulk_update (PROXY_ADMIN-only). Pins the deliberate
divergence that the available-team self-join grants read access via
permissions_list but never write access via permissions_update.

* test(proxy): pin /team delete, bulk_member_add, v2/list, daily/activity

Adds behavior-pinning matrices for POST /team/delete (per-team
_verify_team_access; batch aborts whole on a missing id), POST
/team/bulk_member_add (route-gated PROXY-ADMIN-only; empty/over-cap 400),
GET /v2/team/list (_enforce_list_team_v2_access — bare query 401s regular
users, org-scoped for org admins) and GET /team/daily/activity (non-member
team_ids filter 404, the VERIA-43 fix).

* test(proxy): add route-coverage gate + close team org-relocation gap

Adds test_route_coverage.py (PR3.M1): parses every @router route literal
from the two management-endpoint source files and asserts each is exercised
by >=1 behavior-suite scenario — a permanent regression guard for future
routes. Closes the last PR1/PR2 deferred gap: the /team/update org-relocation
allowed branch, exercised by a dual-org-admin minted via create_scratch_actor.
test_team_model uses literal route URLs so the coverage parser resolves them.

* test(proxy): bound plain route params to one path segment in coverage gate

Plain path params ({team_id}) now compile to [^/?]+ instead of [^?]+, so a
parameter cannot span '/'. Starlette ':path' params still match across '/'.
Keeps the route-coverage guard from falsely reporting a future multi-segment
route as covered. All 37 routes remain covered.
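
A rough sketch of the parameter compilation rule, assuming routes are plain strings with `{name}` and `{name:path}` placeholders. `route_to_regex` is a hypothetical helper, not the function in test_route_coverage.py.

```python
import re


def route_to_regex(route: str) -> "re.Pattern":
    """Compile a route template into a regex for matching literal URLs."""
    parts = []
    pos = 0
    for m in re.finditer(r"\{(\w+)(:path)?\}", route):
        parts.append(re.escape(route[pos:m.start()]))
        # ':path' params may span '/', plain params stay within one segment.
        parts.append(r"[^?]+" if m.group(2) else r"[^/?]+")
        pos = m.end()
    parts.append(re.escape(route[pos:]))
    return re.compile("^" + "".join(parts) + "$")


plain = route_to_regex("/team/{team_id}/members/me")
greedy = route_to_regex("/files/{file_path:path}")
```

With this rule, `/team/abc/members/me` matches the plain route but `/team/abc/extra/members/me` does not.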
2026-05-22 11:24:41 -07:00
yuneng-jiang
985574b6be
fix(check_licenses): read PEP 639 license-expression metadata (#28529)
The dependency license checker only read the legacy free-text
`info.license` field from PyPI. Packages that adopt PEP 639 publish
their license as an SPDX expression in `info.license_expression` and
leave the legacy field null, so the checker reported "Unknown license"
and failed CI for every newly-bumped PEP 639 dependency.

`get_package_license_from_pypi` now resolves the license in order:
`license_expression`, then legacy `license`, then the
`License :: OSI Approved :: ...` trove classifiers.

`is_license_acceptable` splits compound SPDX expressions on the
uppercase OR/AND operators (case-sensitive, so the lowercase
`-or-later` inside an identifier is not mistaken for an operator) and
strips `WITH <exception>` suffixes, requiring every component to be
acceptable. Free-text license blobs are detected and fall back to the
original whole-string matching.
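
The splitting rule might look like the sketch below. The allowlist and the helper names `spdx_components` and `is_expression_acceptable` are illustrative assumptions, not the checker's actual code, and free-text detection is omitted.

```python
import re
from typing import List

# Illustrative allowlist only; the real one is configured elsewhere.
ACCEPTABLE = {"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-2.0-or-later"}


def spdx_components(expression: str) -> List[str]:
    """Split an SPDX expression into bare license identifiers."""
    # Case-sensitive split, so '-or-later' is never treated as an operator.
    parts = re.split(r"\s+(?:OR|AND)\s+", expression)
    cleaned = []
    for part in parts:
        part = part.strip().strip("()").strip()
        # Drop a trailing 'WITH <exception>' suffix.
        part = re.sub(r"\s+WITH\s+\S+$", "", part)
        cleaned.append(part)
    return cleaned


def is_expression_acceptable(expression: str) -> bool:
    # Every component must be acceptable, as the message above describes.
    return all(c in ACCEPTABLE for c in spdx_components(expression))
```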

The `black` and `pydantic-settings` entries in liccheck.ini that
existed solely to work around this now resolve correctly on their own
and have been removed.
2026-05-22 11:22:38 -07:00
Mateo Wang
b0b25ae4b9
Include team alias in CLI JWT token (#28621)
2026-05-22 10:40:59 -07:00
Sameer Kankute
e9f0eddbd1
Litellm oss staging 2 (#28582)
* fix(anthropic): handle empty streaming tool calls (#28549)

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>

* [Feature][Bug Fix] Decouple Azure OpenAI Deployment ID from model name via base_model to fix gpt5 model routing (#28490)

* feat(azure): decouple deployment ID from model name via base_model

Azure OpenAI deployments have arbitrary names (deployment IDs) that may
not match the underlying model. Previously, model-type detection
(o-series, gpt-5, etc.) relied on substring matching against the
deployment name, causing misrouted configs and rejected params when
deployment names were non-standard (e.g. 'my-deployment-id' for gpt-5.2).

This change extends the existing base_model field to drive model-type
detection, config selection, supported param resolution, and param
mapping throughout the Azure call path:

- _get_azure_config() uses base_model for is_o_series/is_gpt_5 checks
- get_provider_chat_config() threads base_model for Azure
- get_supported_openai_params() accepts and uses base_model
- get_optional_params() accepts base_model and passes it to all Azure
  config method calls (get_supported_openai_params, map_openai_params)
- azure.py completion handler uses base_model for GPT-5 detection
- Config internal methods (e.g. is_model_gpt_5_2_model) now receive
  base_model so features like logprobs are correctly enabled

Fully backward compatible - when base_model is unset, behavior is
identical. Existing o_series/ and gpt5_series/ prefix workarounds
continue to work.

Usage in proxy config:
  model_list:
    - model_name: my-gpt5
      litellm_params:
        model: azure/my-deployment-id
      model_info:
        base_model: azure/gpt-5.2

Fixes: non-standard deployment names like 'prefix-gpt-5.2' rejecting
logprobs/top_logprobs despite the underlying model supporting them.
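
A simplified sketch of the detection idea, assuming substring checks on the model name. `detection_model` and `is_gpt_5` are hypothetical helpers; the real logic is spread across the functions listed above.

```python
from typing import Optional


def detection_model(model: str, base_model: Optional[str] = None) -> str:
    """Prefer base_model for model-type detection; fall back to the deployment name."""
    chosen = base_model or model
    # Drop a leading provider prefix such as "azure/" if present.
    return chosen.split("/", 1)[-1]


def is_gpt_5(model: str, base_model: Optional[str] = None) -> bool:
    return "gpt-5" in detection_model(model, base_model)


without_base = is_gpt_5("azure/my-deployment-id")
with_base = is_gpt_5("azure/my-deployment-id", base_model="azure/gpt-5.2")
```

When `base_model` is unset, the sketch falls back to the deployment name, which mirrors the backward-compatible behavior described above.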

* Addressing Greptile comments.

* gemini-3.1-flash-lite pricing (#27933)

* feat(model_prices): add gemini-3.1-flash-lite pricing with standard/batch/flex/priority tiers

* fix pricing

* add service tier

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>

* fix(openai-responses): strip Anthropic cache_control from Responses API requests (#28431)

Squash-merged by litellm-agent from cwang-otto's PR.

* Treat None litellm_provider as wildcard in _check_provider_match (#28523)

Squash-merged by litellm-agent from adityasingh2400's PR.

* fix greptile

* fix: use _azure_detection_model in default Azure branch of get_supported_openai_params

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(openai-responses): strip cache_control on compact endpoint as well

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: Felipe Garé <90070734+FelipeRodriguesGare@users.noreply.github.com>
Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: withomasmicrosoft <withomas@microsoft.com>
Co-authored-by: mubashir1osmani <mubashir.osmani777@gmail.com>
Co-authored-by: cwang-otto <chengxuan.wang@ottotheagent.com>
Co-authored-by: Aditya Singh <60082699+adityasingh2400@users.noreply.github.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
2026-05-22 10:04:23 -07:00
Sameer Kankute
21a21e01f7
fix(responses): use OpenAI SSEDecoder for Responses API streaming (#28566)
* fix(responses): use OpenAI SSEDecoder for Responses API streaming

httpx aiter_lines() uses str.splitlines(), which splits on U+2028 inside
JSON payloads and silently drops response.completed (no spend log). Use
openai._streaming.SSEDecoder (bytes.splitlines before decode) instead.
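
The difference can be reproduced with the standard library alone. The sketch below does not use `SSEDecoder`; it only contrasts `str.splitlines()` with `bytes.splitlines()` on a made-up event payload.

```python
# A made-up SSE event whose JSON payload contains U+2028 (LINE SEPARATOR).
payload = '{"text": "line one\u2028line two"}'
event = f"data: {payload}\n\n"

# str.splitlines() treats U+2028 as a line boundary and cuts the JSON in two.
str_lines = event.splitlines()

# bytes.splitlines() only splits on \n, \r, and \r\n, so the payload survives.
byte_lines = [line.decode("utf-8") for line in event.encode("utf-8").splitlines()]
```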

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(responses): drop redundant SSE prefix strip after SSEDecoder switch

SSEDecoder already strips the 'data:' field prefix from each event, so the
extra call to _strip_sse_data_from_chunk on sse.data was redundant and could
incorrectly mangle payloads whose actual content starts with 'data:'.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
2026-05-22 10:03:36 -07:00
Sameer Kankute
50a3f10a92
feat(proxy): persist allowlisted OIDC claims in CLI SSO poll (#28463)
* feat(proxy): persist allowlisted OIDC claims in CLI SSO poll

Map CLI_SSO_CLAIM_MAP sources into user metadata and return scalar
attribution_metadata from /sso/cli/poll. Build SSOUserDefinedValues in
cli_sso_callback so first-time CLI logins can upsert users. Add mock OIDC
scripts and tests for claim extraction and poll exposure.

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(proxy): document CLI SSO attribution_metadata in client README

Co-authored-by: Cursor <cursoragent@cursor.com>

* Delete scripts/mock_oidc_server_for_cli_sso.py

* Delete scripts/test_cli_sso_claims_e2e.py

* fix(ui_sso): preserve claim types and avoid metadata. prefix stripping

- Replace _update_dictionary with a local recursive merge so string
  OIDC claim values that happen to look numeric are not silently coerced
  to int/float when persisting CLI SSO attribution metadata.
- Use a local dot-path resolver in _extract_sso_claim_value so that
  source claim paths beginning with 'metadata.' are not silently stripped
  by get_nested_value (which is designed for LiteLLM JWT metadata, not
  arbitrary OIDC claims).

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* Remove redundant metadata. prefix strip in _set_nested_metadata_value

The _parse_cli_sso_claim_map already strips the metadata. prefix from
dest keys before reaching the setter. The duplicate strip in
_set_nested_metadata_value was a no-op in normal flow but could
mis-place values for dest keys like metadata.metadata.foo.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* Fix greptile

* Fix ruff

* Move CLI SSO user defined values build inside try/except for consistent error handling

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(proxy): enforce restricted SSO group on CLI SSO callback

Apply verify_user_in_restricted_sso_group before CLI session completion
and user upsert, matching the UI SSO path. Re-raise ProxyException so
restricted-group denials return 403 instead of 500.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(proxy): replace recursive CLI SSO metadata helpers with iterative merge

Use stack-based flatten/merge to satisfy recursive_detector CI. Fix mypy
types for UserApiKeyCache and user_id on CLI SSO session completion.
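
A stack-based merge of this kind could look like the following sketch. `merge_nested` is a hypothetical name, and the real helper may handle more cases.

```python
def merge_nested(dest: dict, src: dict) -> dict:
    """Iteratively deep-merge src into dest, leaving value types unchanged."""
    stack = [(dest, src)]
    while stack:
        target, source = stack.pop()
        for key, value in source.items():
            if isinstance(value, dict) and isinstance(target.get(key), dict):
                # Descend without recursion by queuing the nested pair.
                stack.append((target[key], value))
            else:
                # Assign as-is: "00123" stays a string, never coerced to int.
                target[key] = value
    return dest


existing = {"org": {"team": "a"}, "x": 1}
merge_nested(existing, {"org": {"employee_id": "00123"}})
```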

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: resolve nested CustomOpenID extra_fields in CLI SSO claim extraction

When GENERIC_USER_EXTRA_ATTRIBUTES captures a parent object (e.g. org_info),
extra_fields stores it as {"org_info": {"department": "..."}}. A CLI claim
map entry using a dotted path like org_info.department would silently fail
because the lookup only checked the exact flat key. Fall back to dotted-path
resolution on extra_fields before model_dump().
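
The fallback order can be sketched as below, assuming `extra_fields` is a plain dict. `resolve_dotted` and `lookup_claim` are hypothetical names, and the final `model_dump()` fallback is omitted.

```python
_MISSING = object()


def resolve_dotted(data, path):
    """Walk a dotted path through nested dicts; return _MISSING if any hop fails."""
    current = data
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def lookup_claim(extra_fields, path):
    # Exact flat key first, then dotted-path resolution.
    if path in extra_fields:
        return extra_fields[path]
    value = resolve_dotted(extra_fields, path)
    return None if value is _MISSING else value


nested = lookup_claim({"org_info": {"department": "eng"}}, "org_info.department")
```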

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(sso): update CLI SSO test for new received_response kwarg and remove redundant 'token' secret fragment

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
2026-05-22 09:58:50 -07:00
Sameer Kankute
ef36e89638
feat(mcp): Add tool call and tool list support via UI for OAuth MCP servers (#28454)
* feat(mcp): cache OAuth token client-side so Tools tab loads without re-auth

After a user creates an OAuth MCP server and completes the authorization
flow, the resulting access token is now stored in sessionStorage keyed by
server_id.  The MCP Tools tab reads this cached token and includes it as
an MCP auth header when listing and invoking tools, so the user never sees
an empty tool list.  When the session ends (tab close / new browser) an
Authorize button re-triggers the flow without leaving the Tools screen.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix(ui/mcp): surface listMCPTools 401 errors so auth gate reappears

listMCPTools previously swallowed all errors (including HTTP 401) by
returning a synthetic { tools: [], error: 'network_error', ... } payload.
That made the useQuery retry-on-401 guard and mcpToolsError dead code,
so expired OAuth tokens never re-triggered the auth gate.

- Throw an enhanced Error with .status attached on non-2xx responses
  (still preserves the legacy shape for true network failures so the
  caller can render a generic message without crashing).
- Clear the cached OAuth session token when the tools query fails with
  401, mirroring callMCPTool's onError handler so the Authorize button
  is shown again.
- Surface mcpToolsError in the existing error banner.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp-tools): stable onSuccess + reuse parsed flow state

- Pass stable setOauthToken setter directly as onSuccess to avoid
  recreating useToolsOAuthFlow's resumeOAuthFlow on every render.
- Reuse the already-parsed FLOW_STATE_KEY value (peeked) instead of
  re-reading and re-parsing sessionStorage in resumeOAuthFlow.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(ui/mcp): restore listMCPTools never-throws contract

The previous fix made listMCPTools throw on HTTP errors while still
returning a synthetic object on network errors. This inconsistent
contract broke existing callers (MCPToolPermissions, MCPAppsPanel,
MCPConnectPicker) which inspect result.error / result.message and
expect the function to never throw.

- Return a normalized { tools: [], error, message, status, ... }
  object on HTTP errors (instead of throwing) so all callers see a
  consistent shape and the user-visible error text from
  result.message is preserved.
- Convert the returned error object into a thrown Error inside the
  one caller that needs it — the useQuery in mcp_tools.tsx — so the
  401 retry/onError handlers still trigger and clear the cached
  OAuth token.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix greptile

* fix(mcp): align OAuth header alias lookup with dashboard sanitization

Backend auth header resolution now matches x-mcp-{alias} keys produced by
the dashboard sanitizer, and the Tools tab re-syncs OAuth tokens when
serverId changes.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mcp): widen auth header lookup types for list_tools

Accept legacy str | dict server auth maps and annotate list_tools
server_auth_header as Union[str, dict] for mypy.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(ui): extract shared buildCallbackUrl/clearStorage for MCP OAuth hooks

Hoist the duplicate buildCallbackUrl and clearStorage helpers out of
useToolsOAuthFlow and useUserMcpOAuthFlow into a new shared module
src/hooks/mcpOAuthUtils.ts so the two hooks cannot drift if the URL
construction or storage cleanup logic needs to change.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(ui): don't gate M2M OAuth MCP servers behind interactive authorize

M2M (client_credentials) OAuth servers share auth_type="oauth2" with
interactive PKCE servers, but the backend fetches their token internally
and they typically lack a user authorization endpoint. Gating tool
listing on them rendered an Authorize button that would fail or redirect
incorrectly. Detect M2M via the presence of token_url (matching the
existing heuristic in mcp_server_edit.tsx) and skip the auth gate.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(ui/mcp): return error shape when listMCPTools JSON parse fails

Restore the never-throws contract when response.json() fails on a 2xx
body so callers do not receive null and crash on result.tools.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
2026-05-22 09:04:04 -07:00
Sameer Kankute
7a93cceb9f
Add error_description and hint for OAuth flows (#28471)
* Add error_description and hint for OAuth flows

* Fix tests

* fix(mcp-oauth): improve redirect_uri errors without leaking internal config

Use NoReturn on _oauth_invalid_request, structured errors for BYOK loopback
validation, and refactor validate_trusted_redirect_uri to satisfy PLR0915.
Keep PROXY_BASE_URL and raw proxy_base_url in server logs only, not in the
HTTP 400 body returned to unauthenticated callers.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mcp-oauth): stop leaking internal proxy origin in redirect_uri 400 body

The trusted-redirect-uri rejection helper included the proxy's
resolved scheme/host/port (e.g. http://litellm-internal:4000) in both
the error_description and as a top-level proxy_origin field. Since
the OAuth /authorize endpoint is unauthenticated, any caller could
probe with a crafted redirect_uri and enumerate the internal network
topology behind a reverse proxy.

Keep full diagnostic detail in the server-side warning log
(including the computed proxy base) but omit proxy-side values from
the HTTP 400 body. Also drop the duplicated origin computation in
_raise_trusted_redirect_uri_rejected now that those values are no
longer needed by the response.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp-oauth): remove dead userinfo check in redirect_uri validation

The first check combined missing netloc with userinfo presence, making
the second userinfo-only check unreachable. Split into two distinct
checks so each error message reflects the actual failure mode.
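
The two separated checks might be sketched like this with `urllib.parse`. `redirect_uri_problem` and its message strings are hypothetical, not the proxy's actual validator or error text.

```python
from urllib.parse import urlparse


def redirect_uri_problem(redirect_uri: str):
    """Return a description of the first problem found, or None if both checks pass."""
    parsed = urlparse(redirect_uri)
    # Check 1: the URI must carry a host component.
    if not parsed.netloc:
        return "redirect_uri must be an absolute URI with a host"
    # Check 2: userinfo (user[:password]@) is rejected on its own terms.
    if parsed.username is not None or parsed.password is not None:
        return "redirect_uri must not contain userinfo"
    return None


ok = redirect_uri_problem("https://example.com/callback")
no_host = redirect_uri_problem("/callback")
with_userinfo = redirect_uri_problem("https://user:pw@example.com/callback")
```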

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
2026-05-22 09:00:39 -07:00
Sameer Kankute
d96e26064f
Fix conflicts and UI (#28477)
2026-05-22 08:55:28 -07:00
harish-berri
d04373f4ce
Add granian as an ASGI-compliant web server. Provides better throughput stability (#26027)
* Add granian as an ASGI-compliant web server. Provides better stability, with a 10-20 RPS improvement under standard load-test conditions.

TODO: Verify poetry lock details and add locust numbers to PR

* Update granian version in license_cache.json and pyproject.toml to 2.5.7

* Enhance proxy CLI tests by adding SSL initialization checks for Granian server. Remove Python version skip conditions and implement tests to ensure SSL certificate and key are required for server initialization.

* update uv lock to fix granian import error
2026-05-21 19:08:37 -07:00
ryan-crabbe-berri
07bcd2c19e
test(e2e): forward LITELLM_LICENSE to UI e2e proxy (#28398)
* test(e2e): forward LITELLM_LICENSE to UI e2e proxy

The UI e2e job ran without LITELLM_LICENSE, so premium_user was always
false in the issued login JWT and premium-gated UI surfaces (Team-BYOK
Model switch, etc.) couldn't be driven through the UI. Forward the env
var from run_e2e.sh and the CircleCI e2e_ui_testing job, and add a
sanity test that decodes the admin storage state token and asserts
premium_user=true so the wiring fails loudly if it ever regresses.
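
The actual sanity test is a Playwright spec. The same check can be sketched in Python, assuming a standard three-part JWT; `decode_jwt_payload` and the sample token are hypothetical, and no signature verification is performed.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying the signature."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def _b64url(obj) -> str:
    # Helper used only to build a sample token for this sketch.
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()


sample_token = ".".join(
    [_b64url({"alg": "none"}), _b64url({"premium_user": True}), "signature"]
)
claims = decode_jwt_payload(sample_token)
```

An assertion on `claims["premium_user"]` then fails loudly if the license is not forwarded.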

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Update ui/litellm-dashboard/e2e_tests/tests/proxy-admin/license.spec.ts

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2026-05-21 18:17:03 -07:00
yuneng-jiang
9fcd424318
chore(deps): bump deps (#28528)
* build(deps): bump next from 16.2.4 to 16.2.6 in /ui/litellm-dashboard (#27665)

Bumps [next](https://github.com/vercel/next.js) from 16.2.4 to 16.2.6.
- [Release notes](https://github.com/vercel/next.js/releases)
- [Changelog](https://github.com/vercel/next.js/blob/canary/release.js)
- [Commits](https://github.com/vercel/next.js/compare/v16.2.4...v16.2.6)

---
updated-dependencies:
- dependency-name: next
  dependency-version: 16.2.6
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump protobufjs in /tests/pass_through_tests (#28296)

Bumps [protobufjs](https://github.com/protobufjs/protobuf.js) from 7.5.6 to 7.6.0.
- [Release notes](https://github.com/protobufjs/protobuf.js/releases)
- [Changelog](https://github.com/protobufjs/protobuf.js/blob/protobufjs-v7.6.0/CHANGELOG.md)
- [Commits](https://github.com/protobufjs/protobuf.js/compare/protobufjs-v7.5.6...protobufjs-v7.6.0)

---
updated-dependencies:
- dependency-name: protobufjs
  dependency-version: 7.6.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump ws from 8.20.0 to 8.20.1 in /tests/pass_through_tests (#28303)

Bumps [ws](https://github.com/websockets/ws) from 8.20.0 to 8.20.1.
- [Release notes](https://github.com/websockets/ws/releases)
- [Commits](https://github.com/websockets/ws/compare/8.20.0...8.20.1)

---
updated-dependencies:
- dependency-name: ws
  dependency-version: 8.20.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-22 00:42:21 +00:00