Mateo Wang 13924fa1d6
feat: standardize rate limit errors with category, rate_limit_type, model, and llm_provider fields (#27687)
* feat(exceptions): add RateLimitErrorCategory + headers/detail fields on RateLimitError

LiteLLM previously surfaced rate-limit conditions through several unrelated
error classes (RateLimitError, FastAPI HTTPException(429), BaseLLMException).
This commit adds the data model needed to consolidate them under a single
class:

* RateLimitErrorCategory enum exposing four categorical values
  (vendor_rate_limit, vendor_batch_rate_limit, litellm_rate_limit,
  litellm_batch_rate_limit) so callers can switch on the rate-limit source.
* New optional fields on RateLimitError:
  - category (defaults to vendor_rate_limit, preserving today's behavior for
    every existing call site in exception_mapping_utils);
  - headers (preserves retry-after / rate_limit_type / reset_at across the
    proxy boundary instead of dropping them on the floor);
  - detail (mirrors FastAPI HTTPException.detail so the same instance can be
    serialized through both paths).

litellm.RateLimitErrorCategory is re-exported at the package root to match
the existing exception-export pattern.
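
A minimal sketch of the data model above, using stand-in classes rather than the real litellm ones. The four enum values come from this commit; the two VENDOR_* member names, the class name SketchRateLimitError, and its constructor signature are assumptions made for illustration.

```python
from enum import Enum
from typing import Dict, Optional, Union


class RateLimitErrorCategory(str, Enum):
    # Values are taken from the commit message; the two VENDOR_* member
    # names are assumed by analogy with the LITELLM_* ones.
    VENDOR_RATE_LIMIT = "vendor_rate_limit"
    VENDOR_BATCH_RATE_LIMIT = "vendor_batch_rate_limit"
    LITELLM_RATE_LIMIT = "litellm_rate_limit"
    LITELLM_BATCH_RATE_LIMIT = "litellm_batch_rate_limit"


class SketchRateLimitError(Exception):
    """Hypothetical stand-in for litellm.RateLimitError."""

    def __init__(
        self,
        message: str,
        category: Union[str, RateLimitErrorCategory] = RateLimitErrorCategory.VENDOR_RATE_LIMIT,
        headers: Optional[Dict[str, str]] = None,
        detail: Optional[str] = None,
    ) -> None:
        super().__init__(message)
        self.message = message
        # Normalize to a plain string so callers can compare and serialize it.
        self.category = RateLimitErrorCategory(category).value
        self.headers = headers
        self.detail = detail


def source_of(exc: SketchRateLimitError) -> str:
    """Switch on the rate-limit source via the category field."""
    if exc.category.startswith("vendor_"):
        return "upstream provider"
    return "litellm proxy"
```

Omitting category yields vendor_rate_limit, which is how existing call sites would keep their current behavior under this sketch.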

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(proxy): add ProxyRateLimitError unifying RateLimitError + HTTPException

Adds a single proxy-side error class that subclasses BOTH
litellm.exceptions.RateLimitError AND fastapi.HTTPException via cooperative
multiple inheritance.

Why both bases:
* Subclassing RateLimitError lets user code catch every rate-limit source
  with one 'except RateLimitError' and switch on the new .category field.
* Subclassing HTTPException keeps every existing FastAPI plumbing path (the
  isinstance(e, HTTPException) branches in proxy_server.py route handlers,
  FastAPI's own dispatcher, and tests asserting pytest.raises(HTTPException))
  working without modification, and preserves retry-after / rate_limit_type /
  reset_at headers on the wire.

The class declaration order is (HTTPException, RateLimitError) so the MRO
puts HTTPException's no-super-call __init__ ahead of openai's cooperative
__init__ chain — preventing openai.APIError.super().__init__(message) from
landing in HTTPException.__init__(status_code=message).
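
The effect of the declaration order can be reproduced with stand-in classes. This is only a sketch: FakeHTTPException, FakeAPIError, FakeRateLimitError, GoodOrder, and BadOrder are hypothetical names that mimic the shape described above, not the real fastapi or openai classes.

```python
class FakeHTTPException(Exception):
    """Stand-in whose __init__ does not call super().__init__()."""

    def __init__(self, status_code, detail=None):
        self.status_code = status_code
        self.detail = detail


class FakeAPIError(Exception):
    """Stand-in for a cooperative base that forwards message up the MRO."""

    def __init__(self, message):
        super().__init__(message)  # goes to the NEXT class in the instance's MRO
        self.message = message


class FakeRateLimitError(FakeAPIError):
    pass


class GoodOrder(FakeHTTPException, FakeRateLimitError):
    def __init__(self, detail):
        FakeHTTPException.__init__(self, status_code=429, detail=detail)
        # FakeAPIError's super() call lands in Exception.__init__, which is harmless.
        FakeRateLimitError.__init__(self, detail)


class BadOrder(FakeRateLimitError, FakeHTTPException):
    def __init__(self, detail):
        FakeHTTPException.__init__(self, status_code=429, detail=detail)
        # FakeAPIError's super() call now lands in FakeHTTPException.__init__,
        # so the message is bound to status_code and detail is reset to None.
        FakeRateLimitError.__init__(self, detail)


good = GoodOrder("too many requests")
bad = BadOrder("too many requests")
```

With GoodOrder the status code stays 429. With BadOrder the cooperative chain overwrites status_code with the message string, which is the failure mode the chosen order avoids.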

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* refactor(proxy/hooks): raise ProxyRateLimitError from budget + iteration limiters

Replaces three bare HTTPException(status_code=429, ...) call sites with
ProxyRateLimitError, which is both a RateLimitError (catchable by category)
and an HTTPException (preserves existing FastAPI serialization). Drops the
now-unused HTTPException import in the iteration / per-session limiters.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* refactor(proxy/hooks): raise ProxyRateLimitError from parallel-request limiters

Replaces HTTPException(status_code=429, ...) call sites in the v1 and v3
parallel-request limiters (key/team/user/model/customer rate limits) with
ProxyRateLimitError. Updates the raise_rate_limit_error helper's return type
annotation accordingly.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* refactor(proxy/hooks): raise ProxyRateLimitError from dynamic rate limiters

Replaces HTTPException(status_code=429, ...) call sites in the v1 and v3
dynamic rate limiters (project-level TPM/RPM allocation, model-saturation
checks, priority-based limits, fail-closed guards) with ProxyRateLimitError.
The v3 limiter still imports HTTPException for an unrelated bare 'except
HTTPException:' branch.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* refactor(proxy/hooks): raise ProxyRateLimitError from batch rate limiter

Replaces HTTPException(status_code=429, ...) in batch_rate_limiter._raise_rate_limit_error
with ProxyRateLimitError tagged as RateLimitErrorCategory.LITELLM_BATCH_RATE_LIMIT
so users can distinguish batch-level throttling (which counts requests/tokens
across an uploaded batch input file before submission) from the generic
key/team/user RPM/TPM limiter.

The HTTPException import is retained because the same module raises
HTTPException for unrelated 403/IO error paths.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(rate-limit): pin down unified rate-limit error contract

Adds a dedicated test module covering the new RateLimitErrorCategory enum,
RateLimitError.category default + override behavior, ProxyRateLimitError's
dual nature (RateLimitError + HTTPException), and a parametrized regression
guard that asserts every proxy hook module imports the unified class.

The regression guard catches the failure mode the refactor is designed to
prevent: someone re-introducing a bare HTTPException(status_code=429, ...)
in one of the hook modules instead of going through ProxyRateLimitError.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(logging): expose rate-limit category via StandardLoggingPayload

Adds an optional 'error_rate_limit_category' field to
StandardLoggingPayloadErrorInformation, populated from the unified
RateLimitError.category attribute (introduced in the previous commits on
this branch).

Why: the .category attribute is reachable off the raw exception today via
getattr(e, 'category', None), but the structured contract that downstream
custom callbacks / loggers / spend log writers consume is the
StandardLoggingPayload. Without this field, a user building custom
rate-limit metrics on top of callback data has to special-case the raw
exception object — which defeats the purpose of the StandardLoggingPayload
abstraction.

The field is None for non-rate-limit exceptions (so consumers can read it
unconditionally without isinstance checks) and is one of the
RateLimitErrorCategory string values otherwise.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(rate-limit): assert StandardLoggingPayload carries the category

Five tests covering: vendor default, explicit litellm_rate_limit and
litellm_batch_rate_limit values, None for non-rate-limit exceptions, and
None when no exception is provided. Pins down the contract that custom
callbacks can read 'error_information.error_rate_limit_category' off the
StandardLoggingPayload to drive custom rate-limit metrics without ever
reaching for the raw exception.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(types): silence mypy [misc] on intentional dual-base attr overlap

mypy emits two [misc] errors on the ProxyRateLimitError class line because
its two bases declare overlapping attributes with related-but-not-identical
annotations:

* status_code: int on starlette HTTPException vs. Literal[429] on openai's
  RateLimitError (every openai status-error subclass narrows it the same
  way and silences pyright with the same convention).
* headers: Mapping[str, str] | None on HTTPException vs. our Optional[
  Dict[str, str]] (the proxy hooks always carry a stringified dict).

Both narrowings are intentional and enforced at construction time. Add a
type: ignore[misc] with an inline explanation rather than relax the
annotations on the parent or change the wire-format guarantees.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(rate-limit): add direct hook-invocation tests to lift patch coverage

Adds six end-to-end tests that drive each refactored hook past its
limit and assert the unified ProxyRateLimitError is raised with the
correct category and dual-base shape. Complements the
import-shape-only parametrized guard above by actually executing the
new 'raise ProxyRateLimitError(...)' lines so codecov's patch coverage
sees them as hit.

Hooks covered (one test each):
* parallel_request_limiter v1 — direct call to raise_rate_limit_error()
* parallel_request_limiter v3 — direct call to _handle_rate_limit_error
  with a fabricated OVER_LIMIT response
* max_iterations_limiter — full async_pre_call_hook with mocked agent
  registry, second call exceeds budget=1
* max_budget_limiter — async_pre_call_hook with mocked get_current_spend
* dynamic_rate_limiter v1 — async_pre_call_hook with mocked
  check_available_usage forcing available_tpm == 0
* batch_rate_limiter — direct _raise_rate_limit_error call, asserts
  category is the batch-specific LITELLM_BATCH_RATE_LIMIT (not the
  generic LITELLM_RATE_LIMIT)

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix: guard rate_limit_category extraction with isinstance check

* test(rate-limit): cover remaining hook raise sites for codecov

Adds five more direct hook-invocation tests so every PR-touched line
in the proxy hooks is exercised by tests in tests/test_litellm/, which
codecov measures:

* parallel_request_limiter v1 — check_key_in_limits inline raise
  (the second raise site, separate from the raise_rate_limit_error
  helper covered earlier)
* dynamic_rate_limiter v1 — RPM raise branch (TPM branch was already
  covered)
* dynamic_rate_limiter v3 — parametrized over all three raise sites:
  model_saturation_check, priority_model, and the fail-closed
  fallback for an unrecognized descriptor_key
* max_budget_per_session_limiter — full async_pre_call_hook with a
  mocked agent registry and over-budget cached spend

All 42 tests in test_rate_limit_error_unification.py now pass and
together exercise every changed import + raise line across the eight
refactored proxy hooks.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix: use computed error_message in ProxyRateLimitError detail

* fix(parallel-request-limiter): drop None from detail; annotate raise_rate_limit_error as NoReturn

The v1 'raise_rate_limit_error' helper built an unused 'error_message'
variable and then assembled the actual 'detail' via an f-string that
interpolated 'additional_details' verbatim, producing
'Max parallel request limit reached None' when invoked without
arguments (flagged by code review).

Fix the helper to:
- use the constructed 'error_message' as the detail
- annotate the helper as NoReturn since it always raises
- drop the redundant 'raise'/'return' at the two call sites

Add two regression tests covering both the with- and without-
additional_details paths.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy/hooks): drop literal 'None' from raise_rate_limit_error detail

The v1 parallel_request_limiter's raise_rate_limit_error helper has a
long-standing bug: it computes a None-guarded 'error_message' string but
then ignores it and emits an f-string that interpolates the raw
'additional_details' arg. Callers that pass no argument get
'Max parallel request limit reached None' as the user-facing detail.

This commit:
* wires error_message into the detail kwarg so the None-guard actually
  applies and operators see a clean message;
* changes the return-type annotation from ProxyRateLimitError to NoReturn
  (the function always raises) so type-checkers know callers after this
  invocation are unreachable.
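
The bug and the fix can be sketched side by side. Everything here is a simplified assumption: SketchProxyRateLimitError stands in for ProxyRateLimitError, and the way additional_details is appended to the base message is a guess at the real formatting.

```python
from typing import Callable, NoReturn, Optional


class SketchProxyRateLimitError(Exception):
    """Hypothetical stand-in for ProxyRateLimitError."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def raise_rate_limit_error_buggy(additional_details: Optional[str] = None) -> NoReturn:
    error_message = "Max parallel request limit reached"
    if additional_details is not None:
        error_message = error_message + " " + additional_details
    # Bug: the guarded error_message is ignored and the raw argument is interpolated.
    raise SketchProxyRateLimitError(
        429, f"Max parallel request limit reached {additional_details}"
    )


def raise_rate_limit_error_fixed(additional_details: Optional[str] = None) -> NoReturn:
    error_message = "Max parallel request limit reached"
    if additional_details is not None:
        error_message = error_message + " " + additional_details
    # Fix: the None-guarded message becomes the detail.
    raise SketchProxyRateLimitError(429, error_message)


def detail_of(helper: Callable[..., NoReturn], *args: str) -> str:
    """Call a helper that always raises and return the detail it produced."""
    try:
        helper(*args)
    except SketchProxyRateLimitError as exc:
        return exc.detail
    raise AssertionError("helper did not raise")
```

Annotating the helper as NoReturn also lets a type checker flag any code a caller places after the call as unreachable.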

Greptile P1 + P2 review feedback on PR #27687.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(types): demote TypedDict floating string to a # comment

A string literal placed after a field declaration in a TypedDict body is
not a per-field docstring — it's an orphaned string expression Python
discards. Tools like mypy / pyright that inspect TypedDict fields won't
surface that text either.

Move the documentation for error_rate_limit_category to a real comment
so the intent is visible to readers and type-checker tooling without
the misleading docstring framing.
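
A small sketch of why the string form adds nothing at runtime. The two class names are hypothetical; only the field name error_rate_limit_category comes from this change.

```python
from typing import Optional, TypedDict


class ErrorInfoWithString(TypedDict, total=False):
    error_rate_limit_category: Optional[str]
    """Looks like a per-field docstring, but it is evaluated and discarded."""


class ErrorInfoWithComment(TypedDict, total=False):
    # One of the RateLimitErrorCategory string values, or None.
    error_rate_limit_category: Optional[str]


# Both classes declare exactly the same fields.
same_fields = (
    ErrorInfoWithString.__annotations__ == ErrorInfoWithComment.__annotations__
)
# The floating string is not attached to the class docstring either.
string_survives = "discarded" in (ErrorInfoWithString.__doc__ or "")
```

Because the string is not the first statement in the class body, Python does not store it as the class docstring, and nothing links it to the field.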

Greptile P2 review feedback on PR #27687.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* security(exceptions): do not auto-copy vendor response headers to e.headers

A vendor 429 response can set arbitrary headers (Set-Cookie, CORS
overrides, …). Previously, when RateLimitError was constructed with only
a 'response=' (no explicit 'headers=' kwarg), self.headers fell back to
a copy of response.headers. If a downstream proxy serializer ever
forwarded e.headers to the client, a malicious upstream could inject
browser-interpreted headers for the proxy origin.

Drop the fallback. Only headers passed explicitly via the headers= kwarg
make it onto self.headers (proxy hooks pass retry-after etc. — they
control what's surfaced). Vendor response headers stay reachable on
e.response.headers for callers that explicitly want them.

Today's proxy_server.py route handlers don't actually forward e.headers
on the wire (they construct ProxyException without passing headers), so
no current behavior changes — this is a defensive narrowing so the
fallback can never be turned into a vector when someone wires
e.headers through later.
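
The narrowed behavior, sketched with stand-ins. FakeResponse and SketchRateLimitError are hypothetical and only model the headers handling described above, not the real httpx.Response or litellm.RateLimitError.

```python
from typing import Dict, Optional


class FakeResponse:
    """Hypothetical stand-in for a vendor HTTP response."""

    def __init__(self, headers: Dict[str, str]) -> None:
        self.headers = headers


class SketchRateLimitError(Exception):
    def __init__(
        self,
        message: str,
        response: Optional[FakeResponse] = None,
        headers: Optional[Dict[str, str]] = None,
    ) -> None:
        super().__init__(message)
        self.response = response
        # No fallback to response.headers: only explicitly passed headers
        # are surfaced on the exception itself.
        self.headers = dict(headers) if headers is not None else None


vendor = FakeResponse({"set-cookie": "session=evil", "retry-after": "30"})
implicit = SketchRateLimitError("429 from vendor", response=vendor)
explicit = SketchRateLimitError(
    "429 from proxy hook", response=vendor, headers={"retry-after": "30"}
)
```

Callers that want the vendor headers can still read them from the response object, which keeps the choice explicit.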

Veria-AI security review feedback on PR #27687.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(rate-limit): regression guards for review-pass fixes

Pins down the three review-pass fixes:

* test_parallel_request_limiter_v1_helper_no_additional_details — calls
  raise_rate_limit_error() with no args and asserts the detail does NOT
  contain the literal string 'None'. Pre-fix, callers got 'Max parallel
  request limit reached None'.
* test_rate_limit_error_does_not_auto_copy_response_headers — passes a
  vendor httpx.Response with a Set-Cookie header to RateLimitError
  WITHOUT an explicit headers= kwarg, asserts self.headers stays None
  (no leak), then re-checks that an explicit headers= kwarg DOES
  populate self.headers. Vendor headers remain reachable on
  e.response.headers for callers that explicitly want them.
* The existing v1-helper test now also asserts the additional_details
  string makes it through to the detail.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(rate-limit): add orthogonal RateLimitType (requests/tokens/concurrent_requests/budget/max_iterations)

trho's last ask in the LIT-2968 thread: distinguish rate-limit failures by
the dimension that was exceeded, not just by who rate-limited (vendor vs.
litellm). Adds:

- RateLimitType str-enum exposed at `litellm.RateLimitType` with values
  requests / tokens / concurrent_requests / budget / max_iterations.
- `rate_limit_type` kwarg on litellm.RateLimitError + ProxyRateLimitError;
  None default so existing callers (vendor-429 path in exception_mapping_utils)
  remain a no-op.
- StandardLoggingPayloadErrorInformation.error_rate_limit_type so custom
  callbacks can split rate-limit failures by cause without parsing free-text
  error messages. Mirrors the error_rate_limit_category extraction in
  get_error_information(); a single isinstance(RateLimitError) check covers both.
- map_v3_rate_limit_type() helper to collapse the v3 limiter's internal labels
  ("requests", "tokens", "max_parallel_requests") onto the public enum so
  the v3 limiter and dynamic_rate_limiter_v3 share one mapping. Defensive
  None on unknown values rather than silently picking a wrong dimension.
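
A sketch of the mapping helper under stated assumptions: the enum values and the three v3 labels come from this commit, while the function signature and the lookup-table implementation are guesses at one reasonable shape.

```python
from enum import Enum
from typing import Dict, Optional


class RateLimitType(str, Enum):
    REQUESTS = "requests"
    TOKENS = "tokens"
    CONCURRENT_REQUESTS = "concurrent_requests"
    BUDGET = "budget"
    MAX_ITERATIONS = "max_iterations"


# v3-internal label -> public enum member.
_V3_TO_PUBLIC: Dict[str, RateLimitType] = {
    "requests": RateLimitType.REQUESTS,
    "tokens": RateLimitType.TOKENS,
    "max_parallel_requests": RateLimitType.CONCURRENT_REQUESTS,
}


def map_v3_rate_limit_type(v3_label: Optional[str]) -> Optional[RateLimitType]:
    """Return the public dimension, or None when the label is unknown."""
    if v3_label is None:
        return None
    # Unknown labels map to None instead of a possibly wrong dimension.
    return _V3_TO_PUBLIC.get(v3_label)
```

Returning None for unrecognized labels means a new internal label shows up as "no claim" rather than being attributed to the wrong dimension.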

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(proxy/hooks): wire rate_limit_type onto every limiter raise site

Each refactored proxy hook now populates rate_limit_type with the dimension
that actually tripped the limit, so downstream consumers (custom callbacks,
prometheus exporters via the StandardLoggingPayload) can split key/team/user
rate-limit failures by cause:

- parallel_request_limiter (v1): detect dimension from current vs. limit in
  the post-cache branch (concurrent_requests > tokens > requests, matches the
  boolean condition order). Base case (current is None, one limit set to 0)
  picks the most-specific zero. raise_rate_limit_error() helper accepts an
  explicit rate_limit_type kwarg with CONCURRENT_REQUESTS default (matches
  every existing internal call site, including the global-limit branch).
- parallel_request_limiter (v3): forward status["rate_limit_type"] through
  map_v3_rate_limit_type() so "max_parallel_requests" → CONCURRENT_REQUESTS
  for the public field while the raw v3 jargon stays on the HTTP header for
  wire-format backward compat.
- dynamic_rate_limiter (v1): TPM-zero → TOKENS, RPM-zero → REQUESTS. Pass
  data["model"] through so callbacks see the model that hit the limit
  (addresses the secondary "provider missing" complaint in the original
  Slack thread, partially — the model is what dashboards typically split on).
- dynamic_rate_limiter (v3): forward status["rate_limit_type"] via
  map_v3_rate_limit_type() at every raise site (model_saturation_check,
  priority_model, fail-closed unknown-descriptor guard). Also pass model.
- batch_rate_limiter: limit_type is hard-typed "requests"|"tokens" — map
  directly without going through the helper's None branch.
- max_budget_limiter, max_budget_per_session_limiter: BUDGET.
- max_iterations_limiter: MAX_ITERATIONS.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(rate-limit): cover RateLimitType enum, hook wiring, and StandardLoggingPayload propagation

27 new tests across five new test classes:

- TestRateLimitType: enum exposed at litellm.RateLimitType, all five values
  defined, RateLimitError default is None (vendor 429 path makes no claim
  about which dimension), accepts both string and enum forms with
  str-coercion guarantee for downstream JSON serializers.
- TestProxyRateLimitErrorType: ProxyRateLimitError default is None, accepts
  string or enum, doesn't break existing callers that pass nothing.
- TestMapV3RateLimitType: pins each v3-internal → public-enum mapping
  (tokens, requests, max_parallel_requests → concurrent_requests, unknown
  → None) so a future v3 refactor can't silently swap dimensions.
- TestStandardLoggingPayloadCarriesType: the new error_rate_limit_type
  field reaches the structured payload for both ProxyRateLimitError and
  plain RateLimitError, is None when unspecified, and is None for
  non-rate-limit exceptions (symmetric with error_rate_limit_category).
- TestProxyHooksWireTypeCorrectly: drives the actual raise sites in the
  v1 parallel_request_limiter helper, the v3 _handle_rate_limit_error
  (both "tokens" and "max_parallel_requests" paths), and the batch
  limiter (both tokens and requests paths) — coverage tools see the new
  rate_limit_type= kwargs as exercised, not just the import shape.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(rate-limit): cover _coerce_message branches and v1 dimension detection

Drives the patch coverage on the new orthogonal RateLimitType wiring up
to (or close to) 100% on the touched files.

ProxyRateLimitError._coerce_message — was 22% covered, now 100%:
* nested {error: {message}} dict
* nested {message: {message}} dict (alt key)
* dict without 'error'/'message' keys → JSON dump fallback
* non-JSON-serializable dict value → str() fallback
* non-string non-mapping detail (int) → str() coercion
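
The branches listed above could look roughly like the following. This is a sketch, not the actual ProxyRateLimitError._coerce_message: the function name coerce_message, the key lookup order, and the handling of plain-string inner values are assumptions. The None short-circuit reflects a later fix in this same log.

```python
import json
from collections.abc import Mapping
from typing import Any


def coerce_message(detail: Any) -> str:
    """Turn an HTTPException-style detail into a flat message string."""
    if detail is None:
        return ""  # avoid the literal string 'None'
    if isinstance(detail, str):
        return detail
    if isinstance(detail, Mapping):
        # Nested {"error": {"message": ...}} or {"message": {"message": ...}}.
        for key in ("error", "message"):
            inner = detail.get(key)
            if isinstance(inner, Mapping) and "message" in inner:
                return str(inner["message"])
            if isinstance(inner, str):
                return inner
        try:
            return json.dumps(detail)  # dict without the expected keys
        except (TypeError, ValueError):
            return str(detail)  # value that json cannot serialize
    return str(detail)  # e.g. an int
```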

v1 parallel_request_limiter dimension detection — was 0% covered, now
exercised across 6 parametrized cases:
* check_key_in_limits else-branch: current at concurrent / TPM / RPM cap
  → asserts rate_limit_type is concurrent_requests / tokens / requests.
* check_key_in_limits base case (current is None): max_parallel_requests
  / tpm_limit / rpm_limit set to 0 → asserts the most-specific zero
  attribution wins per the helper's order.

LIT-2968

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(proxy/hooks): add ProxyHTTPRateLimitError + provider resolver

Introduces a small helper layer used by every proxy-side rate-limit
hook so that the 429 they raise carries a populated llm_provider /
model — instead of an empty exception.llm_provider that downstream
loggers (Prometheus failure metric, observability callbacks) read as
'no provider attribution'.

ProxyHTTPRateLimitError inherits from both fastapi.HTTPException
(so the proxy server still renders it as a 429) and
litellm.exceptions.RateLimitError (so isinstance checks and
PrometheusLogger._get_exception_class_name pick up llm_provider).
We deliberately don't call RateLimitError.__init__ — it constructs
an httpx.Response we don't need and would just add failure surface;
attribute parity is what downstream consumers care about.

resolve_llm_provider_for_rate_limit() wraps litellm.get_llm_provider
defensively. Internal limiter hooks fire from async_pre_call_hook —
well before get_llm_provider runs anywhere else in the request
lifecycle — so we have to call it ourselves at raise time. If the
model is missing or unparseable (alias, router-only model) we fall
back to llm_provider='litellm_proxy' rather than letting a second
exception leak out and break the request path.
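
The defensive wrapper can be sketched as follows. To keep the example self-contained, the provider lookup is injected as a callable; fake_lookup and resolve_provider_sketch are hypothetical names and do not reproduce the signature of litellm.get_llm_provider or of the real resolver.

```python
from typing import Callable, Optional, Tuple

FALLBACK_PROVIDER = "litellm_proxy"


def fake_lookup(model: str) -> Tuple[str, str]:
    """Hypothetical lookup: returns (model, provider) or raises."""
    known = {"gpt-4o-mini": "openai"}
    if "/" in model:
        provider, _, name = model.partition("/")
        return name, provider
    if model in known:
        return model, known[model]
    raise ValueError(f"unknown model: {model}")


def resolve_provider_sketch(
    model: Optional[str],
    lookup: Callable[[str], Tuple[str, str]] = fake_lookup,
) -> Tuple[Optional[str], str]:
    """Never raise: fall back to 'litellm_proxy' on missing or unknown models."""
    if not model:
        return model, FALLBACK_PROVIDER
    try:
        resolved_model, provider = lookup(model)
    except Exception:
        # Swallow the secondary error so the 429 itself is what surfaces.
        return model, FALLBACK_PROVIDER
    return resolved_model, provider or FALLBACK_PROVIDER
```

Catching a broad Exception is deliberate in this sketch: the resolver runs while a 429 is already being raised, so a lookup failure should not replace it.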

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy/hooks): populate llm_provider on parallel-request 429s

Both v1 and v3 parallel-request limiters fired bare HTTPException(429)
from inside async_pre_call_hook. The downstream Prometheus failure
metric reads exception.llm_provider via _get_exception_class_name —
the empty value showed up as exception_class='HTTPException' and
left model_id='None' on the time series.

Threads requested_model through every raise site in:

* parallel_request_limiter.py:
  - check_key_in_limits (the per-key/per-model/per-user/per-team/
    per-customer over-limit path)
  - raise_rate_limit_error (zero-limit + global_max_parallel_requests
    paths) — now takes an optional requested_model kwarg
* parallel_request_limiter_v3.py:
  - _handle_rate_limit_error (the OVER_LIMIT translator), called
    from both the should_rate_limit pre-check and the TPM
    reservation path

Resolved via resolve_llm_provider_for_rate_limit so unknown / missing
models silently fall back to llm_provider='litellm_proxy' instead of
breaking the request path with a second exception.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy/hooks): populate llm_provider on dynamic-rate-limit 429s

Same plumbing change as the parallel limiters, applied to both
dynamic_rate_limiter (v1) and dynamic_rate_limiter_v3:

* v1: TPM-zero and RPM-zero paths in async_pre_call_hook now resolve
  data['model'] -> (model, llm_provider) once and pass it into both
  raises.
* v3: All three raise sites in _check_rate_limits — the
  model_saturation_check enforced raise, the priority_model
  enforced raise, and the fail-closed unknown-descriptor branch —
  now attribute the 429 to the actual provider.

Falls back to llm_provider='litellm_proxy' when the model can't be
resolved.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy/hooks): populate llm_provider on batch-rate-limit 429s

batch_rate_limiter._raise_rate_limit_error now takes a
requested_model kwarg threaded from data['model'] in
_check_and_increment_batch_counters. The batch-creation 429 is what
gets raised when the input file's tokens/requests count would push
the per-key TPM/RPM window over its limit.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy/hooks): populate llm_provider on budget/iterations 429s

Final batch of internal raise sites — the user/session-budget and
max-iterations hooks. Same pattern: resolve data['model'] once at
raise time, attach to ProxyHTTPRateLimitError so Prometheus and
observability callbacks can attribute the 429.

Hooks updated:
* max_budget_limiter (per-user max_budget exceeded)
* max_iterations_limiter (per-session agent iteration cap)
* max_budget_per_session_limiter (per-session dollar cap)

All three fall back to llm_provider='litellm_proxy' when data['model']
is missing or unparseable. Drops the now-unused HTTPException import
from each module.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(proxy/hooks): pin provider field on internal rate-limit 429s

Regression coverage for the 'provider field missing' bug across every
proxy-side rate-limit hook + the helper layer:

* ProxyHTTPRateLimitError class shape (HTTPException + RateLimitError,
  dict-detail stringification, None-provider normalization).
* resolve_llm_provider_for_rate_limit happy paths
  (gpt-4o-mini, anthropic/..., bedrock/...) plus all three fallback
  branches (None, '', unknown name) plus a 'get_llm_provider raises'
  case that asserts we swallow the secondary exception.
* For each limiter (parallel v1/v3, dynamic v1/v3, batch,
  max_budget, max_iterations, max_budget_per_session): assert the
  raised exception is a RateLimitError carrying the resolved
  model + llm_provider, and a sibling test that asserts the
  fallback path returns 'litellm_proxy' without leaking a second
  exception.
* Two PrometheusLogger._get_exception_class_name pins so the
  Prometheus failure metric label flips from 'HTTPException' to
  'Openai.ProxyHTTPRateLimitError' (or 'Litellm_proxy.*' on
  fallback) — that's what dashboards consume.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* perf(proxy/hooks): defer provider resolution to over-limit branches

* fix: use error_message in raise_rate_limit_error to avoid literal 'None' in detail

* Consolidate rate_limiter_utils imports in dynamic_rate_limiter

* fix(proxy): set num_retries/max_retries on ProxyHTTPRateLimitError

ProxyHTTPRateLimitError inherits from RateLimitError but did not call
RateLimitError.__init__, so num_retries/max_retries were never set.
When Starlette's HTTPException lacks __str__, the MRO falls through to
RateLimitError.__str__, which unconditionally reads these attributes
and raises AttributeError during logging/traceback formatting.
Initialize them to None defensively.
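
A sketch of the failure and the defensive fix, using stand-ins. The class names and the exact text that __str__ appends are assumptions; the point is only that skipping the base __init__ leaves attributes unset that an inherited __str__ reads.

```python
class FakeHTTPException(Exception):
    """Stand-in without its own __str__, so lookup continues along the MRO."""

    def __init__(self, status_code, detail=None):
        self.status_code = status_code
        self.detail = detail


class FakeRateLimitError(Exception):
    def __init__(self, message, num_retries=None, max_retries=None):
        self.message = message
        self.num_retries = num_retries
        self.max_retries = max_retries

    def __str__(self):
        text = self.message
        if self.num_retries:  # read unconditionally
            text += f" (retried {self.num_retries} times)"
        if self.max_retries:
            text += f" (max retries {self.max_retries})"
        return text


class Broken(FakeHTTPException, FakeRateLimitError):
    def __init__(self, detail):
        FakeHTTPException.__init__(self, 429, detail)
        self.message = detail  # FakeRateLimitError.__init__ is skipped


class Fixed(Broken):
    def __init__(self, detail):
        super().__init__(detail)
        # Defensive defaults so the inherited __str__ cannot fail.
        self.num_retries = None
        self.max_retries = None


def safe_str(exc):
    try:
        return str(exc)
    except AttributeError as err:
        return f"AttributeError: {err}"
```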

* fix(mypy): silence base-class status_code conflict on ProxyHTTPRateLimitError

HTTPException declares 'status_code: int' while openai.RateLimitError
(via APIStatusError) declares 'status_code: Literal[429] = 429'. Mypy
flags the multi-base override as [misc] in CI lint. The runtime semantics
are fine (we set self.status_code in __init__), so silence the
class-level annotation conflict with a targeted ignore.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix: annotate batch limiter _raise_rate_limit_error as NoReturn

* feat(prometheus): rate-limit category/type labels + exception_class back-compat (follow-up to #27687) (#27706)

* feat(prometheus): add rate_limit_category and rate_limit_type labels

Adds two new labels to litellm_proxy_failed_requests_metric so dashboards
can split 429s by rate-limit source (vendor vs. litellm-internal) and by
the dimension that was exceeded (requests/tokens/concurrent_requests/
budget/max_iterations) without parsing free-text error messages.

Closes the Prometheus side of LIT-2718. The unified RateLimitError.category
and .rate_limit_type fields landed in PR #27687 but were only surfaced on
StandardLoggingPayload (custom-callback channel); this exposes them on
the metric label set as well.

Both labels are populated only when the underlying exception is a
litellm.RateLimitError; non-rate-limit failures keep them empty.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(prometheus): populate rate-limit labels + preserve exception_class back-compat

Two coupled changes in the Prometheus integration:

1. async_post_call_failure_hook now extracts the new RateLimitError
   .category / .rate_limit_type fields (added in PR #27687) via a
   _extract_rate_limit_labels helper and forwards them through
   UserAPIKeyLabelValues onto litellm_proxy_failed_requests_metric.
   Empty for non-rate-limit failures.

2. _get_exception_class_name special-cases ProxyRateLimitError and
   keeps emitting 'HTTPException' for the exception_class label.
   Without this shim, ProxyRateLimitError (which multi-inherits from
   HTTPException + RateLimitError) would silently flip the label
   from 'HTTPException' (the historical value for proxy-side 429s)
   to 'ProxyRateLimitError', breaking existing dashboards / alerts
   that key off exception_class='HTTPException'. Distinguishing
   vendor vs. litellm 429s is now the job of the new
   rate_limit_category label.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(prometheus): cover rate-limit labels and exception_class back-compat

Adds 19 tests across:
- enum / label-list registration
- _extract_rate_limit_labels for vendor RateLimitError, ProxyRateLimitError,
  non-rate-limit and None inputs (incl. parametrized over every
  RateLimitErrorCategory x RateLimitType combo)
- _get_exception_class_name back-compat: ProxyRateLimitError keeps the
  legacy 'HTTPException' string while vendor RateLimitError keeps the
  historical 'Provider.ClassName' format
- end-to-end through async_post_call_failure_hook with both
  ProxyRateLimitError and vendor RateLimitError, asserting both new
  labels populate and exception_class stays back-compat

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(prometheus): tolerate missing fastapi in lazy ProxyRateLimitError import

Address greptile feedback:
- async_post_call_failure_hook docstring: drop the stale labelnames listing
  and reference PrometheusMetricLabels.litellm_proxy_failed_requests_metric
  as the source of truth so the doc cannot drift from the actual labelset.
- _get_exception_class_name: guard the lazy ProxyRateLimitError import with
  ImportError so router-side fallback callsites don't blow up in non-proxy
  installs that don't have fastapi (a transitive dep of
  proxy.common_utils.proxy_rate_limit_error). Behavior is unchanged when
  fastapi is available.

Also fix the existing enterprise callback test that asserted the old
labelset on litellm_proxy_failed_requests_metric — it now expects the new
rate_limit_category / rate_limit_type labels populated for vendor 429s.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(bugbot): simplify rate-limit label coercion + guard None detail

- prometheus.py _extract_rate_limit_labels: RateLimitError.__init__ already
  normalizes category/rate_limit_type to plain str, so the getattr(.value)
  + isinstance dance was dead code. Reduce to str(value) if not None.
- proxy_rate_limit_error.py _coerce_message: short-circuit None to ''
  instead of falling through to str(None) = 'None', which produced the
  literal message 'litellm.RateLimitError: None'.

* fix(rate-limit): surface unified category/type fields on BudgetExceededError

The most common budget cap (virtual-key max_budget enforcement in
auth_checks.py) raises litellm.BudgetExceededError, a bare Exception
subclass that bypassed the unified rate-limit error class introduced
by PR #27687. Custom callbacks reading
StandardLoggingPayload.error_information saw category=None and
rate_limit_type=None for these 429s, missing the most common budget
case (team / org / end-user budgets all hit the same code path).

Surface the fields off BudgetExceededError as plain attributes:
- category = RateLimitErrorCategory.LITELLM_RATE_LIMIT
- rate_limit_type = RateLimitType.BUDGET
- llm_provider = "" (or caller-supplied)

Switch get_error_information and _extract_rate_limit_labels from
isinstance(RateLimitError) gating to duck-typed attribute reads,
guarded by membership in the rate-limit enums so unrelated third-party
exceptions exposing a .category attribute can't leak garbage values
into the payload.

This is strictly additive: BudgetExceededError keeps its bare-Exception
base class, so `except BudgetExceededError:` handlers keep firing and
`except RateLimitError:` does not start catching budget errors.

* fix(rate-limit): validate enum membership at duck-typed read sites + enrich BudgetExceededError llm_provider

Two follow-ups uncovered during the second QA pass on PR #27687:

1. Guard third-party `.category` / `.rate_limit_type` attribute leakage.
   The duck-typed read in `get_error_information` and
   `_extract_rate_limit_labels` would forward any string attribute named
   `category` / `rate_limit_type` on an unrelated third-party exception
   into the StandardLoggingPayload and Prometheus labels — silently
   mislabeling custom-callback payloads and blowing out Prometheus label
   cardinality. Add `validate_rate_limit_category` /
   `validate_rate_limit_type` helpers that gate on the documented enum
   value sets; non-matching values are dropped to None.

2. Enrich BudgetExceededError.llm_provider from request_data.
   Budget checks live in tenant-scoped helpers (key / team / org / tag /
   end-user / project) that don't see the request model, so the
   BudgetExceededError they raise carried llm_provider="" — leaving
   custom-metrics consumers without provider attribution for the most
   common 429 case. Resolve it once at the central
   UserAPIKeyAuthExceptionHandler seam, before post_call_failure_hook
   fires, so the StandardLoggingPayload the callback sees has the same
   provider attribution as RPM/TPM 429s.

Regression tests pin both: 4 leakage tests + 4 enrichment tests. The
leakage tests would fail under the pre-validation version of either read
site; the enrichment tests would fail if the handler skipped the
resolver call.
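
The enum-membership gate from item 1 might look roughly like this. It is a sketch under assumptions: the category values come from the PR description, the type values are illustrative, and the real helpers may differ in signature. `ThirdPartyError` is a hypothetical unrelated exception.

```python
from typing import Any, FrozenSet, Optional

# Category values as listed in the PR description.
VALID_CATEGORIES: FrozenSet[str] = frozenset(
    {
        "vendor_rate_limit",
        "vendor_batch_rate_limit",
        "litellm_rate_limit",
        "litellm_batch_rate_limit",
    }
)
# Illustrative type values; the real enum may contain more or different members.
VALID_TYPES: FrozenSet[str] = frozenset(
    {"requests", "tokens", "concurrent_requests", "budget"}
)


def _validate(value: Any, allowed: FrozenSet[str]) -> Optional[str]:
    raw = getattr(value, "value", value)  # accept enum members or plain strings
    if isinstance(raw, str) and raw in allowed:
        return raw
    return None  # anything outside the documented set is dropped


def validate_rate_limit_category(value: Any) -> Optional[str]:
    return _validate(value, VALID_CATEGORIES)


def validate_rate_limit_type(value: Any) -> Optional[str]:
    return _validate(value, VALID_TYPES)


class ThirdPartyError(Exception):
    """Hypothetical unrelated exception that happens to expose .category."""

    category = "billing"
```

Dropping unknown values to None keeps the set of possible label values bounded, which is what protects Prometheus label cardinality.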

* fix(rate-limit): resolve router model_name aliases to real provider (#27914)

* fix(rate-limit): resolve router model_name aliases to real provider

For nearly every real LiteLLM proxy deployment the request model is a
router model_name alias (e.g. 'tpm-locked' -> litellm_params.model:
openai/gpt-4o-mini), and 'litellm.get_llm_provider' doesn't know about
router aliases; it raises 'LLMProviderNotProvidedError'. The resolver
then fell through to the defensive 'litellm_proxy' fallback, so the
'llm_provider' field this PR adds was effectively always
'litellm_proxy' in production, defeating its purpose for the most
common proxy configuration.

Add a router-alias fallback step: when 'get_llm_provider' raises, scan
the active 'llm_router.model_list' for a deployment whose 'model_name'
matches the request model and resolve from its 'litellm_params.model'
instead. If multiple deployments share the same alias (load-balancing
case) the first one wins — every deployment under one alias should
agree on provider in any sensible config, and 'first' is deterministic
so the Prometheus label stays stable.

Defensive throughout: an uninitialized router, a malformed deployment,
a 'litellm_params.model' that itself fails 'get_llm_provider' — every
branch falls through to the existing 'litellm_proxy' fallback rather
than letting a secondary exception escape and mask the rate-limit
error we're trying to surface.
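
The fallback chain described above can be sketched as follows. This is not the actual resolver: `toy_get_llm_provider` is a hypothetical stand-in for `litellm.get_llm_provider` (it only splits on the first `/`), and `resolve_provider` is an illustrative name.

```python
from typing import Any, Callable, Tuple

FALLBACK_PROVIDER = "litellm_proxy"


def toy_get_llm_provider(model: str) -> Tuple[str, str]:
    """Hypothetical stand-in: return (model, provider) or raise."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"provider not found for {model!r}")
    return name, provider


def resolve_provider(
    model: str,
    model_list: Any,
    get_provider: Callable[[str], Tuple[str, str]] = toy_get_llm_provider,
) -> str:
    # Step 1: direct lookup works for 'provider/model' style names.
    try:
        return get_provider(model)[1]
    except Exception:
        pass
    # Step 2: treat the name as a router alias; the first matching deployment wins.
    try:
        for deployment in model_list or []:
            if isinstance(deployment, dict) and deployment.get("model_name") == model:
                underlying = deployment["litellm_params"]["model"]
                return get_provider(underlying)[1]
    except Exception:
        # Malformed deployments or a non-iterable model_list must not
        # mask the original 429 with a secondary error.
        pass
    # Step 3: defensive fallback.
    return FALLBACK_PROVIDER
```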

Tests:
  - test_router_alias_resolves_to_underlying_provider: alias
    'tpm-locked' -> 'openai/gpt-4o-mini' produces provider='openai',
    model='gpt-4o-mini'.
  - test_router_alias_with_multiple_deployments_uses_first.
  - test_router_alias_unknown_falls_back.
  - test_router_alias_with_malformed_deployment_falls_back.
  - Existing fallback test updated to also stub
    'litellm.proxy.proxy_server.llm_router' so it exercises the
    full 'no resolution anywhere' path.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(rate-limit): harden router alias resolver + test isolation

- Wrap _resolve_provider_from_router_alias loop in top-level try/except so
  a non-iterable model_list / unexpected deployment shape can't escape and
  mask the 429 with a 500.
- Type-check litellm_params before .get() to handle non-dict truthy values.
- Patch llm_router=None in the parametrized fallback test so a router left
  by another test in the session can't redirect the unknown-model path.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(bugbot): preserve "BudgetExceededError" Prometheus label

Adding llm_provider to BudgetExceededError (so callbacks get provider
attribution from StandardLoggingPayload) made the provider-prefix step in
_get_exception_class_name silently flip the label from "BudgetExceededError"
to e.g. "Openai.BudgetExceededError", breaking dashboards keyed on the
historical value.

Short-circuit BudgetExceededError in _get_exception_class_name the same way
ProxyRateLimitError already is. Provider/category attribution still lands on
the new rate_limit_category / rate_limit_type labels.

* test: fix invalid 'rpm' rate_limit_type in v3 limiter test mocks

The v3 rate limiter only emits 'requests', 'tokens', or
'max_parallel_requests'. Using 'rpm' caused map_v3_rate_limit_type to
return None, leaving the expected RateLimitType.REQUESTS untested.
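
To illustrate why a mock value of 'rpm' left the mapping untested, here is a sketch of a lookup-table mapper. The enum members and the `max_parallel_requests` -> `concurrent_requests` pairing are assumptions for illustration; the real `map_v3_rate_limit_type` may be implemented differently.

```python
from enum import Enum
from typing import Dict, Optional


class RateLimitType(str, Enum):  # stand-in with assumed members
    REQUESTS = "requests"
    TOKENS = "tokens"
    CONCURRENT_REQUESTS = "concurrent_requests"


# The three strings the v3 limiter emits, per the commit message.
_V3_TO_UNIFIED: Dict[str, RateLimitType] = {
    "requests": RateLimitType.REQUESTS,
    "tokens": RateLimitType.TOKENS,
    "max_parallel_requests": RateLimitType.CONCURRENT_REQUESTS,
}


def map_v3_rate_limit_type(raw: Optional[str]) -> Optional[RateLimitType]:
    """Sketch: unknown inputs such as 'rpm' map to None."""
    if raw is None:
        return None
    return _V3_TO_UNIFIED.get(raw)
```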

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(bugbot): hoist provider resolver + opt-in prom rate-limit labels

- dynamic_rate_limiter.py: hoist resolve_llm_provider_for_rate_limit
  above the TPM/RPM if/elif so the lookup runs once per request, matching
  the pattern in dynamic_rate_limiter_v3.py.
- prometheus.py: gate the new rate_limit_category / rate_limit_type
  labels on litellm_proxy_failed_requests_metric behind
  litellm.prometheus_emit_rate_limit_labels (default False). Mirrors the
  existing prometheus_emit_stream_label opt-in. Preserves the metric's
  pre-unification label set so existing dashboards / recording rules
  keep matching after upgrade; operators can enable the new labels once
  downstream consumers include them.
- Tests updated: default-off back-compat case, opt-in path enables the
  flag before asserting label presence.

* fix: stabilize prometheus label sets and drop redundant model normalization

- Cache PrometheusLogger.get_labels_for_metric per metric_name so that
  the label set used to construct counters at __init__ time stays in
  sync with the label set used at increment time, even if module-level
  toggles like prometheus_emit_rate_limit_labels or
  prometheus_emit_stream_label are flipped at runtime. Without this,
  toggling these flags after the logger was created would cause
  ValueError from prometheus_client because the runtime labels would
  not match the counter's declared labelnames.
- Drop redundant 'model or ""' guard in ProxyRateLimitError.__init__
  where model is already normalized one step earlier.
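
The caching behavior in the first bullet can be sketched without prometheus_client. Everything here is a stand-in: `settings` plays the role of the module-level toggle, `LabelRegistry` the role of the logger, and the base label names are illustrative.

```python
from types import SimpleNamespace
from typing import Dict, List

# Stand-in for a module-level flag such as prometheus_emit_rate_limit_labels.
settings = SimpleNamespace(emit_rate_limit_labels=False)

BASE_LABELS = ["exception_class", "route"]  # illustrative
RATE_LIMIT_LABELS = ["rate_limit_category", "rate_limit_type"]


class LabelRegistry:
    """Hypothetical stand-in for the logger's per-metric label cache."""

    def __init__(self) -> None:
        self._cached_metric_labels: Dict[str, List[str]] = {}

    def _compute_labels(self, metric_name: str) -> List[str]:
        labels = list(BASE_LABELS)
        if settings.emit_rate_limit_labels:
            labels += RATE_LIMIT_LABELS
        return labels

    def get_labels_for_metric(self, metric_name: str) -> List[str]:
        cached = self._cached_metric_labels.get(metric_name)
        if cached is None:
            # First read wins: later flag flips do not change this entry.
            cached = self._compute_labels(metric_name)
            self._cached_metric_labels[metric_name] = cached
        return list(cached)
```

With this shape, the labels passed at increment time always equal the labels used at registration time. A test that flips the flag after construction has to clear `_cached_metric_labels` to observe the new label set.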

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* perf(dynamic_rate_limiter): only resolve provider when rate limit hit

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(prometheus): clear cached metric labels after toggling rate-limit flag

The PrometheusLogger caches each metric's label set at construction
time so that labels used at counter.labels(...) time stay consistent
with the labels the metric was registered with. The enterprise
async_post_call_failure_hook test toggles
litellm.prometheus_emit_rate_limit_labels = True AFTER the fixture
has already built the logger, so without invalidating the cache the
rate_limit_category / rate_limit_type labels never reach the mocked
counter and the assert_called_once_with check fails.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test: fix CI failures from prom label cache + flaky time-window assertion

PrometheusLogger.get_labels_for_metric now caches the per-metric label
set at first read so the labels passed to counter.labels(...) stay in
lock step with the labels the counter was registered with. This broke
two existing test patterns:

- test_prometheus_labels.py: tests bind the real method onto a
  MagicMock, but MagicMock auto-creates a Mock for _cached_metric_labels
  whose .get(...) returns a truthy Mock — treated as a populated cache
  and returned as the label set, producing empty filtered labels and
  KeyError on labels["requested_model"] / ["route"]. Seed real {}
  containers for _cached_metric_labels and label_filters before binding.

- test_prometheus_logging_callbacks.py::test_set_team_budget_metrics_with_custom_labels:
  the fixture builds the logger before the test monkeypatches
  litellm.custom_prometheus_metadata_labels, so the cached label set
  never picks up the new metadata labels. Clear the cache after the
  monkeypatch (same pattern already used for the rate-limit toggle in
  test_async_post_call_failure_hook).
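
The MagicMock pitfall in the first bullet is easy to reproduce in isolation. `Logger` below is a hypothetical stand-in for the real class, and the truthiness check is an assumption about how the cache lookup is written; the default label names are illustrative.

```python
from unittest.mock import MagicMock


class Logger:
    """Hypothetical stand-in for the cached-label lookup."""

    def get_labels_for_metric(self, metric_name):
        cached = self._cached_metric_labels.get(metric_name)
        if cached:  # an auto-created Mock is truthy, so it looks like a hit
            return cached
        return ["requested_model", "route"]


# Unseeded: MagicMock invents _cached_metric_labels, and .get(...) returns a Mock.
unseeded = MagicMock()
result_unseeded = Logger.get_labels_for_metric(unseeded, "some_metric")

# Seeded: a real dict makes the lookup miss and fall through to the real labels.
seeded = MagicMock()
seeded._cached_metric_labels = {}
result_seeded = Logger.get_labels_for_metric(seeded, "some_metric")
```

In the unseeded case the "label set" is itself a Mock, so any downstream filtering sees no real label names.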

UI: view_logs/index.test.tsx "Last Minute" window assertion is off by
one at the minute boundary. start_date is floored to the minute, so the
dropped sub-minute fraction can push the truncated-seconds diff up to
(minMinutes+1)*60 exactly when the click lands near a minute rollover.
Switch the upper bound to toBeLessThanOrEqual.
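
The boundary case can be checked with a little arithmetic. This sketch is in Python rather than the test's TypeScript, and it assumes the end of the window is read slightly after the click; `window_seconds` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def window_seconds(click: datetime, end: datetime, minutes: int) -> int:
    """Truncated-seconds span of a 'last N minutes' window (sketch)."""
    # start_date is floored to the minute, dropping the sub-minute fraction.
    start = (click - timedelta(minutes=minutes)).replace(second=0, microsecond=0)
    return int((end - start).total_seconds())


min_minutes = 1
upper_bound = (min_minutes + 1) * 60

# Click lands just before a minute rollover; end is read just after it.
click = datetime(2026, 1, 1, 12, 0, 59, 999000)
end = datetime(2026, 1, 1, 12, 1, 0, 1000)
diff = window_seconds(click, end, min_minutes)
```

Here `diff` equals `upper_bound`, so a strict less-than assertion fails while less-than-or-equal passes.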

* feat(otel-v2): surface rate_limit_category + rate_limit_type on failed LLM-call spans

PR #28909 introduced the typed v2 OTel engine that builds spans from
StandardLoggingPayload, with SpanError carrying error_type + message and
the genai mapper stamping error.type onto every failed LLM-call span.
This PR's earlier commits added error_rate_limit_category and
error_rate_limit_type to the same StandardLoggingPayload.error_information
the v2 engine reads — but neither field reached a span attribute, so v2
OTel traces stayed opaque about *why* a 429 fired (vendor vs litellm,
RPM vs TPM vs concurrent vs budget vs max_iterations) even after the
custom-callback and prometheus surfaces gained that decomposition.

Three coupled changes:

1. semconv.py: add LiteLLM.ERROR_RATE_LIMIT_CATEGORY /
   LiteLLM.ERROR_RATE_LIMIT_TYPE under the litellm.* vendor namespace
   (no GenAI semconv equivalent exists for who-rate-limited /
   which-dimension).

2. payloads.py: extend SpanError with rate_limit_category +
   rate_limit_type, populated by _parse_error() from the same
   error_information.error_rate_limit_* fields the custom-callback
   channel and prometheus rate_limit_category / rate_limit_type labels
   read. Single source of truth across all three observability surfaces.

3. mappers/genai.py: stamp the two attributes on the LLM-call span when
   present. drop_none guarantees they stay absent (not 'None') for
   non-rate-limit failures so trace consumers can read them
   unconditionally.

Three regression tests in test_otel_v2_emitter.py pin: a vendor /
litellm-internal RateLimitError lands category=litellm_rate_limit +
rate_limit_type=requests on the span; a BudgetExceededError lands
rate_limit_type=budget; a non-rate-limit failure (BadRequestError)
keeps the rate_limit_* attributes absent. Mutation-tested against
reverting either the SpanError extension or the _parse_error read site:
both rate-limit tests fail under either mutation.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test: align prometheus user-budget + logs quick-select tests with merged code

The merge into this branch left two test patterns out of step with the code
they exercise.

test_set_user_budget_metrics_includes_user_email_and_alias_labels_when_opted_in
flipped litellm.prometheus_user_budget_label_include_email_alias after the
fixture had already built the PrometheusLogger. get_labels_for_metric now
snapshots each metric's label set at construction time, so the runtime flip
no longer reached the cached labels. Enable the flag before constructing the
logger, matching how the proxy applies config at startup.

view_logs/index.test.tsx referenced uiSpendLogsCall and moment without
importing them, and the merged index.tsx now fetches through
useLogFilterLogic (the hook the file stubs out) rather than calling
uiSpendLogsCall directly. Add the imports and restore the real hook for the
Quick Select window assertions so the call is actually observed.

* refactor(otel/v2): drop rate-limit decomposition from the LLM-call span

Proxy-side rate limits (litellm_rate_limit, budget, max_iterations) are
rejected at the gate before any upstream call, so async_post_call_failure_hook
tags the synthetic failure log with LITELLM_LOGGING_NO_UPSTREAM_LLM_CALL and the
v2 OTel logger never opens an LLM-call span for them; the
litellm.error.rate_limit_category / litellm.error.rate_limit_type attributes
were dead for exactly the cases they were meant to surface. The only failure
that does open an LLM-call span carrying a RateLimitError is a vendor 429, where
rate_limit_type is always None and the category just restates
error.type=RateLimitError.

The decomposition still reaches downstream consumers through
StandardLoggingPayload.error_information.error_rate_limit_* and the prometheus
rate_limit_category / rate_limit_type labels, both unchanged.

Removes the SpanError fields, the _parse_error reads, the genai mapper
attributes, the semconv keys, and the three span tests that asserted a scenario
that never reaches the mapper in production.

* fix(batch_rate_limiter): map max_parallel_requests to concurrent_requests

* refactor(prometheus): drop transitive fastapi import from _get_exception_class_name

Read the legacy exception_class label from a prometheus_exception_class_name
marker on ProxyRateLimitError instead of importing the proxy module, keeping
the integrations layer free of a transitive fastapi dependency.
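
A marker-attribute lookup of this kind might look as follows. It is a sketch, not the real `_get_exception_class_name`: the marker value, the stand-in classes, and the provider-prefix formatting are assumptions for illustration.

```python
class ProxyRateLimitError(Exception):
    """Stand-in; the marker value below is hypothetical."""

    prometheus_exception_class_name = "ProxyRateLimitError"


class VendorError(Exception):
    """Hypothetical provider-attributed exception without a marker."""

    llm_provider = "openai"


def get_exception_class_name(exc: Exception) -> str:
    # Read the marker by attribute, so this module never has to import
    # the proxy package (and, transitively, fastapi).
    marker = getattr(exc, "prometheus_exception_class_name", None)
    if isinstance(marker, str) and marker:
        return marker
    name = type(exc).__name__
    provider = getattr(exc, "llm_provider", None)
    if isinstance(provider, str) and provider:
        return f"{provider.capitalize()}.{name}"
    return name
```

The integrations layer depends only on an attribute name, which keeps the import graph one-directional.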

* chore(ui): sync schema.d.ts with unified rate-limit error spec

The ProxyRateLimitError docstring flows into the proxy OpenAPI spec's 429
response description, so the generated dashboard types were out of sync.
Regenerated via npm run gen:api (Check UI API Types Sync).

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
2026-06-06 17:50:29 -07:00
litellm_enterprise feat: standardize rate limit errors with category, rate_limit_type, model, and llm_provider fields (#27687) 2026-06-06 17:50:29 -07:00
conftest.py test mapped test fixes 2025-08-23 17:04:23 -07:00