The flag was an opt-in escape hatch for the cross-tenant leak the rest
of the patch closes — flipping it on (env var or constructor param)
re-enables exactly the VERIA-54 primitive on either backend. There is
no operational need that the secure path doesn't already meet:
- Qdrant: legacy points without ``litellm_cache_key`` payload are
excluded by the must-clause filter and treated as misses; new sets
populate the cache key, so cold-start lasts only as long as the
natural cache rebuild.
- Redis: existing unscoped index can't carry the new schema; the init
path falls back to ``{name}_isolated`` (and recreates it on stale
schema), leaving the legacy index untouched.
Drop the constructor param, env-var fallback, ``_using_legacy_unscoped_index``
flag, the legacy-reuse branch in ``_init_semantic_cache``, and the
matching guards in set/get paths. Update tests to drop the legacy-mode
cases and assert the secure-only behaviour.
Mypy infers the dict's value type from the first branch
(Dict[str, bool]), which clashes with the scalar branch's mixed-type
inner dict. An explicit Dict[str, Any] annotation overrides the inference.
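A minimal sketch of the inference clash, assuming a two-branch builder; `build_condition` and its keys are hypothetical and are not the code touched by the patch:

```python
from typing import Any, Dict, Union


def build_condition(value: Union[bool, str]) -> Dict[str, Any]:
    """Hypothetical helper where both branches assign to the same variable."""
    # Without this annotation, mypy infers Dict[str, Dict[str, bool]] from the
    # first assignment and then rejects the mixed-type inner dict below.
    condition: Dict[str, Any]
    if isinstance(value, bool):
        condition = {"match": {"value": value}}
    else:
        condition = {"match": {"value": value, "case_sensitive": True}}
    return condition
```

Declaring the variable before the branches keeps the runtime behaviour identical; only the type mypy checks against changes.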
The hooks gated on ``call_type == "completion"`` but the proxy ingress
passes ``route_type`` straight through as ``call_type`` —
``"acompletion"`` for /v1/chat/completions and ``"aresponses"`` for
/v1/responses. Tests passed because they used the literal sync
``"completion"`` value, masking the gap.
Switch both hooks to ``is_text_content_call_type`` (matches the
canonical runtime values: completion / acompletion / aresponses) and
update existing tests to assert against runtime values, plus parametrize
a regression test that pins the gate.
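The gap can be illustrated with a small sketch. The body of `is_text_content_call_type` below is an assumption built from the three values named above, not the library's implementation, and `old_gate` is a hypothetical name for the previous check:

```python
# Assumed set of runtime call types that carry text content (from the values
# listed in this message); the real helper may be defined differently.
TEXT_CONTENT_CALL_TYPES = frozenset({"completion", "acompletion", "aresponses"})


def old_gate(call_type: str) -> bool:
    """Previous check: only the literal sync value passed."""
    return call_type == "completion"


def is_text_content_call_type(call_type: str) -> bool:
    """Sketch of the new check: accept every text-content call type."""
    return call_type in TEXT_CONTENT_CALL_TYPES


# The proxy passes route_type through as call_type, so these are the values
# the hooks actually see at runtime.
for runtime_value in ("acompletion", "aresponses"):
    print(runtime_value, old_gate(runtime_value), is_text_content_call_type(runtime_value))
```

Both runtime values fail the old gate and pass the new one, which is the behaviour the regression test is meant to pin.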
Three call sites raised the same BadRequestError("Invalid reasoning_effort:
... Must be one of 'minimal', 'low', ...") block when REASONING_EFFORT_TO_OUTPUT_CONFIG_EFFORT
returned None: anthropic chat map_openai_params, bedrock converse
_handle_reasoning_effort_parameter, and databricks chat reasoning_effort path.
Extract AnthropicConfig._raise_invalid_reasoning_effort(model, value, llm_provider)
so future copy edits / valid-set changes happen in one place. Typed as NoReturn
so type-checkers correctly narrow control flow at call sites.
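A sketch of why the NoReturn annotation matters at call sites. The exception class, the mapping contents, and the error text are stand-ins chosen for illustration; only the helper's shape (model, value, llm_provider) comes from this message:

```python
from typing import Dict, NoReturn, Optional


class BadRequestError(Exception):
    """Stand-in for the real exception type."""


# Placeholder mapping; the real REASONING_EFFORT_TO_OUTPUT_CONFIG_EFFORT
# contents are not reproduced here.
EFFORT_MAP: Dict[str, str] = {"low": "low", "medium": "medium", "high": "high"}


def raise_invalid_reasoning_effort(model: str, value: object, llm_provider: str) -> NoReturn:
    # Illustrative message only, not the library's wording.
    raise BadRequestError(f"{llm_provider}/{model}: invalid reasoning_effort {value!r}")


def map_effort(model: str, value: str) -> str:
    mapped: Optional[str] = EFFORT_MAP.get(value)
    if mapped is None:
        raise_invalid_reasoning_effort(model, value, "anthropic")
    # Because the helper is typed NoReturn, a type-checker knows this line is
    # only reached when mapped is a str, so no extra cast or assert is needed.
    return mapped
```

If the helper were annotated `-> None` instead, the final `return mapped` would be flagged as returning `Optional[str]`.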
Addresses Michael's review on PR #27074.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Three call sites (anthropic chat, bedrock converse, bedrock invoke messages)
emitted the same '...Effort is only supported on Opus 4.5+, Sonnet 4.6+, and
Mythos Preview' warning verbatim. Extract DROP_UNSUPPORTED_OUTPUT_CONFIG_WARNING
in litellm/llms/anthropic/chat/transformation.py and import it from the bedrock
sites so future copy edits live in one place.
Addresses Michael's review on PR #27074.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Strip out the explanatory and historical comments that don't carry
business-logic justification. Comments that simply narrate what code
does — or that explain prior behavior, what was changed, or which PR
introduced a fix — are removed. Docstrings are reduced to a one-line
summary where the long form repeated information already evident from
the code or test data.
No code-behavior changes. All 643 affected unit tests still pass.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
`test_aaamodel_prices_and_context_window_json_is_valid` validates the
model-map JSON against an explicit schema with `additionalProperties`,
so the new `supports_adaptive_thinking` flag added in
98ced0ae43 needs a matching schema entry.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Three of Greptile's open comments on #27074 (P2 converse:512, P1
databricks:361, and the underlying capability-flag policy rule) flagged
the same pattern: _is_claude_4_6_model(...) or _is_claude_4_7_model(...)
used inline as a runtime 'is this an adaptive-thinking model?' check.
That requires a code release each time a new adaptive Claude lands.
Consolidate the inline gating to AnthropicModelInfo._is_adaptive_thinking_model,
and switch the helper itself to read a new supports_adaptive_thinking
flag from `model_prices_and_context_window.json` via `_supports_factory`,
falling back to the family pattern only when the model-map entry doesn't
carry the flag (preserves OpenRouter / Vercel / Bedrock-prefixed variants
that route through the same code path with non-canonical ids).
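The flag-first, pattern-second lookup can be sketched as follows. The toy model map, the regex, and the function body are assumptions for illustration; the real helper reads the flag through `_supports_factory` and may treat a missing or false flag differently:

```python
import re
from typing import Dict, Optional

# Toy stand-in for model_prices_and_context_window.json (hypothetical contents).
TOY_MODEL_MAP: Dict[str, Dict[str, bool]] = {
    "claude-opus-4-7": {"supports_adaptive_thinking": True},
    "example-legacy-model": {"supports_adaptive_thinking": False},
}

# Assumed family pattern; the regex used in the codebase may differ.
ADAPTIVE_FAMILY = re.compile(r"(opus|sonnet)-4-[67]")


def is_adaptive_thinking_model(model: str) -> bool:
    """Prefer the model-map flag; fall back to the family pattern if absent."""
    flag: Optional[bool] = TOY_MODEL_MAP.get(model, {}).get("supports_adaptive_thinking")
    if flag is not None:
        return flag
    # Non-canonical ids (provider-prefixed variants) have no map entry here,
    # so they are matched on the family substring instead.
    return ADAPTIVE_FAMILY.search(model) is not None
```

With this shape, a new adaptive model only needs a JSON entry; the regex remains as a safety net for ids that never appear in the map.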
Adds `supports_adaptive_thinking: true` to the 4.6/4.7 anthropic
entries (opus-4-6 + dated, opus-4-7 + dated, sonnet-4-6). Bedrock-prefixed
and Vertex-prefixed entries don't need the flag because both fall back
through the family pattern (the helper short-circuits early on True from
either path) and the bedrock/vertex Claude IDs all match the existing
opus-4-{6,7} / sonnet-4-{6,7} pattern.
Affected call sites:
- `bedrock/chat/converse_transformation.py:_handle_reasoning_effort_parameter`
- `anthropic/chat/transformation.py:_map_reasoning_effort`
- `anthropic/chat/transformation.py:map_openai_params` (output_config branch)
- `databricks/chat/transformation.py:map_openai_params` (output_config branch)
The remaining `_is_claude_4_6_model` / `_is_claude_4_7_model` references
in `AnthropicConfig._validate_effort_for_model` and
`AnthropicConfig.get_supported_openai_params` are intentionally retained:
they're per-model gating fallbacks for variants whose model-map entries
don't yet carry the `supports_max_reasoning_effort` /
`supports_reasoning` flag. Those are documented in-place.
Tests: 537 anthropic/bedrock/databricks/vertex/messages tests pass.
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Removes the duplicated max/xhigh gating logic in
_validate_anthropic_adaptive_effort and the now-unused
_supports_effort_level_on_bedrock helper. Per-model gating now flows
through the centralized AnthropicConfig._validate_effort_for_model
(whose _supports_effort_level already strips Bedrock prefixes), so the
chat completion, /v1/messages, and Bedrock Converse paths can't drift
when a new gated effort tier is added.
xAI's chat completions API reports reasoning_tokens separately from
completion_tokens but rolls them into total_tokens. This breaks the
OpenAI invariant total_tokens == prompt_tokens + completion_tokens
that downstream consumers (including litellm's own _usage_format_tests
in tests/llm_translation/base_llm_unit_tests.py:58) rely on.
Live capture (grok-3-mini-beta, 2026-05-04):
    prompt=14, completion=10, total=336, reasoning=312
    14 + 10 = 24, NOT 336 (24 + 312 reasoning = 336).
OpenAI's o1/o3 reasoning models include reasoning_tokens in
completion_tokens, leaving the prompt+completion=total invariant
intact. xAI deviates. This patch aligns xAI to OpenAI semantics by
folding reasoning_tokens into completion_tokens after the parent
OpenAI parser runs.
The fold is idempotent and defensive:
- Only fires when total_tokens == prompt_tokens + completion_tokens
+ reasoning_tokens (the documented xAI shape). Refuses to fold if
the gap doesn't match, guarding against silent corruption when xAI
changes accounting.
- Skips if completion_tokens already covers the gap (already
normalised — e.g. cost calc replays a previously-folded Usage).
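The two guards above can be sketched with a small dataclass. `Usage` here is a simplified stand-in with a flat `reasoning_tokens` field, and `fold_reasoning_tokens` is a hypothetical name; the real transformation operates on litellm's own Usage object:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Usage:
    """Simplified stand-in for the real usage block."""
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    reasoning_tokens: int = 0


def fold_reasoning_tokens(usage: Usage) -> Usage:
    """Fold reasoning tokens into completion tokens when the gap matches."""
    if usage.reasoning_tokens <= 0:
        return usage  # nothing to fold
    visible = usage.prompt_tokens + usage.completion_tokens
    if usage.total_tokens == visible:
        return usage  # already normalised, so a second fold is a no-op
    if usage.total_tokens != visible + usage.reasoning_tokens:
        return usage  # gap is not explained by reasoning tokens: refuse
    return replace(
        usage,
        completion_tokens=usage.completion_tokens + usage.reasoning_tokens,
    )


# Values from the live capture quoted above.
raw = Usage(prompt_tokens=14, completion_tokens=10, total_tokens=336, reasoning_tokens=312)
folded = fold_reasoning_tokens(raw)
print(folded.completion_tokens)  # 322, and 14 + 322 == 336
```

Applying the function to `folded` again returns it unchanged, which is the idempotency property the bullets describe.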
xai.cost_calculator.cost_per_token already added reasoning_tokens to
the visible completion count for billing. After the fold,
completion_tokens already includes reasoning_tokens, so the cost calc
would double-bill. Updated cost_per_token to detect the OpenAI-normalised
shape (total == prompt + completion) and skip the reasoning add-on
in that case, falling through to the legacy raw-shape behaviour for
callers that bypass the transformation (e.g. proxy log replay).
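A sketch of the shape detection, reduced to the token count that gets billed. `billable_completion_tokens` is a hypothetical helper; the real `cost_per_token` also applies per-token prices, which are omitted here:

```python
def billable_completion_tokens(
    prompt_tokens: int,
    completion_tokens: int,
    total_tokens: int,
    reasoning_tokens: int,
) -> int:
    """Return the completion-side token count to bill, for either usage shape."""
    if total_tokens == prompt_tokens + completion_tokens:
        # OpenAI-normalised shape: reasoning is already inside completion_tokens.
        return completion_tokens
    # Legacy raw xAI shape: reasoning is reported outside completion_tokens.
    return completion_tokens + reasoning_tokens


# The same live capture, before and after the fold, bills the same count.
print(billable_completion_tokens(14, 10, 336, 312))   # raw shape -> 322
print(billable_completion_tokens(14, 322, 336, 312))  # folded shape -> 322
```

Both shapes resolve to 322 billable tokens, so a replayed raw log and a freshly transformed response are charged identically.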
Tests:
- Adds TestXAIReasoningTokenFolding covering: gap-explained-fold,
idempotent-no-double-fold, no-reasoning-skip, gap-mismatch-skip.
- Adds test_already_normalised_usage_does_not_double_count_reasoning
to lock the cost-calc idempotency.
- Updates 7 pre-existing cost-calc tests whose total_tokens was
internally inconsistent (used the OpenAI-normalised total but kept
reasoning_tokens external) to use the documented xAI raw shape
total = prompt + visible completion + reasoning. Pre-existing
values masked the missing-fold by accident.
Verified end-to-end against the live xAI API:
    LITELLM_LOCAL_MODEL_COST_MAP=False (CI default) +
    XAI_API_KEY set +
    pytest tests/llm_translation/test_xai.py::TestXAIChat::test_prompt_caching
    -> PASSED in 18.81s (was: AssertionError on
       usage.total_tokens == usage.prompt_tokens + usage.completion_tokens)
20/20 tests in tests/test_litellm/llms/xai/test_xai_cost_calculator.py
and 8/8 in tests/test_litellm/llms/xai/test_xai_chat_transformation.py
pass.
Apply the same fix to the three Dockerfiles not in the release pipeline
today (alpine, dev, health_check) so they stay correct if/when they're
built for arm64 in the future.
Wolfi pins are not present in these files; the python:3.11-alpine and
python:3.13-slim digests they already use are multi-arch indexes that
include arm64/v8, so only the uv pin needed swapping.
The previous pins resolved to single-platform amd64 manifests, so buildx
pulled the same amd64 base for both linux/amd64 and linux/arm64 targets.
The published OCI index then advertised an arm64 entry whose layers are
byte-identical to amd64 -- arm64 users got an amd64 binary.
Switch all three Dockerfiles to the multi-arch image-index digests:
- cgr.dev/chainguard/wolfi-base (index has linux/amd64 + linux/arm64)
- ghcr.io/astral-sh/uv:0.11.7 (index has linux/amd64 + linux/arm64)
Resolved with `docker buildx imagetools inspect <ref>` -- that returns
the index digest. `docker pull` + `docker inspect` returns the per-host
platform digest, which is what slipped in last time.
CI runs without LITELLM_LOCAL_MODEL_COST_MAP=True, so litellm.model_cost
is loaded from main-branch JSON (default model_cost_map_url) instead of
the PR's checked-out model_prices_and_context_window.json. Tests that
assert per-model flags added in this PR (supports_max_reasoning_effort,
supports_xhigh_reasoning_effort) therefore pass locally but fail in CI
with 'AssertionError: assert False is True' on 5 cases:
- test_anthropic_model_supports_effort_param_recognizes_supporting_models
[anthropic.claude-mythos-preview, bedrock/.../mythos-preview,
claude-opus-4-5-20251101]
- test_supports_effort_level_handles_provider_prefixes
[bedrock/invoke/us.anthropic.claude-sonnet-4-6-max-True,
claude-sonnet-4-6-max-True]
Add an autouse fixture at tests/test_litellm/llms/anthropic/chat/conftest.py
that monkey-patches litellm.model_cost to the PR-local JSON for every test
in this directory. The parent conftest already snapshots+restores
litellm.model_cost per-function, so the mutation is contained.
This is a scoped workaround. The proper fix is to set the env var
globally in the test workflow once the ~10 test files that set it
inline are audited; that is tracked as a follow-up issue.
`non_default_params.get("reasoning_effort")` returns `Any | None`,
but `REASONING_EFFORT_TO_OUTPUT_CONFIG_EFFORT.get()` expects `str`.
Mypy flagged this on the strict pass. Narrow with `isinstance` before
the lookup; non-strings fall through to the existing `BadRequestError`
below with a clean validation message, so behavior is unchanged.
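A minimal sketch of the narrowing, assuming a placeholder mapping; `lookup_effort` is a hypothetical name and the mapping contents are not the real `REASONING_EFFORT_TO_OUTPUT_CONFIG_EFFORT`:

```python
from typing import Any, Dict, Optional

# Placeholder mapping used only for this illustration.
EFFORT_MAP: Dict[str, str] = {"low": "low", "medium": "medium", "high": "high"}


def lookup_effort(non_default_params: Dict[str, Any]) -> Optional[str]:
    raw = non_default_params.get("reasoning_effort")  # typed Any | None
    if isinstance(raw, str):
        # Inside this branch raw is narrowed to str, which matches the
        # mapping's key type.
        return EFFORT_MAP.get(raw)
    # Non-strings (None, ints, dicts) yield None, so the caller's existing
    # validation error path handles them.
    return None
```

The guard also avoids a runtime `TypeError` for unhashable inputs such as a dict, which a bare `.get(raw)` would raise.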
Fixes a regression introduced by 1a10746e95 in this PR.
The chat completion path (`_apply_output_config`) and the /v1/messages
pass-through (`AnthropicMessagesConfig._translate_reasoning_effort_to_anthropic`)
both gate `max` / `xhigh` per model. The two sites had diverged from
near-identical copies into separately maintained blocks, creating a real
drift risk when a new model tier (e.g. Claude 4.8) lands -- a contributor
could update one site and miss the other.
Centralise the gating in `AnthropicConfig._validate_effort_for_model`,
which returns an error message string or `None`. Each call site keeps
its own provider-appropriate exception type (`BadRequestError` for the
chat path, `AnthropicError` for the /v1/messages pass-through) but the
gating decision now comes from one place. Net -11 LOC.
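The split between one shared decision and two exception types can be sketched as below. The support table, model ids, error text, and function names are all hypothetical; the exception classes are stand-ins for the real ones:

```python
from typing import Dict, Optional, Set


class BadRequestError(Exception):
    """Stand-in for the chat-path exception."""


class AnthropicError(Exception):
    """Stand-in for the /v1/messages pass-through exception."""


GATED_EFFORTS = {"max", "xhigh"}

# Hypothetical support table keyed by made-up model ids.
TOY_SUPPORT: Dict[str, Set[str]] = {
    "toy-model-a": {"max", "xhigh"},
    "toy-model-b": set(),
}


def validate_effort_for_model(model: str, effort: str) -> Optional[str]:
    """Return an error message if the effort is gated for this model, else None."""
    if effort not in GATED_EFFORTS:
        return None  # lower efforts are never gated in this sketch
    if effort in TOY_SUPPORT.get(model, set()):
        return None
    return f"effort '{effort}' is not supported on model '{model}'"


def chat_path(model: str, effort: str) -> str:
    error = validate_effort_for_model(model, effort)
    if error is not None:
        raise BadRequestError(error)
    return effort


def messages_path(model: str, effort: str) -> str:
    error = validate_effort_for_model(model, effort)
    if error is not None:
        raise AnthropicError(error)
    return effort
```

Returning a message rather than raising keeps the helper free of provider-specific exception imports, so each call site chooses how to surface the failure.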
Adds a parametrised unit test exercising the helper directly across
4.5 / 4.6 / 4.7 model families and `max` / `xhigh` / lower-effort
inputs. Existing tests at both call sites continue to pass unchanged.
Addresses Greptile finding on PR #27074.