fix(register_model): preserve built-in cache pricing when registering custom overrides under unmapped keys (#30044)

* fix(spend-tracking): fall back to direct spend-counter increment when reservation reconcile fails

When the reservation-reconcile path in `_reconcile_budget_reservation_for_counter_update`
hits a Redis error, it now correctly returns an empty set so that
`increment_spend_counters` re-runs the direct increment for the affected counters.
Previously, the function logged the failure, invalidated the reserved counters, and
still returned the reserved counter keys, which caused the caller to skip the direct
increment. With the increment skipped and the counter deleted, the next request
reseeded the counter from `LiteLLM_VerificationToken.spend`, a column the batched
flusher only updates every few seconds, so the enforced cross-pod spend value
collapsed to a stale snapshot and budget gating stopped firing for affected keys.

Adds a regression test that exercises the failure path with a flaky Redis backend
and asserts the actual response cost lands in the shared counter.
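
The contract described above can be sketched in isolation. This is a simplified model, not the proxy's code: `FlakyStore`, `reconcile_reservation`, and `increment_counters` are hypothetical names, and the store is a plain in-memory dict whose first increment fails. It only illustrates why returning an empty set on failure lets the caller's direct increment run.

```python
import asyncio
from typing import Dict, Iterable, Set


class FlakyStore:
    """Hypothetical stand-in for a shared counter store; the first increment fails."""

    def __init__(self) -> None:
        self.data: Dict[str, float] = {}
        self.calls = 0

    async def increment(self, key: str, value: float) -> float:
        self.calls += 1
        if self.calls == 1:
            raise TimeoutError("simulated Redis timeout")
        self.data[key] = self.data.get(key, 0.0) + value
        return self.data[key]

    async def delete(self, key: str) -> None:
        self.data.pop(key, None)


async def reconcile_reservation(
    store: FlakyStore, reserved: Dict[str, float], cost: float
) -> Set[str]:
    """Return the keys that were reconciled; an empty set means none were."""
    try:
        for key, reserved_cost in reserved.items():
            # Adjust the reserved amount to the actual cost.
            await store.increment(key, cost - reserved_cost)
    except Exception:
        # Invalidate the reserved counters, then report that nothing was handled.
        for key in reserved:
            await store.delete(key)
        return set()
    return set(reserved)


async def increment_counters(
    store: FlakyStore, keys: Iterable[str], reserved: Dict[str, float], cost: float
) -> None:
    handled = await reconcile_reservation(store, reserved, cost)
    for key in keys:
        if key not in handled:
            # Direct increment for every counter the reconcile step did not cover.
            await store.increment(key, cost)


def run_demo() -> float:
    store = FlakyStore()
    key = "spend:key:demo"
    store.data[key] = 0.5  # reserved cost already in the counter
    asyncio.run(increment_counters(store, [key], {key: 0.5}, 1.0))
    return store.data[key]
```

If `reconcile_reservation` returned `set(reserved)` from the failure branch instead, the direct increment would be skipped and the counter would stay deleted, which is the pre-fix behavior.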

* fix(register_model): preserve built-in cache pricing when registering custom overrides under unmapped keys

When a custom-priced model is registered under a key shape that
get_model_info cannot resolve (e.g. litellm_params.model set to
bedrock/bedrock/us.anthropic.claude-sonnet-4-6 or another non-canonical
alias), register_model previously fell back to an empty existing_model.
The merged entry then carried only the fields the user set explicitly
(input/output cost, provider) and dropped cache pricing. Downstream, the
cost calculator defaulted cache_creation_input_token_cost and
cache_read_input_token_cost to 0, silently dropping the bulk of the bill
for cache-heavy Anthropic traffic.
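
To make the size of the dropout concrete, here is a small arithmetic sketch. The prices are made up for illustration and are not taken from the cost map; the token counts mirror the shape used in the regression test (100 text tokens, 800 cache-read tokens, 200 cache-creation tokens). `input_cost` is a hypothetical helper, not the cost calculator.

```python
from typing import Dict

# Hypothetical per-token prices, chosen only to make the arithmetic visible.
FULL_PRICES: Dict[str, float] = {
    "input_cost_per_token": 3e-06,
    "cache_read_input_token_cost": 3e-07,
    "cache_creation_input_token_cost": 3.75e-06,
}
# What a partial override looks like when cache pricing is dropped.
PARTIAL_PRICES: Dict[str, float] = {"input_cost_per_token": 3e-06}


def input_cost(
    prices: Dict[str, float],
    text_tokens: int,
    cache_read_tokens: int,
    cache_creation_tokens: int,
) -> float:
    """Sum the three input-side components; missing cache prices count as 0."""
    return (
        text_tokens * prices["input_cost_per_token"]
        + cache_read_tokens * prices.get("cache_read_input_token_cost", 0.0)
        + cache_creation_tokens * prices.get("cache_creation_input_token_cost", 0.0)
    )


billed_full = input_cost(FULL_PRICES, 100, 800, 200)
billed_partial = input_cost(PARTIAL_PRICES, 100, 800, 200)
dropped_share = 1 - billed_partial / billed_full
```

Under these made-up prices, roughly three quarters of the input cost comes from cache tokens, so an entry without cache pricing bills only the remaining quarter.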

register_model now attempts to resolve a canonical built-in entry by
stripping provider prefixes, region prefixes, and provider-specific
suffixes before giving up. When a variant resolves, its defaults
(notably cache pricing) are inherited while the user's explicit overrides
still win. When nothing resolves and the user supplied no cache pricing,
it logs a warning instead of silently under-billing.
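
The lookup order can be sketched as follows. This is a reduced illustration under stated assumptions: `KNOWN_PROVIDERS`, `REGION_PREFIXES`, and `COST_MAP` are tiny hypothetical stand-ins for the real provider set, region list, and cost map, the model name is invented, and the provider-specific suffix stripping step is left out.

```python
from typing import Dict, List, Optional

# Illustrative data only; the real provider set and cost map live in litellm.
KNOWN_PROVIDERS = {"bedrock", "openai", "anthropic"}
REGION_PREFIXES = ("us.", "eu.", "apac.", "global.")
COST_MAP: Dict[str, dict] = {
    "anthropic.claude-demo": {  # hypothetical model name
        "litellm_provider": "bedrock",
        "cache_read_input_token_cost": 3e-07,
    }
}


def candidate_keys(key: str) -> List[str]:
    """Build lookup candidates, most specific first."""
    segments = key.split("/")
    idx = 0
    # Drop leading provider segments, but always keep the final segment.
    while idx < len(segments) - 1 and segments[idx] in KNOWN_PROVIDERS:
        idx += 1
    base = "/".join(segments[idx:])
    candidates = [base]
    for prefix in REGION_PREFIXES:
        if base.startswith(prefix):
            candidates.append(base[len(prefix):])
    return candidates


def resolve_builtin(key: str) -> Optional[dict]:
    """Return a copy of the first matching entry, or None when nothing matches."""
    for candidate in candidate_keys(key):
        entry = COST_MAP.get(candidate)
        if entry is not None:
            return dict(entry)  # copy, so callers cannot mutate the shared map
    return None
```

For example, `candidate_keys("bedrock/bedrock/us.anthropic.claude-demo")` yields the region-qualified name first and the region-stripped name second, so a region-specific entry would win over the generic one if both existed.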

* fix(router): inherit built-in cache pricing on deployments with partial custom pricing

A deployment configured with only input_cost_per_token and output_cost_per_token
under model_info was being registered under its model_info.id with no cache cost
fields. The cost calculator then defaulted cache_creation_input_token_cost and
cache_read_input_token_cost to 0, silently billing cache_read and cache_creation
tokens at zero. For cache-heavy Anthropic traffic, this drops the bulk of the bill.

When the deployment's litellm_params.model resolves to a built-in cost-map entry,
pull the cache pricing fields from there before registering. User-specified
cache fields still win on merge; only missing fields are inherited.

Pairs with the register_model fallback added earlier in this branch: that
handles unmapped key shapes like bedrock/bedrock/x, this handles deploy-id
keys whose backend model is mapped.
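
The merge rule (user values win, only missing fields are inherited) can be sketched without the router. `inherit_missing_cache_pricing` is a hypothetical helper operating on plain dicts, with a shortened field list and made-up prices; unlike the router helper in this commit, it returns a new dict instead of mutating its argument.

```python
from typing import Optional

# Shortened field list for the sketch; the commit covers more cache fields.
CACHE_FIELDS = (
    "cache_creation_input_token_cost",
    "cache_read_input_token_cost",
)


def inherit_missing_cache_pricing(model_info: dict, builtin: Optional[dict]) -> dict:
    """Fill cache fields that are missing or None from the built-in entry."""
    merged = dict(model_info)
    if builtin is None:
        # No canonical entry for the backend model: nothing to inherit.
        return merged
    for field in CACHE_FIELDS:
        if merged.get(field) is None and builtin.get(field) is not None:
            merged[field] = builtin[field]
    return merged


# Made-up prices for illustration.
builtin_entry = {
    "input_cost_per_token": 3e-06,
    "cache_creation_input_token_cost": 3.75e-06,
    "cache_read_input_token_cost": 3e-07,
}
deployment_info = {
    "input_cost_per_token": 5e-06,
    "cache_read_input_token_cost": 1e-06,  # explicit user value, must survive
}
merged_info = inherit_missing_cache_pricing(deployment_info, builtin_entry)
```

Here `merged_info` keeps the user's input price and explicit cache-read price, and picks up only the cache-creation price from the built-in entry.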

* fix(register_model): inherit only cache pricing on unmapped-key fallback, not provider

The unmapped-key fallback in register_model copied the entire resolved
built-in entry, so registering openai/command-r-plus inherited the cohere
built-in's litellm_provider and get_model_info(custom_llm_provider=openai)
could no longer resolve it. Restrict the fallback to the cache-pricing
fields, matching the router-side _inherit_builtin_cache_pricing, so the
cache-cost dropout stays fixed without clobbering the registered provider.
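
The difference between the two fallback shapes can be sketched with plain dicts. This is an illustration only: both helper names are hypothetical, the values are made up, and it assumes an override that does not itself carry `litellm_provider`, which may not match every path through `register_model`.

```python
CACHE_FIELDS = (
    "cache_creation_input_token_cost",
    "cache_read_input_token_cost",
)


def inherit_whole_entry(builtin: dict, override: dict) -> dict:
    """Earlier shape: every built-in field becomes a default, provider included."""
    return {**builtin, **override}


def inherit_cache_fields_only(builtin: dict, override: dict) -> dict:
    """Restricted shape: only cache pricing is taken from the built-in entry."""
    defaults = {f: builtin[f] for f in CACHE_FIELDS if builtin.get(f) is not None}
    return {**defaults, **override}


# Made-up values for illustration.
BUILTIN = {"litellm_provider": "cohere", "cache_read_input_token_cost": 1e-07}
OVERRIDE = {"input_cost_per_token": 1e-06}

whole = inherit_whole_entry(BUILTIN, OVERRIDE)
restricted = inherit_cache_fields_only(BUILTIN, OVERRIDE)
```

With the whole-entry merge, the result carries the built-in's `cohere` provider; with the restricted merge, the provider is left for the registration logic to set while the cache price is still inherited.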

Add a direct unit test for Router._inherit_builtin_cache_pricing so the
router coverage check sees it, and pin the fixed spend-counter contract:
when reservation reconcile fails the counter must hold the directly
incremented cost rather than being left at None.
Yassin Kortam 2026-06-10 12:11:03 -07:00 committed by GitHub
parent a75ed0079c
commit 410b892f77
8 changed files with 480 additions and 9 deletions

View File

@ -2266,7 +2266,7 @@ async def _reconcile_budget_reservation_for_counter_update(
         )
     except Exception:
         verbose_proxy_logger.warning(
-            "Failed to reconcile budget reservation after persisted spend; invalidating reserved counters and continuing",
+            "Failed to reconcile budget reservation after persisted spend; invalidating reserved counters and falling back to direct increment",
             exc_info=True,
         )
         try:
@ -2277,6 +2277,7 @@ async def _reconcile_budget_reservation_for_counter_update(
             verbose_proxy_logger.exception(
                 "Failed to invalidate reserved counters after reservation reconciliation failed"
             )
+        return set()
     return reserved_counter_keys

View File

@ -7788,6 +7788,39 @@ class Router:
return hash_object.hexdigest()
@staticmethod
def _inherit_builtin_cache_pricing(
model_info: dict, backend_model: str, custom_llm_provider: Optional[str]
) -> None:
"""Fill missing cache pricing on a custom-priced deployment entry from
the backend model's built-in cost map entry, so a deployment that
only spells out ``input_cost_per_token``/``output_cost_per_token``
does not silently bill cache_read/cache_creation at 0.
User-specified cache fields always win; only ``None``/missing entries
are inherited. No-op when the backend model has no canonical entry.
"""
cache_fields = (
"cache_creation_input_token_cost",
"cache_creation_input_token_cost_above_1hr",
"cache_creation_input_token_cost_above_200k_tokens",
"cache_read_input_token_cost",
"cache_read_input_token_cost_above_200k_tokens",
)
if all(model_info.get(f) is not None for f in cache_fields):
return
try:
backend_info = litellm.get_model_info(
model=backend_model, custom_llm_provider=custom_llm_provider
)
except Exception:
return
for field in cache_fields:
if model_info.get(field) is None:
backend_value = backend_info.get(field)
if backend_value is not None:
model_info[field] = backend_value
def _create_deployment(
self,
deployment_info: dict,
@ -7816,6 +7849,13 @@ class Router:
if deployment.litellm_params.get(field) is not None:
_model_info[field] = deployment.litellm_params[field]
if _model_info.get("input_cost_per_token") is not None:
Router._inherit_builtin_cache_pricing(
model_info=_model_info,
backend_model=deployment.litellm_params.model,
custom_llm_provider=deployment.litellm_params.custom_llm_provider,
)
## REGISTER MODEL INFO IN LITELLM MODEL COST MAP
model_id = deployment.model_info.id
if model_id is not None:
@ -8562,6 +8602,13 @@ class Router:
if field_value is not None:
_model_info_dict[field] = field_value
if _model_info_dict.get("input_cost_per_token") is not None:
Router._inherit_builtin_cache_pricing(
model_info=_model_info_dict,
backend_model=deployment.litellm_params.model,
custom_llm_provider=deployment.litellm_params.custom_llm_provider,
)
# Register custom pricing in litellm.model_cost.
# Mirrors _create_deployment() logic to ensure dynamically-added deployments
# (e.g., loaded from DB) also have their custom pricing registered.

View File

@ -2887,6 +2887,61 @@ def _convert_stringified_numbers(value):
return value
_BEDROCK_REGION_PREFIXES = (
"us.",
"eu.",
"apac.",
"jp.",
"au.",
"us-gov.",
"global.",
"ap-northeast-1.",
)
_CACHE_PRICING_FIELDS = (
"cache_creation_input_token_cost",
"cache_creation_input_token_cost_above_1hr",
"cache_creation_input_token_cost_above_200k_tokens",
"cache_read_input_token_cost",
"cache_read_input_token_cost_above_200k_tokens",
)
def _resolve_builtin_model_cost_entry(
key: str, provider: str
) -> Optional[Dict[str, Any]]:
"""Best-effort lookup of a built-in ``model_cost`` entry for a custom key
whose shape ``get_model_info`` cannot resolve (double provider prefixes
like ``bedrock/bedrock/us.anthropic.claude-sonnet-4-6`` or region aliases).
Returns a copy of the matching entry so the caller can inherit its defaults
(most importantly cache pricing) without mutating the shared built-in.
Returns ``None`` when no safe match exists.
"""
candidates: List[str] = []
segments = key.split("/")
idx = 0
while idx < len(segments) - 1 and segments[idx] in LlmProvidersSet:
idx += 1
candidates.append("/".join(segments[idx:]))
base = candidates[-1] if candidates else key
for region_prefix in _BEDROCK_REGION_PREFIXES:
if base.startswith(region_prefix):
candidates.append(base[len(region_prefix) :])
if provider:
stripped = _strip_model_name(model=base, custom_llm_provider=provider)
if stripped != base:
candidates.append(stripped)
for candidate in candidates:
entry = litellm.model_cost.get(candidate)
if entry is not None and entry.get("litellm_provider") is not None:
return dict(entry)
return None
def register_model(model_cost: Union[str, dict]): # noqa: PLR0915
"""
Register new / Override existing models (and their pricing) to specific providers.
@ -2933,6 +2988,26 @@ def register_model(model_cost: Union[str, dict]): # noqa: PLR0915
except Exception:
existing_model = {}
model_cost_key = key
builtin_entry = _resolve_builtin_model_cost_entry(
key=_key_str, provider=provider
)
if builtin_entry is not None:
for field in _CACHE_PRICING_FIELDS:
if (
value.get(field) is None
and builtin_entry.get(field) is not None
):
existing_model[field] = builtin_entry[field]
elif (
value.get("cache_creation_input_token_cost") is None
and value.get("cache_read_input_token_cost") is None
):
verbose_logger.warning(
f"register_model: model={key} not in built-in cost map and no "
"prefix/region variant matched; cache cost fields will default "
"to 0. To track cache cost, add cache_creation_input_token_cost "
"and cache_read_input_token_cost to model_info"
)
# ``get_model_info`` returns ``litellm_provider: None`` when the
# provider is unknown (e.g. custom deployments registered via
# ``Router.add_deployment``). Persisting that None into

View File

@ -192,8 +192,9 @@ async def test_reconcile_budget_reservation_for_counter_update_returns_empty_set
 async def test_reconcile_budget_reservation_for_counter_update_failure_invalidates(
     monkeypatch,
 ):
-    """Reservation reconcile raising must invalidate reserved counters but
-    not propagate the exception."""
+    """Reservation reconcile raising must invalidate reserved counters, swallow
+    the exception, and return an empty set so the caller falls back to the
+    direct spend-counter increment instead of skipping it."""
     import litellm.proxy.spend_tracking.budget_reservation as br
     monkeypatch.setattr(
@ -213,7 +214,7 @@ async def test_reconcile_budget_reservation_for_counter_update_failure_invalidat
         budget_reservation={"foo": "bar"}, response_cost=1.0
     )
-    assert result == {"spend:key:abc"}
+    assert result == set()
     assert fake_invalidate.called is True

View File

@ -0,0 +1,87 @@
"""
Regression test for enforced-spend underreporting when Redis fails during the
budget-reservation reconcile step of ``increment_spend_counters``.
Production failure mode: a managed Redis returns an intermittent timeout on the
reconcile increment. Reconcile deletes (invalidates) the shared counter and
gives up, but ``increment_spend_counters`` still treats the counter as
"already reconciled" and skips the direct increment. The actual call cost never
lands in the enforced counter, so budgets stop gating until the next cold
reseed pulls a lagging value from the DB.
The fix makes the reconcile path fall back to the direct increment when it
fails, so the actual cost is always written to the shared counter.
"""
import pytest
from litellm.caching import DualCache
from litellm.proxy import proxy_server
class _FlakyRedisCache:
def __init__(self) -> None:
self._store: dict = {}
self._increment_calls = 0
async def async_increment(self, key, value, **kwargs):
self._increment_calls += 1
if self._increment_calls == 1:
raise Exception("Redis timeout")
self._store[key] = float(self._store.get(key, 0.0)) + float(value)
return self._store[key]
async def async_get_cache(self, key, *args, **kwargs):
return self._store.get(key)
async def async_delete_cache(self, key, *args, **kwargs):
self._store.pop(key, None)
async def async_set_cache(self, key, value, *args, **kwargs):
self._store[key] = float(value)
return True
@pytest.mark.asyncio
async def test_direct_increment_runs_when_reservation_reconcile_hits_redis_failure(
monkeypatch,
):
hashed_token = "hashed_test_token"
counter_key = f"spend:key:{hashed_token}"
reserved_cost = 0.5
response_cost = 1.0
flaky_redis = _FlakyRedisCache()
flaky_redis._store[counter_key] = reserved_cost
monkeypatch.setattr(proxy_server, "prisma_client", None)
monkeypatch.setattr(proxy_server, "user_api_key_cache", DualCache())
monkeypatch.setattr(proxy_server.spend_counter_cache, "redis_cache", flaky_redis)
proxy_server.spend_counter_cache.in_memory_cache.set_cache(
key=counter_key, value=reserved_cost
)
budget_reservation = {
"reserved_cost": reserved_cost,
"finalized": False,
"entries": [
{
"counter_key": counter_key,
"entity_type": "Key",
"entity_id": hashed_token,
"reserved_cost": reserved_cost,
"applied_adjustment": 0.0,
}
],
}
await proxy_server.increment_spend_counters(
token=hashed_token,
team_id=None,
user_id=None,
response_cost=response_cost,
budget_reservation=budget_reservation,
)
enforced_spend = await flaky_redis.async_get_cache(key=counter_key)
assert enforced_spend == response_cost

View File

@ -6609,7 +6609,12 @@ async def test_increment_spend_counters_finalizes_none_cost_reservation():
 @pytest.mark.asyncio
-async def test_increment_spend_counters_invalidates_bad_reserved_counter_without_failing():
+async def test_increment_spend_counters_falls_back_to_direct_increment_on_bad_reserved_counter():
+    """When the reservation reconcile fails, the reserved counters are
+    invalidated and the actual response cost must still be written via the
+    direct increment fallback. Leaving the counter at ``None`` lets the next
+    request reseed a stale value from the DB and silently stops budget gating,
+    which is the bug this fix addresses."""
     from litellm.caching.dual_cache import DualCache
     from litellm.proxy.proxy_server import increment_spend_counters
@ -6650,7 +6655,7 @@ async def test_increment_spend_counters_invalidates_bad_reserved_counter_without
             counter_cache.in_memory_cache.get_cache(
                 key="spend:key:key-bad-reserved-counter"
             )
-            is None
+            == 0.25
         )
     finally:
         ps.spend_counter_cache = orig_counter

View File

@ -301,6 +301,126 @@ def test_register_model_strips_none_litellm_provider_from_get_model_info(monkeyp
litellm.model_cost.pop(model_key, None)
def test_register_model_inherits_builtin_cache_pricing_for_unmapped_key():
"""Registering a custom override under a key shape that
``get_model_info`` cannot resolve (e.g. a double provider prefix like
``bedrock/bedrock/us.anthropic.claude-sonnet-4-6``) must still inherit
the built-in cache pricing for the underlying model.
Before the fix ``register_model`` fell back to an empty ``existing_model``
so the merged entry only carried the fields the user set explicitly
(input/output cost). ``cache_creation_input_token_cost`` and
``cache_read_input_token_cost`` were absent, and the cost calculator
silently charged 0 for every cache token, dropping the bulk of the bill
for cache-heavy Anthropic traffic.
Regression for the cache-pricing dropout under partial overrides.
"""
from litellm.litellm_core_utils.llm_cost_calc.utils import generic_cost_per_token
from litellm.types.utils import PromptTokensDetailsWrapper, Usage
original_model_cost = litellm.model_cost
os.environ["LITELLM_LOCAL_MODEL_COST_MAP"] = "True"
litellm.model_cost = litellm.get_model_cost_map(url="")
builtin_key = "us.anthropic.claude-sonnet-4-6"
registered_key = f"bedrock/bedrock/{builtin_key}"
builtin = litellm.model_cost[builtin_key]
assert builtin["cache_creation_input_token_cost"] > 0
assert builtin["cache_read_input_token_cost"] > 0
try:
litellm.register_model(
{
registered_key: {
"input_cost_per_token": builtin["input_cost_per_token"],
"output_cost_per_token": builtin["output_cost_per_token"],
"litellm_provider": "bedrock",
}
}
)
registered = litellm.model_cost[registered_key]
assert (
registered.get("cache_creation_input_token_cost")
== builtin["cache_creation_input_token_cost"]
)
assert (
registered.get("cache_read_input_token_cost")
== builtin["cache_read_input_token_cost"]
)
assert registered["litellm_provider"] == "bedrock"
usage = Usage(
prompt_tokens=1100,
completion_tokens=100,
total_tokens=1200,
prompt_tokens_details=PromptTokensDetailsWrapper(
cached_tokens=800,
text_tokens=100,
),
cache_creation_input_tokens=200,
)
input_cost, output_cost = generic_cost_per_token(
model=registered_key,
usage=usage,
custom_llm_provider="bedrock",
)
text_only_cost = builtin["input_cost_per_token"] * 100
expected_input_cost = (
text_only_cost
+ builtin["cache_read_input_token_cost"] * 800
+ builtin["cache_creation_input_token_cost"] * 200
)
assert abs(input_cost - expected_input_cost) < 1e-12
assert abs(output_cost - builtin["output_cost_per_token"] * 100) < 1e-12
assert input_cost > text_only_cost + 1e-12
finally:
litellm.model_cost.pop(registered_key, None)
litellm.model_cost = original_model_cost
os.environ.pop("LITELLM_LOCAL_MODEL_COST_MAP", None)
from litellm.utils import _invalidate_model_cost_lowercase_map
_invalidate_model_cost_lowercase_map()
def test_register_model_warns_when_no_builtin_match_for_cache_pricing(caplog):
"""When a custom override is registered under a key that neither
``get_model_info`` nor any prefix/region variant can resolve to a
built-in entry, ``register_model`` must warn that cache cost fields will
default to 0 instead of silently producing an under-billed entry.
"""
import logging
from litellm._logging import verbose_logger
registered_key = "bedrock/totally-made-up-model-alias-xyz"
litellm.model_cost.pop(registered_key, None)
try:
with caplog.at_level(logging.WARNING, logger=verbose_logger.name):
litellm.register_model(
{
registered_key: {
"input_cost_per_token": 0.001,
"output_cost_per_token": 0.002,
"litellm_provider": "bedrock",
}
}
)
assert any(
registered_key in record.message
and "cache_creation_input_token_cost" in record.message
for record in caplog.records
), "expected a warning naming the unmapped key and the cache cost fields"
finally:
litellm.model_cost.pop(registered_key, None)
def test_register_model_router_add_deployment_custom_pricing_applies():
"""End-to-end regression for https://github.com/BerriAI/litellm/issues/28336.
@ -344,9 +464,9 @@ def test_register_model_router_add_deployment_custom_pricing_applies():
f"{model_key} / {deployment_model}"
)
         for k in registered_keys:
-            assert _check_provider_match(litellm.model_cost[k], "openai") is True, (
-                f"custom pricing for {k} was dropped by _check_provider_match"
-            )
+            assert (
+                _check_provider_match(litellm.model_cost[k], "openai") is True
+            ), f"custom pricing for {k} was dropped by _check_provider_match"
     finally:
         litellm.model_cost.pop(model_key, None)
         litellm.model_cost.pop(deployment_model, None)

View File

@ -402,3 +402,138 @@ def test_should_not_downgrade_chatgpt_shared_key_mode_with_alias_override():
assert bridge_model_info["mode"] == "responses"
finally:
_restore_model_cost_entries(model_keys)
def test_partial_custom_pricing_inherits_builtin_cache_pricing():
"""A deployment that overrides only input/output cost on a cache-supporting
model must still bill cache_read and cache_creation tokens. Before the
fix the deploy-id entry was registered with the user's two fields and
nothing else, so the cost calculator silently billed cache tokens at 0.
Regression for the prompt-caching cost dropout reported by the customer.
"""
backend_model = "anthropic/claude-sonnet-4-5-20250929"
deploy_id = "claude-deploy-partial-pricing"
builtin_info = litellm.get_model_info(model=backend_model)
builtin_cache_create = builtin_info["cache_creation_input_token_cost"]
builtin_cache_read = builtin_info["cache_read_input_token_cost"]
assert builtin_cache_create is not None and builtin_cache_create > 0
assert builtin_cache_read is not None and builtin_cache_read > 0
model_keys = {
deploy_id: litellm.model_cost.get(deploy_id),
backend_model: copy.deepcopy(litellm.model_cost.get(backend_model)),
}
try:
Router(
model_list=[
{
"model_name": "claude-custom",
"litellm_params": {
"model": backend_model,
"api_key": "fake-key",
},
"model_info": {
"id": deploy_id,
"input_cost_per_token": 0.000003,
"output_cost_per_token": 0.000015,
},
}
],
)
entry = litellm.model_cost[deploy_id]
assert entry["input_cost_per_token"] == 0.000003
assert entry["output_cost_per_token"] == 0.000015
assert entry.get("cache_creation_input_token_cost") == builtin_cache_create
assert entry.get("cache_read_input_token_cost") == builtin_cache_read
finally:
_restore_model_cost_entries(model_keys)
def test_partial_pricing_does_not_overwrite_explicit_cache_fields():
"""When the user explicitly sets cache_*_input_token_cost on a deployment,
those values must not be replaced by the built-in fallback.
"""
backend_model = "anthropic/claude-sonnet-4-5-20250929"
deploy_id = "claude-deploy-explicit-cache"
explicit_cache_create = 0.00001
explicit_cache_read = 0.0000005
builtin_info = litellm.get_model_info(model=backend_model)
assert builtin_info["cache_creation_input_token_cost"] != explicit_cache_create
assert builtin_info["cache_read_input_token_cost"] != explicit_cache_read
model_keys = {
deploy_id: litellm.model_cost.get(deploy_id),
backend_model: copy.deepcopy(litellm.model_cost.get(backend_model)),
}
try:
Router(
model_list=[
{
"model_name": "claude-custom-explicit",
"litellm_params": {
"model": backend_model,
"api_key": "fake-key",
},
"model_info": {
"id": deploy_id,
"input_cost_per_token": 0.000003,
"output_cost_per_token": 0.000015,
"cache_creation_input_token_cost": explicit_cache_create,
"cache_read_input_token_cost": explicit_cache_read,
},
}
],
)
entry = litellm.model_cost[deploy_id]
assert entry.get("cache_creation_input_token_cost") == explicit_cache_create
assert entry.get("cache_read_input_token_cost") == explicit_cache_read
finally:
_restore_model_cost_entries(model_keys)
def test_inherit_builtin_cache_pricing_fills_only_missing_fields():
"""Direct unit test of the helper: missing cache fields are filled from the
backend model's built-in entry, while an explicitly set cache field and the
user's input/output pricing are left untouched.
"""
backend_model = "anthropic/claude-sonnet-4-5-20250929"
builtin_info = litellm.get_model_info(model=backend_model)
builtin_cache_create = builtin_info["cache_creation_input_token_cost"]
builtin_cache_read = builtin_info["cache_read_input_token_cost"]
assert builtin_cache_create is not None and builtin_cache_create > 0
assert builtin_cache_read is not None and builtin_cache_read > 0
explicit_cache_read = builtin_cache_read + 1
model_info = {
"input_cost_per_token": 0.000003,
"cache_read_input_token_cost": explicit_cache_read,
}
Router._inherit_builtin_cache_pricing(
model_info=model_info,
backend_model=backend_model,
custom_llm_provider="anthropic",
)
assert model_info["input_cost_per_token"] == 0.000003
assert model_info["cache_read_input_token_cost"] == explicit_cache_read
assert model_info["cache_creation_input_token_cost"] == builtin_cache_create
def test_inherit_builtin_cache_pricing_noop_for_unknown_backend():
"""No canonical entry for the backend model means the helper leaves the
passed-in dict unchanged rather than raising.
"""
model_info = {"input_cost_per_token": 0.000003}
Router._inherit_builtin_cache_pricing(
model_info=model_info,
backend_model="this-backend-model-does-not-exist-x9y8z7",
custom_llm_provider=None,
)
assert model_info == {"input_cost_per_token": 0.000003}