litellm/scripts
Yassin Kortam 2eab9ee2c0
perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths (#28289)
* perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths

- Introduce pure-text fast-path in `_build_complete_streaming_response` that collapses O(N) `content_block_delta` events into a single equivalent SSE event before conversion, eliminating per-output-token Pydantic `ModelResponseStream` construction; non-text streams (tool_use, thinking, citations) fall back to the unchanged legacy path
- Skip agentic streaming wrapper entirely when no callback overrides `async_should_run_agentic_loop`; the wrapper buffered every chunk and rebuilt the SSE response only to call hooks that all return `(False, {})` — a pure no-op for the default config
- Serialize the request body once (`json.dumps`) and reuse the result for both the pre-call log input and the wire, instead of serializing twice; this avoids a second full O(payload) serialization pass per request, which is significant for long-context Claude Code histories
- Add fast path in `async_streaming_data_generator` that bypasses the per-chunk `async_post_call_streaming_hook` coroutine await, response-string materialization, and cost-injection call when no callback/guardrail/cost-injection is active (the default config)
- Resolve `_DD_STREAMING_TRACE_ENABLED` once at import time; eliminate per-chunk `NullSpan` context manager allocation when Datadog tracing is disabled (the default)
- Memoize `get_type_hints(AnthropicMessagesRequestOptionalParams)` with `@lru_cache(maxsize=1)` — resolves once per process instead of once per `/v1/messages` request (~80µs each)
- Hoist `cost_injection_active` out of the per-chunk loop in `chunk_processor`; eliminates repeated `getattr` + endpoint-type checks on every streamed byte chunk
- Extract `_build_passthrough_logging_result` from `_route_streaming_logging_to_handler` as a standalone static method to facilitate future off-loop dispatch
- Convert `async_sse_data_generator` from an `async for: yield` trampoline to a direct return of the underlying generator, removing one async-generator layer per streamed chunk
- Skip redundant `strip_empty_text_blocks_from_anthropic_messages` scan in `anthropic_messages_handler` when the async wrapper already sanitized (signalled via `_litellm_messages_presanitized` sentinel, popped before reaching provider params)
- Gate debug log `f-string` evaluation behind `isEnabledFor(DEBUG)` in both the streaming generator and the transformation layer to avoid serializing entire message payloads on every request at non-debug log levels
- Add benchmark script (`scripts/benchmark_anthropic_messages_perf.py`) with a local mock Anthropic SSE provider for reproducible TTFT and TPM measurement across commits/branches
- Add parity tests asserting fast-path and legacy-path produce byte-identical logged/billed payloads, plus unit tests for agentic hook detection, pre-serialized body reuse, and memoized key resolution
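
The pure-text fast path described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `collapse_text_deltas` is a hypothetical helper, events are assumed to be already-parsed dicts in the Anthropic SSE shape, and returning `None` stands in for "fall back to the legacy path".

```python
from typing import Optional


def collapse_text_deltas(events: list[dict]) -> Optional[list[dict]]:
    """Merge each run of text_delta events into one event.

    Hypothetical helper: returns None when a non-text delta is seen,
    which the caller would treat as "use the legacy path".
    """
    collapsed: list[dict] = []
    merged: Optional[dict] = None  # the event currently absorbing text
    for event in events:
        if event.get("type") != "content_block_delta":
            merged = None  # any other event ends the current run
            collapsed.append(event)
            continue
        delta = event.get("delta", {})
        if delta.get("type") != "text_delta":
            return None  # tool_use / thinking / citations: not pure text
        if merged is None:
            merged = {
                "type": "content_block_delta",
                "index": event.get("index"),
                "delta": {"type": "text_delta", "text": ""},
            }
            collapsed.append(merged)
        merged["delta"]["text"] += delta.get("text", "")
    return collapsed


def _delta(text: str) -> dict:
    return {
        "type": "content_block_delta",
        "index": 0,
        "delta": {"type": "text_delta", "text": text},
    }


stream = [
    {"type": "message_start"},
    {"type": "content_block_start", "index": 0},
    _delta("Hel"),
    _delta("lo"),
    _delta("!"),
    {"type": "content_block_stop", "index": 0},
    {"type": "message_stop"},
]
result = collapse_text_deltas(stream)
tool_stream = [
    {
        "type": "content_block_delta",
        "index": 0,
        "delta": {"type": "input_json_delta", "partial_json": "{"},
    }
]
fallback = collapse_text_deltas(tool_stream)
```

With the sample stream above, the three deltas become a single event carrying the concatenated text, so the downstream conversion runs once per text run rather than once per output token.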

* perf: address greptile review for anthropic streaming hot path

- Bail to legacy in `_collapse_pure_text_chunks` when content_block_delta
  events from different block indexes are observed without an intervening
  flush. Anthropic sends blocks strictly sequentially, but defensive bail
  prevents silent text-merging if the protocol ever interleaves.
- Replace leaf-class `__dict__` check for `async_post_call_streaming_hook`
  in `_callback_capabilities` with a function-identity comparison that
  walks the MRO. A vendor base class can carry the override while the
  registered subclass adds nothing of its own; before this PR the hook
  was invoked unconditionally, so missing an inherited override would
  silently drop the hook on the streaming path.
- Add unit tests for both behaviors.
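
The difference between the leaf-class `__dict__` check and the function-identity comparison can be illustrated with a small sketch. The class names and the `overrides_hook` helper below are hypothetical stand-ins, not the actual `_callback_capabilities` code.

```python
class BaseLogger:
    """Stand-in for the framework base class with a default no-op hook."""

    async def async_post_call_streaming_hook(self, response):
        return response


class VendorBase(BaseLogger):
    """Vendor base class that carries the real override."""

    async def async_post_call_streaming_hook(self, response):
        return f"vendor:{response}"


class RegisteredCallback(VendorBase):
    """Registered leaf class that adds nothing of its own."""


class PlainCallback(BaseLogger):
    """Callback that keeps the default hook."""


HOOK = "async_post_call_streaming_hook"


def overrides_hook(callback: object, base: type, name: str) -> bool:
    # getattr on the class resolves through the MRO, so an override
    # inherited from an intermediate base class is still found.
    impl = getattr(type(callback), name, None)
    return impl is not None and impl is not getattr(base, name)


leaf_dict_check = HOOK in type(RegisteredCallback()).__dict__
identity_check = overrides_hook(RegisteredCallback(), BaseLogger, HOOK)
plain_check = overrides_hook(PlainCallback(), BaseLogger, HOOK)
```

Here the leaf-class check reports no override for `RegisteredCallback`, while the identity comparison detects the one inherited from `VendorBase`.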

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(mypy): narrow model_name to str in cost-injection branch

The hoisted cost_injection_active flag in chunk_processor encodes the
`bool(model_name)` requirement but mypy can't track that invariant
through the local, so the per-chunk `_process_chunk_with_cost_injection(
chunk, model_name)` calls flagged Optional[str] vs str. Pin a typed
non-None local inside the cost-injection branch so mypy narrows
correctly without changing runtime behavior.
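
A minimal sketch of the narrowing pattern, assuming a simplified generator: `inject_cost` and `process_chunks` are hypothetical names, and the real `chunk_processor` differs in detail.

```python
from typing import Iterable, Iterator, Optional


def inject_cost(chunk: bytes, model_name: str) -> bytes:
    # Hypothetical stand-in for the per-chunk cost-injection call,
    # which requires a non-optional model name.
    return chunk + b"|" + model_name.encode()


def process_chunks(
    chunks: Iterable[bytes], model_name: Optional[str], enabled: bool
) -> Iterator[bytes]:
    # Hoisted flag: computed once instead of once per chunk.
    cost_injection_active = enabled and bool(model_name)
    if cost_injection_active:
        # mypy cannot infer from the boolean flag that model_name is
        # non-None, so pin a typed local. The `or ""` branch is never
        # taken at runtime because the flag already implies truthiness.
        narrowed_model_name: str = model_name or ""
        for chunk in chunks:
            yield inject_cost(chunk, narrowed_model_name)
    else:
        yield from chunks


with_cost = list(process_chunks([b"a", b"b"], "claude", True))
without_model = list(process_chunks([b"a"], None, True))
```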

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 12:15:59 -07:00
| Name | Last commit | Date |
| --- | --- | --- |
| adaptive_router_demo | feat: commit new adaptive routing | 2026-04-18 21:29:39 -07:00 |
| health_check | style: run black formatter on files from main merge | 2026-04-17 13:02:59 -07:00 |
| benchmark_anthropic_messages_perf.py | perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths (#28289) | 2026-05-23 12:15:59 -07:00 |
| benchmark_chat_completions_perf.py | perf: eliminate per-request callback scanning on proxy hot path (#27858) | 2026-05-14 09:28:31 -07:00 |
| benchmark_mock.py | style: run black formatter on files from main merge | 2026-04-17 13:02:59 -07:00 |
| benchmark_proxy_vs_provider.py | style: run black formatter on files from main merge | 2026-04-17 13:02:59 -07:00 |
| create_litellm_branch.ps1 | feat: add script to create branches with litellm_ prefix (#17606) | 2025-12-06 10:41:39 -08:00 |
| create_litellm_branch.sh | enhance: create_litellm_branch tool to be more robust (#17874) | 2025-12-12 05:35:50 -08:00 |
| create_team_key_and_submit_guardrail.sh | feat(guardrails): team-based guardrail registration and approval workflow (#22459) | 2026-03-02 22:06:49 -08:00 |
| eval_compression.py | Prompt Compression - add it to the proxy (#25729) | 2026-04-20 15:08:00 -07:00 |
| install.sh | build: migrate packaging, CI, and Docker from Poetry to uv (#25007) | 2026-04-09 11:46:23 -07:00 |
| mock_bedrock_passthrough_target.py | Refactor Bedrock response stream shape handling (#27257) | 2026-05-06 17:39:38 -07:00 |
| mock_grayswan_timeout_server.py | implement failopen option default to True on grayswan guardrail (#18266) | 2026-01-06 15:17:05 +05:30 |
| mutation_report.py | ci: add manually-triggered mutation testing workflow (#27576) | 2026-05-11 15:19:57 -07:00 |
| test_agent_mcp_endpoints.sh | Agents - assign tools (#22064) | 2026-02-25 11:44:30 -08:00 |
| test_guardrails_register_endpoints.sh | feat(guardrails): team-based guardrail registration and approval workflow (#22459) | 2026-03-02 22:06:49 -08:00 |
| test_tool_allowlist_script.py | style: run black formatter on files from main merge | 2026-04-17 13:02:59 -07:00 |
| tpm_headline_test.sh | fix: atomic TPM rate limit (#27001) | 2026-05-05 16:58:07 -07:00 |
| verify_adaptive_router.py | feat: add adaptive routing to litellm | 2026-04-18 16:35:17 -07:00 |