Hey Vitalii! 👋
Please find below my proposal (the write-up was drafted iteratively with AI assistance and then polished; the implementation itself is tested) for propagating response usage information to higher layers, e.g., your lm-proxy. :-)
Best regards,
Daniel
Feature Request: Collect Usage Data from Streaming Responses
Problem
Currently, when using streaming mode with OpenAI-compatible APIs, microcore's LLMResponse doesn't include usage statistics (prompt_tokens, completion_tokens, total_tokens, and provider-specific fields like cost). This makes it impossible to track token consumption and costs in streaming mode, which is essential for monitoring and logging purposes.
This issue came up while working on lm-proxy#3 where we needed to log usage data for both streaming and non-streaming requests.
Real-World Use Case: lm-proxy
In lm-proxy (an OpenAI-compatible proxy for routing requests to multiple LLM providers), we need to log token usage and costs for both streaming and non-streaming requests.
```toml
[loggers.entry_transformer]
class = 'lm_proxy.loggers.LogEntryTransformer'
request = "request"
response = "response"
response_raw = "response.raw"
response_dict = "response"
# Non-streaming: response.usage.* (OpenAI response object)
completion_tokens = "response.usage.completion_tokens"
prompt_tokens = "response.usage.prompt_tokens"
total_tokens = "response.usage.total_tokens"
cost = "response.usage.cost"
# Streaming: response.* (LLMResponse direct attributes)
completion_tokens_stream = "response.completion_tokens"
prompt_tokens_stream = "response.prompt_tokens"
total_tokens_stream = "response.total_tokens"
cost_stream = "response.cost"
#prompt = "request.messages"
group = "group"
connection = "connection"
api_key_id = "api_key_id"
remote_addr = "remote_addr"
created_at = "created_at"
duration = "duration"
```
Without usage data in streaming responses, tracking costs and monitoring token consumption is impossible for streaming requests, which leaves a major gap in production logging and billing systems.
Background
According to the OpenAI API specification, all responses include a usage object with token statistics:
```json
{
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 10,
    "total_tokens": 29,
    "completion_tokens_details": {...},
    "prompt_tokens_details": {...}
  }
}
```
For streaming responses, OpenAI sends this usage data in the final chunk, after all content has been streamed (OpenAI Cookbook: How to stream completions):
"You can get token usage statistics for your streamed response by setting `stream_options={"include_usage": True}`. When you do so, an extra chunk will be streamed as the final chunk. You can access the usage data for the entire request via the `usage` field on this chunk."
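For reference, requesting the usage chunk with the official openai Python client looks roughly like this (a minimal sketch, assuming an OpenAI-compatible endpoint and a configured API key; the model name is just an example):
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY (and optionally base_url) is configured

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,
    stream_options={"include_usage": True},  # ask for a final usage-only chunk
)

for chunk in stream:
    # Content chunks carry text deltas; the final usage chunk has an empty choices list.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    if chunk.usage:
        print("\n", chunk.usage.model_dump())
```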
Proposed Solution
Collect the usage data from the final stream chunk and pass it to LLMResponse so it's available for logging and monitoring, just like with non-streaming responses.
Implementation
The fix involves updating both streaming functions in microcore/llm/openai.py to capture usage data from the last chunk:
Async Version (_a_process_streamed_response)
```python
async def _a_process_streamed_response(
    response,
    callbacks: list[callable],
    chat_model_used: bool,
    hidden_output_begin: str | None = None,
    hidden_output_end: str | None = None,
):
    response_text: str = ""
    hiding: bool = False
    need_to_hide = hidden_output_begin and hidden_output_end
    last_chunk_data = {}  # NEW: holds usage data from the final chunk
    async for chunk in response:
        # NEW: collect all data from the final chunk's usage object
        if usage := getattr(chunk, 'usage', None):
            last_chunk_data = usage.model_dump()
        if text_chunk := _get_chunk_text(chunk, chat_model_used):
            if need_to_hide:
                if text_chunk == hidden_output_begin:
                    hiding = True
                    continue
                if hiding:
                    if text_chunk == hidden_output_end:
                        hiding = False
                        text_chunk = ""
                    else:
                        continue
            response_text += text_chunk
            for cb in callbacks:
                if asyncio.iscoroutinefunction(cb):
                    await cb(text_chunk)
                else:
                    cb(text_chunk)
    return LLMResponse(response_text, last_chunk_data)  # NEW: pass the captured usage data
```
Sync Version (_process_streamed_response)
```python
def _process_streamed_response(
    response,
    callbacks: list[callable],
    chat_model_used: bool,
    hidden_output_begin: str | None = None,
    hidden_output_end: str | None = None,
):
    response_text: str = ""
    is_hiding: bool = False
    need_to_hide = hidden_output_begin and hidden_output_end
    last_chunk_data = {}  # NEW: holds usage data from the final chunk
    for chunk in response:
        # NEW: collect all data from the final chunk's usage object
        if usage := getattr(chunk, 'usage', None):
            last_chunk_data = usage.model_dump()
        if text_chunk := _get_chunk_text(chunk, chat_model_used):
            if need_to_hide:
                if text_chunk == hidden_output_begin:
                    is_hiding = True
                    continue
                if is_hiding:
                    if text_chunk == hidden_output_end:
                        is_hiding = False
                        text_chunk = ""
                    else:
                        continue
            response_text += text_chunk
            [cb(text_chunk) for cb in callbacks]
    return LLMResponse(response_text, last_chunk_data)  # NEW: pass the captured usage data
```
Key Changes
- Initialize `last_chunk_data = {}` at the start of streaming
- Capture usage data from chunks using `usage.model_dump()` (Pydantic's standard method)
- Pass the collected data to the `LLMResponse` constructor
- Works with any OpenAI-compatible API (OpenAI, OpenRouter, Anthropic, etc.)
- Automatically includes all fields: standard ones (`prompt_tokens`, `completion_tokens`, `total_tokens`) and provider-specific extensions (`cost`, `cost_details`, etc.)
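To make the intent concrete, here is a tiny self-contained sketch of the attribute exposure that the lm-proxy logging config above relies on (`response.total_tokens`, `response.cost`, ...). The `StandInResponse` class is purely illustrative, not microcore's actual `LLMResponse`; it only assumes that the attrs dict passed to the constructor ends up as attributes on the response, as with non-streaming responses:
```python
class StandInResponse(str):
    """Illustrative stand-in: a string that also exposes a dict of attrs as attributes."""

    def __new__(cls, text: str, attrs: dict | None = None):
        obj = super().__new__(cls, text)
        for key, value in (attrs or {}).items():
            setattr(obj, key, value)
        return obj


usage = {"prompt_tokens": 19, "completion_tokens": 10, "total_tokens": 29, "cost": 0.000123}
response = StandInResponse("Hello world", usage)

assert response == "Hello world"    # still behaves like the response text
assert response.total_tokens == 29  # what the logger reads via "response.total_tokens"
```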
Benefits
- ✅ Consistent behavior: Streaming responses now include usage data, just like non-streaming ones
- ✅ OpenAI standard compliant: Uses Pydantic's `model_dump()` method
- ✅ Provider agnostic: Works with OpenAI, OpenRouter, and any OpenAI-compatible API
- ✅ Future-proof: Automatically captures new fields as providers add them
- ✅ Essential for monitoring: Enables proper cost tracking and token usage logging
Testing
Tested successfully with:
- OpenRouter API (streaming + non-streaming)
- All usage fields correctly captured: `prompt_tokens`, `completion_tokens`, `total_tokens`, `cost`
- Works with both async and sync versions
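If it helps with review, here is a minimal, self-contained sanity check of the capture logic; the chunk and usage objects are simple stand-ins for the OpenAI client types, not real API objects:
```python
from types import SimpleNamespace


class FakeUsage:
    def model_dump(self):
        return {"prompt_tokens": 19, "completion_tokens": 10, "total_tokens": 29, "cost": 0.000123}


def fake_stream():
    # Content chunks carry text but no usage; the final chunk carries only usage.
    yield SimpleNamespace(usage=None, text="Hello ")
    yield SimpleNamespace(usage=None, text="world")
    yield SimpleNamespace(usage=FakeUsage(), text=None)


last_chunk_data = {}
response_text = ""
for chunk in fake_stream():
    if usage := getattr(chunk, "usage", None):
        last_chunk_data = usage.model_dump()
    if chunk.text:
        response_text += chunk.text

assert response_text == "Hello world"
assert last_chunk_data["total_tokens"] == 29
```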
Would love to see this merged! It's a small change that makes streaming responses feature-complete. 🎯
Let me know if you need any adjustments or have questions!