Hey Vitalii! 👋
Please find below my proposal (the write-up was drafted iteratively with AI assistance and then polished; the implementation itself is tested) for propagating response usage information to higher layers, e.g., your lm-proxy. :-)
Best regards,
Daniel
Feature Request: Collect Usage Data from Streaming Responses
Problem
Currently, when using streaming mode with OpenAI-compatible APIs, microcore's LLMResponse doesn't include usage statistics (prompt_tokens, completion_tokens, total_tokens, and provider-specific fields like cost). This makes it impossible to track token consumption and costs in streaming mode, which is essential for monitoring and logging purposes.
This issue came up while working on lm-proxy#3 where we needed to log usage data for both streaming and non-streaming requests.
Real-World Use Case: lm-proxy
In lm-proxy (an OpenAI-compatible proxy for routing requests to multiple LLM providers), we need to log token usage and costs for both streaming and non-streaming requests.
```toml
[loggers.entry_transformer]
class = 'lm_proxy.loggers.LogEntryTransformer'
request = "request"
response = "response"
response_raw = "response.raw"
response_dict = "response"
# Non-streaming: response.usage.* (OpenAI response object)
completion_tokens = "response.usage.completion_tokens"
prompt_tokens = "response.usage.prompt_tokens"
total_tokens = "response.usage.total_tokens"
cost = "response.usage.cost"
# Streaming: response.* (LLMResponse direct attributes)
completion_tokens_stream = "response.completion_tokens"
prompt_tokens_stream = "response.prompt_tokens"
total_tokens_stream = "response.total_tokens"
cost_stream = "response.cost"
#prompt = "request.messages"
group = "group"
connection = "connection"
api_key_id = "api_key_id"
remote_addr = "remote_addr"
created_at = "created_at"
duration = "duration"
```
Without usage data in streaming responses, tracking costs and monitoring token consumption is impossible for streaming requests, which leaves a major gap in production logging and billing systems.
Background
According to the OpenAI API specification, all responses include a usage object with token statistics:
```json
{
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 10,
    "total_tokens": 29,
    "completion_tokens_details": {...},
    "prompt_tokens_details": {...}
  }
}
```
For streaming responses, OpenAI sends this usage data in the final chunk, after all content has been streamed (OpenAI Cookbook: How to stream completions):
"You can get token usage statistics for your streamed response by setting `stream_options={"include_usage": True}`. When you do so, an extra chunk will be streamed as the final chunk. You can access the usage data for the entire request via the `usage` field on this chunk."
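For reference, requesting the usage chunk with the official openai Python client looks roughly like this (a minimal sketch, assuming an OpenAI-compatible endpoint and a configured API key; the model name is just an example):
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY (and optionally base_url) is configured

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,
    stream_options={"include_usage": True},  # ask for a final usage-only chunk
)

for chunk in stream:
    # Content chunks carry text deltas; the final usage chunk has an empty choices list.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    if chunk.usage:
        print("\n", chunk.usage.model_dump())
```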
Proposed Solution
Collect the usage data from the final stream chunk and pass it to LLMResponse so it's available for logging and monitoring, just like with non-streaming responses.
Implementation
The fix involves updating both streaming functions in microcore/llm/openai.py to capture usage data from the last chunk:
Async Version (_a_process_streamed_response)
```python
async def _a_process_streamed_response(
    response,
    callbacks: list[callable],
    chat_model_used: bool,
    hidden_output_begin: str | None = None,
    hidden_output_end: str | None = None,
):
    response_text: str = ""
    hiding: bool = False
    need_to_hide = hidden_output_begin and hidden_output_end
    last_chunk_data = {}  # NEW: holds usage data from the final chunk
    async for chunk in response:
        # NEW: collect all data from the final chunk's usage object
        if usage := getattr(chunk, 'usage', None):
            last_chunk_data = usage.model_dump()
        if text_chunk := _get_chunk_text(chunk, chat_model_used):
            if need_to_hide:
                if text_chunk == hidden_output_begin:
                    hiding = True
                    continue
                if hiding:
                    if text_chunk == hidden_output_end:
                        hiding = False
                        text_chunk = ""
                    else:
                        continue
            response_text += text_chunk
            for cb in callbacks:
                if asyncio.iscoroutinefunction(cb):
                    await cb(text_chunk)
                else:
                    cb(text_chunk)
    return LLMResponse(response_text, last_chunk_data)  # NEW: pass the captured usage data
```
Sync Version (_process_streamed_response)
```python
def _process_streamed_response(
    response,
    callbacks: list[callable],
    chat_model_used: bool,
    hidden_output_begin: str | None = None,
    hidden_output_end: str | None = None,
):
    response_text: str = ""
    is_hiding: bool = False
    need_to_hide = hidden_output_begin and hidden_output_end
    last_chunk_data = {}  # NEW: holds usage data from the final chunk
    for chunk in response:
        # NEW: collect all data from the final chunk's usage object
        if usage := getattr(chunk, 'usage', None):
            last_chunk_data = usage.model_dump()
        if text_chunk := _get_chunk_text(chunk, chat_model_used):
            if need_to_hide:
                if text_chunk == hidden_output_begin:
                    is_hiding = True
                    continue
                if is_hiding:
                    if text_chunk == hidden_output_end:
                        is_hiding = False
                        text_chunk = ""
                    else:
                        continue
            response_text += text_chunk
            [cb(text_chunk) for cb in callbacks]
    return LLMResponse(response_text, last_chunk_data)  # NEW: pass the captured usage data
```
Key Changes
- Initialize `last_chunk_data = {}` at the start of streaming
- Capture usage data from chunks using `usage.model_dump()` (Pydantic's standard method)
- Pass the collected data to the `LLMResponse` constructor
- Works with any OpenAI-compatible API (OpenAI, OpenRouter, Anthropic, etc.)
- Automatically includes all fields: standard ones (`prompt_tokens`, `completion_tokens`, `total_tokens`) and provider-specific extensions (`cost`, `cost_details`, etc.)
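To make the intent concrete, here is a tiny self-contained sketch of the attribute exposure that the lm-proxy logging config above relies on (`response.total_tokens`, `response.cost`, ...). The `StandInResponse` class is purely illustrative, not microcore's actual `LLMResponse`; it only assumes that the attrs dict passed to the constructor ends up as attributes on the response, as with non-streaming responses:
```python
class StandInResponse(str):
    """Illustrative stand-in: a string that also exposes a dict of attrs as attributes."""

    def __new__(cls, text: str, attrs: dict | None = None):
        obj = super().__new__(cls, text)
        for key, value in (attrs or {}).items():
            setattr(obj, key, value)
        return obj


usage = {"prompt_tokens": 19, "completion_tokens": 10, "total_tokens": 29, "cost": 0.000123}
response = StandInResponse("Hello world", usage)

assert response == "Hello world"    # still behaves like the response text
assert response.total_tokens == 29  # what the logger reads via "response.total_tokens"
```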
Benefits
- ✅ Consistent behavior: Streaming responses now include usage data, just like non-streaming ones
- ✅ OpenAI standard compliant: Uses Pydantic's `model_dump()` method
- ✅ Provider agnostic: Works with OpenAI, OpenRouter, and any OpenAI-compatible API
- ✅ Future-proof: Automatically captures new fields as providers add them
- ✅ Essential for monitoring: Enables proper cost tracking and token usage logging
Testing
Tested successfully with:
- OpenRouter API (streaming + non-streaming)
- All usage fields correctly captured: `prompt_tokens`, `completion_tokens`, `total_tokens`, `cost`
- Works with both async and sync versions
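If it helps with review, here is a minimal, self-contained sanity check of the capture logic; the chunk and usage objects are simple stand-ins for the OpenAI client types, not real API objects:
```python
from types import SimpleNamespace


class FakeUsage:
    def model_dump(self):
        return {"prompt_tokens": 19, "completion_tokens": 10, "total_tokens": 29, "cost": 0.000123}


def fake_stream():
    # Content chunks carry text but no usage; the final chunk carries only usage.
    yield SimpleNamespace(usage=None, text="Hello ")
    yield SimpleNamespace(usage=None, text="world")
    yield SimpleNamespace(usage=FakeUsage(), text=None)


last_chunk_data = {}
response_text = ""
for chunk in fake_stream():
    if usage := getattr(chunk, "usage", None):
        last_chunk_data = usage.model_dump()
    if chunk.text:
        response_text += chunk.text

assert response_text == "Hello world"
assert last_chunk_data["total_tokens"] == 29
```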
Would love to see this merged! It's a small change that makes streaming responses feature-complete. 🎯
Let me know if you need any adjustments or have questions!