
refactor(streaming): remove stream_usage and fix streaming metadata capture#1624

Open
Pouyanpi wants to merge 1 commit into develop from refactor/streaming-metadata-capture

Conversation

@Pouyanpi Pouyanpi commented Feb 6, 2026

Summary

  • Remove the redundant stream_usage=True from _prepare_model_kwargs (OpenAI auto-enables it; ChatNVIDIA ignores it) and the defensive kwargs.pop("stream_usage", None) in langchain_initializer.py
  • Fix _stream_llm_call to properly capture metadata from streaming chunks using both response_metadata and usage_metadata (previously it read only response_metadata, which is empty during streaming)
  • Accumulate metadata across chunks, since OpenAI sends response_metadata and usage_metadata on separate final chunks (see the sketch after this list)
  • Rename include_generation_metadata to include_metadata with graceful deprecation (warning now, removal in 0.22.0)
  • Rename the internal field generation_info to metadata across the streaming handler, runnable rails, and tests
  • Rename test utility constants for clarity (_TEST_PROVIDERS_WITH_TOKEN_USAGE_SUPPORT → _TEST_PROVIDERS_WITH_TOKEN_USAGE)
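
A minimal sketch of that accumulation pattern, assuming LangChain-style chunk attributes. The _extract_chunk_metadata name matches the helper this PR adds; the surrounding function and handler API are illustrative, not the exact implementation:

# Illustrative sketch only -- the actual implementation lives in
# nemoguardrails/actions/llm/utils.py. Attribute names follow
# LangChain's AIMessageChunk.

def _extract_chunk_metadata(chunk) -> dict:
    """Collect whatever metadata this particular chunk carries."""
    metadata = {}
    # response_metadata (model name, finish reason, ...) typically
    # arrives only on a final chunk during streaming.
    if getattr(chunk, "response_metadata", None):
        metadata["response_metadata"] = chunk.response_metadata
    # usage_metadata (token counts) can arrive on a *different* final
    # chunk, so it must be checked independently.
    if getattr(chunk, "usage_metadata", None):
        metadata["usage_metadata"] = chunk.usage_metadata
    return metadata


async def stream_with_accumulation(llm, messages, handler):
    """Merge metadata across chunks instead of overwriting it."""
    accumulated_metadata = {}
    async for chunk in llm.astream(messages):
        chunk_metadata = _extract_chunk_metadata(chunk)
        # update() merges, so metadata arriving on separate final
        # chunks (OpenAI's pattern) ends up in a single dict.
        accumulated_metadata.update(chunk_metadata)
        await handler.push_chunk(chunk.content, chunk_metadata)
    return accumulated_metadata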

Test plan

  • poetry run pytest tests/test_streaming_handler.py — 33 passed
  • poetry run pytest tests/runnable_rails/ — 173 passed
  • poetry run pytest tests/ -k "stream" — 134 passed
  • E2E verification (save the script below as verify.py and run with poetry run python verify.py)
  • Manual streaming chat: poetry run nemoguardrails chat --config=./examples/bots/hello_world --streaming
  • Manual streaming with output rails (requires NVIDIA_API_KEY):
    sed -i '' '/check output/a\
        streaming:\
          enabled: True' ./examples/configs/nemoguards/config.yml
    poetry run nemoguardrails chat --config=./examples/configs/nemoguards --streaming
    git checkout ./examples/configs/nemoguards/config.yml
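
# verify.py -- the E2E verification script referenced in the test plan above.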
import asyncio
import os
import sys
import warnings

from nemoguardrails import LLMRails, RailsConfig


async def main():
    if not os.environ.get("OPENAI_API_KEY"):
        sys.exit("Set OPENAI_API_KEY first")

    config = RailsConfig.from_content(
        config={
            "models": [{"type": "main", "engine": "openai", "model": "gpt-4o"}],
            "streaming": True,
        },
    )

    rails = LLMRails(config)
    messages = [{"role": "user", "content": "Say hello in one word."}]

    # 1. Basic streaming returns plain strings
    chunks = []
    async for chunk in rails.stream_async(messages=messages):
        assert isinstance(chunk, str), f"Expected str, got {type(chunk)}"
        chunks.append(chunk)
    assert chunks and "".join(chunks), "No chunks received"

    # 2. Metadata streaming returns dicts with response_metadata + usage_metadata
    chunks = []
    async for chunk in rails.stream_async(messages=messages, include_metadata=True):
        assert isinstance(chunk, dict) and "text" in chunk
        chunks.append(chunk)
    metadata_chunks = [c for c in chunks if "metadata" in c]
    assert metadata_chunks, "No metadata in any chunk"
    meta = metadata_chunks[-1]["metadata"]
    assert "response_metadata" in meta, "Missing response_metadata"
    assert "usage_metadata" in meta, "Missing usage_metadata"

    # 3. Deprecated param still works and warns
    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always")
        async for _ in rails.stream_async(messages=messages, include_generation_metadata=True):
            pass
        assert any(issubclass(x.category, DeprecationWarning) for x in w)

    # 4. Non-streaming still works
    result = await rails.generate_async(messages=messages)
    assert result is not None

    print("All tests passed")


if __name__ == "__main__":
    asyncio.run(main())


github-actions bot commented Feb 6, 2026

Documentation preview

https://nvidia-nemo.github.io/Guardrails/review/pr-1624


codecov bot commented Feb 6, 2026

Codecov Report

❌ Patch coverage is 82.97872% with 8 lines in your changes missing coverage. Please review.

File                                    Patch %   Missing lines
nemoguardrails/actions/llm/utils.py     73.33%    4
nemoguardrails/rails/llm/llmrails.py    60.00%    2
nemoguardrails/streaming.py             90.90%    2


@Pouyanpi Pouyanpi force-pushed the refactor/streaming-metadata-capture branch from 7565f67 to 12411c8 on February 6, 2026 at 12:54
@Pouyanpi Pouyanpi marked this pull request as ready for review February 6, 2026 12:58

greptile-apps bot commented Feb 6, 2026

Greptile Summary

This PR refactors streaming metadata handling by removing redundant stream_usage=True parameter (OpenAI enables it automatically, ChatNVIDIA ignores it) and fixing metadata capture during streaming.

Key Changes:

  • Fixed _stream_llm_call to properly extract both response_metadata and usage_metadata from streaming chunks (previously only response_metadata was captured, and it is empty during streaming)
  • Added metadata accumulation across chunks to handle OpenAI's pattern of sending response_metadata and usage_metadata in separate final chunks
  • Renamed include_generation_metadata → include_metadata with graceful deprecation (warning now, removal planned for 0.22.0); a sketch of this shim pattern follows this list
  • Removed the redundant stream_usage=True from model initialization and the defensive kwargs.pop("stream_usage", None) cleanup code
  • Updated all internal references from generation_info → metadata across the streaming handler, runnable rails, and tests
  • Fixed END_OF_STREAM handling to set text to an empty string instead of keeping the sentinel value
  • Added comprehensive test coverage for the metadata accumulation behavior
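
A minimal sketch of that deprecation shim, assuming the parameter names described above; the function body and exact signature are illustrative (the real one lives on LLMRails.stream_async):

import warnings
from typing import Optional

# Illustrative shim only -- not the actual LLMRails method.
async def stream_async(
    messages,
    include_metadata: bool = False,
    include_generation_metadata: Optional[bool] = None,
):
    if include_generation_metadata is not None:
        # Old callers keep working until 0.22.0, but see a warning.
        warnings.warn(
            "include_generation_metadata is deprecated and will be removed "
            "in 0.22.0; use include_metadata instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        include_metadata = include_generation_metadata
    # ... streaming logic uses include_metadata from here on ...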

The refactor is well-designed with proper backward compatibility through deprecation warnings, comprehensive test coverage (33 passing streaming tests, 173 runnable rails tests), and clear documentation updates.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The refactoring is well-executed with comprehensive test coverage (33 streaming tests and 173 runnable rails tests, all passing), proper backward compatibility through deprecation warnings, clear documentation, and focused changes that address a specific bug (missing usage_metadata during streaming). The removal of the redundant stream_usage parameter is justified and safe.
  • No files require special attention

Important Files Changed

  • nemoguardrails/actions/llm/utils.py: fixed _stream_llm_call to properly extract and accumulate both response_metadata and usage_metadata from chunks using the new _extract_chunk_metadata helper
  • nemoguardrails/streaming.py: renamed include_generation_metadata to include_metadata with a deprecation warning, renamed generation_info to metadata, fixed END_OF_STREAM handling to set text to an empty string
  • nemoguardrails/rails/llm/llmrails.py: removed the redundant stream_usage=True from _prepare_model_kwargs, updated stream_async to use the include_metadata parameter with a deprecation warning for the old one
  • nemoguardrails/integrations/langchain/runnable_rails.py: switched to include_metadata, renamed generation_info to metadata in chunk processing, added empty-string filtering
  • tests/test_streaming_handler.py: updated tests to use metadata instead of generation_info, added test_metadata_accumulation_across_chunks to verify accumulation behavior

Sequence Diagram

sequenceDiagram
    participant User
    participant LLMRails
    participant StreamingHandler
    participant _stream_llm_call
    participant LLM
    
    User->>LLMRails: stream_async(messages, include_metadata=True)
    LLMRails->>StreamingHandler: new StreamingHandler(include_metadata=True)
    LLMRails->>LLMRails: generate_async(streaming_handler)
    LLMRails->>_stream_llm_call: call with handler
    
    loop For each chunk from LLM
        _stream_llm_call->>LLM: llm.astream(messages)
        LLM-->>_stream_llm_call: chunk (with content)
        _stream_llm_call->>_stream_llm_call: _extract_chunk_metadata(chunk)
        Note over _stream_llm_call: Extract response_metadata<br/>and usage_metadata if present
        _stream_llm_call->>_stream_llm_call: accumulated_metadata.update(chunk_metadata)
        _stream_llm_call->>StreamingHandler: push_chunk(content, chunk_metadata)
        StreamingHandler->>StreamingHandler: current_metadata.update(metadata)
        StreamingHandler->>StreamingHandler: queue.put({"text": content, "metadata": current_metadata})
        StreamingHandler-->>User: yield {"text": content, "metadata": {...}}
    end
    
    _stream_llm_call->>_stream_llm_call: llm_response_metadata_var.set(accumulated_metadata)
    _stream_llm_call->>StreamingHandler: finish()
    StreamingHandler->>StreamingHandler: push_chunk(END_OF_STREAM)
    Note over StreamingHandler: END_OF_STREAM converted to<br/>{"text": "", "metadata": {...}}
    StreamingHandler-->>User: yield final chunk with all metadata
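
A minimal sketch of the END_OF_STREAM conversion noted in the diagram, assuming a module-level sentinel; the function name and signature are illustrative (the real handling lives in nemoguardrails/streaming.py):

# Illustrative sketch; the actual conversion happens in
# nemoguardrails/streaming.py when include_metadata is enabled.
END_OF_STREAM = object()  # sentinel marking stream completion

def format_chunk(chunk, current_metadata: dict) -> dict:
    # Never leak the sentinel to consumers as chunk text: the final
    # chunk carries an empty string plus all accumulated metadata.
    text = "" if chunk is END_OF_STREAM else chunk
    return {"text": text, "metadata": current_metadata}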

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@Pouyanpi Pouyanpi self-assigned this Feb 6, 2026
@Pouyanpi Pouyanpi added the enhancement New feature or request label Feb 6, 2026
@Pouyanpi Pouyanpi added this to the v0.21 milestone Feb 6, 2026