
Conversation

Contributor

@filipi87 filipi87 commented Feb 2, 2026

Summary

  • Added automatic context summarization to compress conversation history when token limits are approached, enabling efficient long-running conversations
  • Implemented token estimation using a character-count heuristic (1 token ≈ 4 characters); see the sketch after this list
  • Added smart message selection that preserves system messages, recent context, and incomplete function call sequences
  • Configured via enable_context_summarization=True in LLMUserAggregatorParams with customizable thresholds and behavior
  • Runs summarization asynchronously in background tasks to avoid blocking the pipeline
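
For reference, a minimal sketch of the character-count heuristic, assuming plain-string message content; the function names are illustrative and not the actual utility API.

def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Roughly estimate token count using the 1 token ≈ 4 characters rule."""
    return max(1, len(text) // chars_per_token)


def estimate_context_tokens(messages: list[dict]) -> int:
    """Sum rough token estimates over every message's string content."""
    total = 0
    for message in messages:
        content = message.get("content") or ""
        if isinstance(content, str):
            total += estimate_tokens(content)
    return total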

Key Features

  • Automatic triggering: Summarization runs when the context exceeds 80% of max tokens (default 8000) or after 20 unsummarized messages (see the check sketched after this list)
  • Function call awareness: Never summarizes incomplete tool call sequences, preserving request-response pairing integrity
  • Interruption handling: Cancels pending summarizations when user interrupts to avoid stale results
  • Configurable preservation: Keeps configurable number of recent messages (default: 4) uncompressed for immediate context
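
A minimal sketch of that trigger check, assuming the configuration field names shown below; the real aggregator logic also accounts for system messages and in-flight function calls.

def should_trigger_summarization(
    total_tokens: int,
    unsummarized_messages: int,
    max_context_tokens: int = 8000,
    summarization_threshold: float = 0.8,
    max_unsummarized_messages: int = 20,
) -> bool:
    """Trigger when either the token budget or the message budget is exceeded."""
    token_limit = int(max_context_tokens * summarization_threshold)
    return total_tokens >= token_limit or unsummarized_messages >= max_unsummarized_messages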

Configuration

from pipecat.processors.aggregators.llm_response_universal import (
    LLMUserAggregatorParams,
)
from pipecat.utils.context.llm_context_summarization import (
    LLMContextSummarizationConfig,
)

user_aggregator_params = LLMUserAggregatorParams(
    enable_context_summarization=True,
    context_summarization_config=LLMContextSummarizationConfig(
        max_context_tokens=8000,           # Maximum context size
        summarization_threshold=0.8,        # Trigger at 80% of max
        max_unsummarized_messages=20,       # Or after 20 new messages
        min_messages_after_summary=4,       # Keep last 4 messages
        summarization_prompt=None,          # Optional custom prompt
    )
)

Testing

Run the new test suite:

uv run pytest tests/test_context_summarization.py

Try the examples:

# OpenAI example with function calling
uv run examples/foundational/54-context-summarization-openai.py

# Google Gemini example
uv run examples/foundational/54a-context-summarization-google.py

Implementation Details

New Components:

  • src/pipecat/utils/context/llm_context_summarization.py: Core utility with token estimation, message selection, and formatting
  • LLMContextSummaryRequestFrame and LLMContextSummaryResultFrame: New control frames for the async summarization flow (sketched after this section)
  • LLMContextSummarizationConfig: Configuration dataclass with validation

Modified Components:

  • LLMUserAggregator: Added summarization trigger logic, state tracking, and result handling
  • LLMService: Added async summary generation using run_inference() with max_tokens override
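
For orientation, a rough sketch of the request/result frame pair, based only on the fields visible in this PR (context, request_id, min_messages_to_keep, max_context_tokens, summary, last_summarized_index); the actual definitions in src/pipecat/frames/frames.py may carry additional fields.

from dataclasses import dataclass
from typing import Any

from pipecat.frames.frames import ControlFrame


@dataclass
class LLMContextSummaryRequestFrame(ControlFrame):
    """Asks an LLM service to summarize part of the conversation context."""

    context: Any                # full LLM context to analyze and summarize
    request_id: str             # correlates the request with its result
    min_messages_to_keep: int   # recent messages excluded from the summary
    max_context_tokens: int     # token budget enforced during inference


@dataclass
class LLMContextSummaryResultFrame(ControlFrame):
    """Carries the generated summary back to the requesting aggregator."""

    request_id: str             # matches the originating request
    summary: str                # LLM-generated summary text
    last_summarized_index: int  # index of the last message covered by the summary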


codecov bot commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 84.83965% with 52 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
...pipecat/utils/context/llm_context_summarization.py | 84.13% | 23 Missing ⚠️
...t/processors/aggregators/llm_context_summarizer.py | 90.82% | 10 Missing ⚠️
src/pipecat/services/llm_service.py | 81.39% | 8 Missing ⚠️
...t/processors/aggregators/llm_response_universal.py | 71.42% | 6 Missing ⚠️
src/pipecat/services/openai/base_llm.py | 40.00% | 3 Missing ⚠️
src/pipecat/services/aws/llm.py | 66.66% | 1 Missing ⚠️
src/pipecat/services/google/llm.py | 66.66% | 1 Missing ⚠️

Files with missing lines | Coverage Δ
src/pipecat/frames/frames.py | 89.44% <100.00%> (+0.21%) ⬆️
src/pipecat/services/anthropic/llm.py | 39.05% <100.00%> (ø)
src/pipecat/services/aws/llm.py | 34.66% <66.66%> (+0.06%) ⬆️
src/pipecat/services/google/llm.py | 41.79% <66.66%> (+0.04%) ⬆️
src/pipecat/services/openai/base_llm.py | 55.17% <40.00%> (-0.71%) ⬇️
...t/processors/aggregators/llm_response_universal.py | 78.82% <71.42%> (-0.37%) ⬇️
src/pipecat/services/llm_service.py | 44.21% <81.39%> (+6.76%) ⬆️
...t/processors/aggregators/llm_context_summarizer.py | 90.82% <90.82%> (ø)
...pipecat/utils/context/llm_context_summarization.py | 84.13% <84.13%> (ø)

@filipi87 filipi87 marked this pull request as ready for review February 4, 2026 18:38
@markbackman markbackman requested a review from kompfner February 4, 2026 22:12
Contributor

@markbackman markbackman left a comment

This looks really great! Very clean and easy to understand. Nice work 👏

I'm going to do some testing and will let you know if I find anything.

@@ -0,0 +1,307 @@
# Code Cleanup Skill
Contributor

I think this is fine to start. We will probably want to iterate until we land on something optimized.

@filipi87 filipi87 requested a review from markbackman February 6, 2026 21:47
Contributor

@markbackman markbackman left a comment

LGTM! Thanks for the clean up.

From the first review, the only item I still see open is this one (regarding the base summary prompt):
https://github.com/pipecat-ai/pipecat/pull/3621/changes#r2775817727

Aside from that, this looks good to go. It probably makes sense to get input from someone else too since this is such a key feature and will get a ton of use.

Contributor Author

filipi87 commented Feb 9, 2026

I had missed that one. Fixed. Thank you for the review, @markbackman. 🙌

@filipi87 filipi87 force-pushed the filipi/context_compressure branch from 69b4a10 to 161ede2 Compare February 9, 2026 13:40
class LLMContextSummaryRequestFrame(ControlFrame):
"""Frame requesting context summarization from an LLM service.

Sent by aggregators to LLM services when conversation context needs to be
Contributor

Might also be worth describing what the LLMs then do with that summary (i.e. push an LLMContextSummaryResultFrame, right?)

Contributor Author

I was describing what they do inside LLMContextSummaryResultFrame.

context: The full LLM context containing all messages to analyze and summarize.
min_messages_to_keep: Number of recent messages to preserve uncompressed.
These messages will not be included in the summary.
max_context_tokens: Maximum allowed context size in tokens. The LLM should
Contributor

@kompfner kompfner Feb 9, 2026

Feedback applies throughout: should we be transparent about the fact that it's approximate tokens, maybe by calling it something like max_approx_context_tokens (and updating docstring comments accordingly so developers know what they're dealing with)?

Contributor Author

I think the name max_context_tokens is correct here, because this is what we pass to the LLM when running inference to enforce the maximum number of tokens.

For the one inside LLMContextSummarizationConfig, what the user specifies is only an approximation, since the way we calculate tokens is approximate. Even so, in my opinion we should keep the same name there, but make it clear in the docstring that the token calculation is an approximation.

What do you think?

Contributor Author

OK, for now I have improved the description of max_context_tokens inside LLMContextSummarizationConfig, explaining how the tokens are calculated.



@dataclass
class LLMContextSummaryRequestFrame(ControlFrame):
Contributor

Q: do we want to consume these frames in the LLMs, or let them continue down the pipeline, just in case anyone wants to handle them in custom processors?

My inclination is that these seem fine to consume in the LLMs...

Contributor Author

As we discussed in yesterday’s meeting, since we’re actually handling the frame and doing something with it (creating the summary), it feels more natural to consume it here rather than let it continue.

self._params.context_summarization_config or LLMContextSummarizationConfig()
)
self._summarization_in_progress = False
self._pending_summary_request_id: Optional[str] = None
Contributor

do we need both of the above state variables? could the presence of _pending_summary_request_id indicate that a summarization is in progress?

Contributor Author

Yeah, that could work. 👍
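
For illustration, a tiny sketch of that simplification; the class and property names here are hypothetical.

from typing import Optional


class _SummarizationStateSketch:
    """Derive "in progress" from the pending request id alone."""

    _pending_summary_request_id: Optional[str] = None

    @property
    def _summarization_in_progress(self) -> bool:
        # A non-None request id means a summary request is still outstanding.
        return self._pending_summary_request_id is not None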

for s in self._params.user_mute_strategies:
await s.cleanup()

async def _clear_summarization(self):
Contributor

nit: _clear_summarization_state might be clearer

Contributor Author

Done.

# Apply summary
await self._apply_summary(frame.summary, frame.last_summarized_index)

def _validate_summary_context(self, last_summarized_index: int) -> bool:
Contributor

@kompfner kompfner Feb 9, 2026

Is this function safeguarding against the possibility of programmatic edits to the context, like from LLMMessageUpdateFrames and the like? If so, then in a PR I worked on recently (which maybe you've looked through already) I added some mechanisms for tracking with more certainty whether a context has been edited... wonder if we could join forces and use those here to determine more reliably whether a summary still applies.

"""
messages = self._context.messages

# Find first system message (if any)
Contributor

Probably a very fringe case, but should we handle the possibility of the first system message appearing later than the last summarized index? There's technically no hard requirement that a "system"-role message has to appear at or near the beginning of the conversation, esp. with providers like OpenAI.

Or...does the summarization process already exclude the first system message? (I should probably just read on to find out, but wanted to jot this note down here).

Contributor

When using OpenAI (where "system"-role messages can appear anywhere), forcing the system message to appear at the beginning on summarization might mess with conversation flow. But on the other hand, summarization does need to work universally, and other providers don't handle "system"-role messages appearing just anywhere...

With Gemini, we "pull" the first system instruction out of the messages and use it as the overall system instruction (which it seems like the logic here is modeled after). But with AWS Bedrock, we only pull a "system"-role message out of messages and use it as the system instruction if it's the first message. We're inconsistent, which isn't ideal...

As I think about it, two approaches come to mind:

  • What you have here
  • Only checking the very first message in messages for a system message

Almost always, those two are the same. So in practice I don't know if this makes much of a difference.

But it's a good reminder that we should probably do a consistency pass on how we translate "system"-role messages for different providers.

Contributor Author

What I am doing in this PR is finding the index of the first system message and defining summary_start = first_system_index + 1.

I then summarize only the messages that come after the first system message, or everything if there is no system message.
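
A self-contained sketch of that selection rule; the helper name and return shape are illustrative, not the PR's actual API.

def select_summary_range(messages: list[dict], min_messages_to_keep: int) -> tuple[int, int]:
    """Return (summary_start, summary_end) indices of the messages to summarize."""
    first_system_index = next(
        (i for i, m in enumerate(messages) if m.get("role") == "system"), None
    )
    # Summarize only messages after the first system message, or everything
    # when there is no system message.
    summary_start = 0 if first_system_index is None else first_system_index + 1
    # Keep the most recent messages out of the summary.
    summary_end = max(summary_start, len(messages) - min_messages_to_keep)
    return summary_start, summary_end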

)

# Calculate max_tokens for the summary using utility method
max_summary_tokens = LLMContextSummarizationUtil.calculate_max_summary_tokens(
Contributor

For my understanding: the max_context_tokens that the developer specifies as the point where a summary should be triggered is the same number used to compute how big the summary should be?

It seems like how big you want the summary to be should be (at least somewhat) independent—you might want to ask the summary to be relatively compact so you don't have to do it as often, rather than letting it take up all the remaining space, no?

Contributor Author

It basically is, but when calculating the available space to define the max_tokens that I pass to the LLM, I always apply a 0.8 buffer to keep the summary at a maximum of 80% of the available space.

But I think you’re right, we should probably create a:

  • target_context_tokens: the target maximum context size in tokens after summarization.

What do you think?

Contributor Author

Done. I have created target_context_tokens.
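
For reference, a sketch of that budget calculation, assuming the 0.8 buffer mentioned above and the new target_context_tokens setting; the real LLMContextSummarizationUtil.calculate_max_summary_tokens may take different inputs.

def calculate_max_summary_tokens(
    target_context_tokens: int,
    kept_messages_tokens: int,
    buffer: float = 0.8,
) -> int:
    """Cap the summary at ~80% of the space left after the preserved messages."""
    available = max(0, target_context_tokens - kept_messages_tokens)
    return max(1, int(available * buffer))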

token_limit_exceeded = total_tokens >= token_limit

# Check if we've exceeded max unsummarized messages
messages_since_summary = len(self._context.messages) - 1
Contributor

nit: do you want to count the possible initial system message towards the limit? if not, you might have to subtract 2, no?

filter_incomplete_user_turns: bool = False
user_turn_completion_config: Optional[UserTurnCompletionConfig] = None
enable_context_summarization: bool = False
context_summarization_config: Optional[LLMContextSummarizationConfig] = None
Contributor

What is the reason for adding this into the user aggregator instead of the assistant one? Just curious.

Contributor

Actually, it seems like this should be done in the assistant aggregator; that feels more natural, I think.

Contributor Author

I decided to include it there because each time the user started a new turn and we pushed a new context frame, it seemed like a nice moment to also push, as a follow-up, a frame requesting summarization if needed.

This way, the LLM would have time to process it while the TTS was generating the previous answer.

So it felt like a good spot to add this logic without impacting performance.

But as we discussed on Slack, we can achieve something similar using the assistant aggregator, if we use LLMFullResponseStartFrame to decide whether or not to request context summarization.

Contributor Author

Done.
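
An excerpt-style sketch of the trigger point that came out of this thread: the assistant aggregator reacts to LLMFullResponseStartFrame so the summary can be generated while TTS is still speaking the previous answer. The helper methods here are hypothetical.

from pipecat.frames.frames import LLMFullResponseStartFrame


class _AssistantSummaryTriggerSketch:
    """Illustrative excerpt of logic that would live in the assistant aggregator."""

    async def _maybe_request_summarization(self, frame) -> None:
        # Trigger on the start of a new LLM response so summarization overlaps
        # with TTS playback of the previous answer.
        if isinstance(frame, LLMFullResponseStartFrame) and self._should_summarize():
            await self._push_summary_request_frame()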

logger.debug(f"{self}: Processing summarization request {frame.request_id}")

# Create a background task to generate the summary without blocking
self.create_task(self._generate_summary_task(frame))
Contributor

we should save a reference to the task and cancel it on cleanup if necessary. we should also call

await asyncio.sleep(0)

to schedule the task in the event loop.

Contributor

We still need to keep track of this task. And, since we don't await it, we need to call await asyncio.sleep(0).

Contributor Author

Yep, I have just pushed the fix to keep track of it.

Contributor Author

But I am not sure why we need this?
asyncio.sleep(0)

Contributor Author

I haven't added this one yet. Is it really needed?

Contributor

If we don't cancel the task during interruption, then no, not needed. The reason is that if you cancel a task before the task is started by the event loop (different than created) you will get RuntimeWarnings saying that the task was never awaited. Since we don't cancel the task during interruptions, I think we should be ok.
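
Putting the thread together, an excerpt-style sketch of the pattern being discussed, assuming pipecat's create_task()/cancel_task() helpers on the LLM service; attribute and method names are illustrative.

import asyncio
from typing import Optional


class _SummaryTaskLifecycleSketch:
    """Keep a reference to the background task, start it promptly, cancel on cleanup."""

    _summary_task: Optional[asyncio.Task] = None

    async def _handle_summary_request(self, frame) -> None:
        # Keep a reference so the task can be cancelled later if needed.
        self._summary_task = self.create_task(self._generate_summary_task(frame))
        # Yield once so the event loop actually starts the task; this avoids
        # "task was never awaited" warnings if it gets cancelled very early.
        await asyncio.sleep(0)

    async def _cleanup_summary_task(self) -> None:
        # Cancel any in-flight summarization before tearing down.
        if self._summary_task:
            await self.cancel_task(self._summary_task)
            self._summary_task = None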

@aconchillo
Contributor

LGTM! 👏

@filipi87 filipi87 force-pushed the filipi/context_compressure branch from 5a3edb6 to 2475697 Compare February 10, 2026 21:59
@filipi87 filipi87 merged commit a98c884 into main Feb 10, 2026
6 checks passed
@filipi87 filipi87 deleted the filipi/context_compressure branch February 10, 2026 22:04