-
Notifications
You must be signed in to change notification settings - Fork 8.1k
Fix ingestion pipeline chunk handling and tokenizer validation #12994
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR aims to fix ingestion pipeline failures when components pass output_format="chunks" payloads through HierarchicalMerger -> Splitter -> Tokenizer, specifically around chunk payload handling and tokenizer input validation (plus a small UI/schema validation addition).
Changes:
- Adjusts tokenizer upstream validation to treat
chunksas present when it’s provided (even if empty), and to only require JSON payloads whenjson_result is None. - Updates Splitter to consume upstream
chunkswhenoutput_format="chunks"instead of ignoring them and defaulting tojson_result. - Makes HierarchicalMerger tolerant of
Nonetext fields in chunk entries; adds cross-field validation in the chat settings schema.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
web/src/pages/next-chats/chat/app-settings/use-chat-setting-schema.tsx |
Adds cross-field Zod validation tying {knowledge} usage to KB selection / Tavily key. |
rag/flow/tokenizer/schema.py |
Relaxes/adjusts upstream payload validation rules for chunks and json_result. |
rag/flow/splitter/splitter.py |
Correctly routes upstream chunks into the JSON/chunk splitting path when output_format="chunks". |
rag/flow/hierarchical_merger/hierarchical_merger.py |
Avoids crashes when chunk text is None by normalizing to empty string during merge. |
rag/flow/tokenizer/schema.py
Outdated
| @@ -46,8 +46,8 @@ | |||
| if self.output_format == "text" and not self.text_result: | |||
| raise ValueError("output_format=text requires a text payload (field: 'text' or 'text_result').") | |||
| if self.output_format == "html" and not self.html_result: | |||
| raise ValueError("output_format=text requires a html payload (field: 'html' or 'html_result').") | |||
| else: | |||
| if not self.json_result and not self.chunks: | |||
| if self.json_result is None: | |||
| raise ValueError("When no chunks are provided and output_format is not markdown/text, a JSON list payload is required (field: 'json' or 'json_result').") | |||
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TokenizerFromUpstream now treats an empty chunks list as a valid payload (chunks is not None), but Tokenizer._invoke later checks if from_upstream.chunks: (truthy) and will fall back to json_result when chunks is []. If json_result is unset (common when upstream output_format="chunks"), this becomes None and the subsequent loop over chunks will crash. Either update the tokenizer implementation to treat chunks is not None as provided (including empty), or ensure the validated model provides a safe empty list for the fallback path when chunks is empty.
| if self.output_format == "text" and not self.text_result: | ||
| raise ValueError("output_format=text requires a text payload (field: 'text' or 'text_result').") | ||
| if self.output_format == "html" and not self.html_result: | ||
| raise ValueError("output_format=text requires a html payload (field: 'html' or 'html_result').") |
Copilot
AI
Feb 4, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The html validation error message is incorrect: when output_format == "html" the exception currently says output_format=text requires a html payload..., which is misleading for users and makes debugging harder. Update the message to reference output_format=html (and consider fixing the article to "an html" if you’re touching it).
| raise ValueError("output_format=text requires a html payload (field: 'html' or 'html_result').") | |
| raise ValueError("output_format=html requires an html payload (field: 'html' or 'html_result').") |
What problem does this PR solve?
1.Fixes the ingestion pipeline path where HierarchicalMerger -> Splitter -> Tokenizer fails because output_format=chunks payloads were ignored and empty chunk lists triggered Tokenizer validation errors.
Fix ingestion pipeline chunk handling and hierarchical merger null text
2.Fixes ingestion pipeline failures where output_format=chunks payloads were ignored, empty chunk lists triggered Tokenizer validation errors, and HierarchicalMerger crashed on None text entries from docx
table blocks.
Type of change
Bug Fix (non-breaking change which fixes an issue)