Contributor

@LiruiYu33 LiruiYu33 commented Feb 4, 2026

What problem does this PR solve?
Fix ingestion pipeline chunk handling and hierarchical merger null text

1. Fixes the ingestion pipeline path HierarchicalMerger -> Splitter -> Tokenizer, which failed because output_format=chunks payloads were ignored and empty chunk lists triggered Tokenizer validation errors.
2. Fixes a crash in HierarchicalMerger on None text entries produced by docx table blocks.

Type of change
Bug Fix (non-breaking change which fixes an issue)

Copilot AI review requested due to automatic review settings February 4, 2026 12:07
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Feb 4, 2026
@dosubot dosubot bot added 🐞 bug Something isn't working, pull request that fix bug. size:M This PR changes 30-99 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Feb 4, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR aims to fix ingestion pipeline failures when components pass output_format="chunks" payloads through HierarchicalMerger -> Splitter -> Tokenizer, specifically around chunk payload handling and tokenizer input validation (plus a small UI/schema validation addition).

Changes:

  • Adjusts tokenizer upstream validation to treat chunks as present when it’s provided (even if empty), and to only require JSON payloads when json_result is None.
  • Updates Splitter to consume upstream chunks when output_format="chunks" instead of ignoring them and defaulting to json_result.
  • Makes HierarchicalMerger tolerant of None text fields in chunk entries; adds cross-field validation in the chat settings schema.
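
To illustrate the Splitter change listed above, here is a minimal sketch of routing by output_format. The helper name `select_upstream_payload` and its parameters are hypothetical and are not taken from rag/flow/splitter/splitter.py; it only shows the idea of consuming upstream chunks instead of defaulting to json_result.

```python
from typing import Any, Optional


def select_upstream_payload(
    output_format: str,
    chunks: Optional[list[dict[str, Any]]],
    json_result: Optional[list[dict[str, Any]]],
) -> list[dict[str, Any]]:
    """Pick the items a splitter should work on (hypothetical helper)."""
    if output_format == "chunks":
        # Consume upstream chunks; a missing or empty list means
        # there is nothing to split, not "fall back to json_result".
        return chunks or []
    # Any other structured format keeps using the JSON payload.
    return json_result or []
```

Returning an empty list rather than None keeps the downstream loop safe when an upstream component emits no chunks.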

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File | Description
--- | ---
web/src/pages/next-chats/chat/app-settings/use-chat-setting-schema.tsx | Adds cross-field Zod validation tying {knowledge} usage to KB selection / Tavily key.
rag/flow/tokenizer/schema.py | Relaxes/adjusts upstream payload validation rules for chunks and json_result.
rag/flow/splitter/splitter.py | Correctly routes upstream chunks into the JSON/chunk splitting path when output_format="chunks".
rag/flow/hierarchical_merger/hierarchical_merger.py | Avoids crashes when chunk text is None by normalizing to empty string during merge.
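
As a rough illustration of the None-text normalization described for hierarchical_merger.py, the sketch below joins chunk texts while tolerating None entries. The function `merge_chunk_texts` and the dict-shaped chunks are assumptions made for this example, not the actual merger code.

```python
from typing import Any, Optional


def merge_chunk_texts(chunks: list[dict[str, Optional[Any]]], sep: str = "\n") -> str:
    """Join chunk texts, treating None or missing text as empty (hypothetical helper)."""
    # `c.get("text") or ""` maps both a missing key and an explicit None to "".
    parts = [(c.get("text") or "") for c in chunks]
    # Skip empty parts so table blocks without text do not add blank lines.
    return sep.join(p for p in parts if p)
```

For example, a docx table block that arrives as `{"text": None}` between two paragraphs is skipped instead of raising a TypeError during concatenation.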

Comment on lines 40 to 52
@@ -46,8 +46,8 @@
     if self.output_format == "text" and not self.text_result:
         raise ValueError("output_format=text requires a text payload (field: 'text' or 'text_result').")
     if self.output_format == "html" and not self.html_result:
         raise ValueError("output_format=text requires a html payload (field: 'html' or 'html_result').")
 else:
-    if not self.json_result and not self.chunks:
+    if self.json_result is None:
         raise ValueError("When no chunks are provided and output_format is not markdown/text, a JSON list payload is required (field: 'json' or 'json_result').")

Copilot AI Feb 4, 2026

TokenizerFromUpstream now treats an empty chunks list as a valid payload (chunks is not None), but Tokenizer._invoke later checks if from_upstream.chunks: (truthy) and will fall back to json_result when chunks is []. If json_result is unset (common when upstream output_format="chunks"), this becomes None and the subsequent loop over chunks will crash. Either update the tokenizer implementation to treat chunks is not None as provided (including empty), or ensure the validated model provides a safe empty list for the fallback path when chunks is empty.
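
The mismatch described in this comment can be reproduced in isolation. The sketch below uses a hypothetical `Upstream` dataclass as a stand-in for TokenizerFromUpstream and two hypothetical selector functions; it is not the real Tokenizer._invoke code, only a demonstration of truthiness versus an explicit None check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Upstream:
    """Hypothetical stand-in for the validated upstream model."""
    chunks: Optional[list] = None
    json_result: Optional[list] = None


def pick_truthy(u: Upstream) -> Optional[list]:
    # Mirrors an `if from_upstream.chunks:` check: an empty list is falsy,
    # so this falls through to json_result, which may be None.
    return u.chunks if u.chunks else u.json_result


def pick_explicit(u: Upstream) -> list:
    # Treats `chunks is not None` as "provided", including the empty list,
    # and guarantees a list on the fallback path.
    if u.chunks is not None:
        return u.chunks
    return u.json_result or []
```

With `Upstream(chunks=[])`, `pick_truthy` returns None and a later `for` loop over the result would raise a TypeError, while `pick_explicit` returns an empty list that is safe to iterate.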

if self.output_format == "text" and not self.text_result:
    raise ValueError("output_format=text requires a text payload (field: 'text' or 'text_result').")
if self.output_format == "html" and not self.html_result:
    raise ValueError("output_format=text requires a html payload (field: 'html' or 'html_result').")

Copilot AI Feb 4, 2026

The html validation error message is incorrect: when output_format == "html" the exception currently says output_format=text requires a html payload..., which is misleading for users and makes debugging harder. Update the message to reference output_format=html (and consider fixing the article to "an html" if you’re touching it).

Suggested change
- raise ValueError("output_format=text requires a html payload (field: 'html' or 'html_result').")
+ raise ValueError("output_format=html requires an html payload (field: 'html' or 'html_result').")
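
One possible way to avoid this kind of copy-paste mismatch is to derive the message from the format name. The sketch below is only an illustration: `_PAYLOAD_FIELDS` and `missing_payload_message` are hypothetical names, the message wording differs slightly from the one in the diff, and only the two field pairs quoted above are included.

```python
# Field names are taken from the error messages quoted in this review;
# the mapping itself is a hypothetical construct for illustration.
_PAYLOAD_FIELDS = {
    "text": ("text", "text_result"),
    "html": ("html", "html_result"),
}


def missing_payload_message(output_format: str) -> str:
    """Build the validation message from the format name (hypothetical helper)."""
    short, full = _PAYLOAD_FIELDS[output_format]
    # The format name appears exactly once in the template, so it cannot
    # drift out of sync with the branch that raises the error.
    return (
        f"output_format={output_format} requires a payload "
        f"(field: '{short}' or '{full}')."
    )
```

A validator could then call `raise ValueError(missing_payload_message(self.output_format))` in each branch.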

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Feb 5, 2026
@LiruiYu33 LiruiYu33 closed this Feb 5, 2026
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Feb 5, 2026