Fix ingestion pipeline chunk handling and tokenizer validation #12994

LiruiYu33 · 2026-02-04T12:07:34Z

What problem does this PR solve?
1.Fixes the ingestion pipeline path where HierarchicalMerger -> Splitter -> Tokenizer fails because output_format=chunks payloads were ignored and empty chunk lists triggered Tokenizer validation errors.
Fix ingestion pipeline chunk handling and hierarchical merger null text
2.Fixes ingestion pipeline failures where output_format=chunks payloads were ignored, empty chunk lists triggered Tokenizer validation errors, and HierarchicalMerger crashed on None text entries from docx
table blocks.

Type of change
Bug Fix (non-breaking change which fixes an issue)

img_v3_02uj_b60ec9e7-966c-4390-9e74-11e40548c73g

Copilot

Pull request overview

This PR aims to fix ingestion pipeline failures when components pass output_format="chunks" payloads through HierarchicalMerger -> Splitter -> Tokenizer, specifically around chunk payload handling and tokenizer input validation (plus a small UI/schema validation addition).

Changes:

Adjusts tokenizer upstream validation to treat chunks as present when it’s provided (even if empty), and to only require JSON payloads when json_result is None.
Updates Splitter to consume upstream chunks when output_format="chunks" instead of ignoring them and defaulting to json_result.
Makes HierarchicalMerger tolerant of None text fields in chunk entries; adds cross-field validation in the chat settings schema.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`web/src/pages/next-chats/chat/app-settings/use-chat-setting-schema.tsx`	Adds cross-field Zod validation tying `{knowledge}` usage to KB selection / Tavily key.
`rag/flow/tokenizer/schema.py`	Relaxes/adjusts upstream payload validation rules for `chunks` and `json_result`.
`rag/flow/splitter/splitter.py`	Correctly routes upstream `chunks` into the JSON/chunk splitting path when `output_format="chunks"`.
`rag/flow/hierarchical_merger/hierarchical_merger.py`	Avoids crashes when chunk `text` is `None` by normalizing to empty string during merge.

Copilot · 2026-02-04T12:13:26Z

rag/flow/tokenizer/schema.py

@@ -46,8 +46,8 @@
            if self.output_format == "text" and not self.text_result:
                raise ValueError("output_format=text requires a text payload (field: 'text' or 'text_result').")
            if self.output_format == "html" and not self.html_result:
                raise ValueError("output_format=text requires a html payload (field: 'html' or 'html_result').")
        else:
-            if not self.json_result and not self.chunks:
+            if self.json_result is None:
                raise ValueError("When no chunks are provided and output_format is not markdown/text, a JSON list payload is required (field: 'json' or 'json_result').")


TokenizerFromUpstream now treats an empty chunks list as a valid payload (chunks is not None), but Tokenizer._invoke later checks if from_upstream.chunks: (truthy) and will fall back to json_result when chunks is []. If json_result is unset (common when upstream output_format="chunks"), this becomes None and the subsequent loop over chunks will crash. Either update the tokenizer implementation to treat chunks is not None as provided (including empty), or ensure the validated model provides a safe empty list for the fallback path when chunks is empty.

Copilot · 2026-02-04T12:13:26Z

rag/flow/tokenizer/schema.py

            if self.output_format == "text" and not self.text_result:
                raise ValueError("output_format=text requires a text payload (field: 'text' or 'text_result').")
            if self.output_format == "html" and not self.html_result:
                raise ValueError("output_format=text requires a html payload (field: 'html' or 'html_result').")


The html validation error message is incorrect: when output_format == "html" the exception currently says output_format=text requires a html payload..., which is misleading for users and makes debugging harder. Update the message to reference output_format=html (and consider fixing the article to "an html" if you’re touching it).

Suggested change

raise ValueError("output_format=text requires a html payload (field: 'html' or 'html_result').")

raise ValueError("output_format=html requires an html payload (field: 'html' or 'html_result').")

Copilot AI review requested due to automatic review settings February 4, 2026 12:07

dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Feb 4, 2026

Copilot started reviewing on behalf of LiruiYu33 February 4, 2026 12:07 View session

dosubot bot added 🐞 bug Something isn't working, pull request that fix bug. size:M This PR changes 30-99 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Feb 4, 2026

Copilot AI reviewed Feb 4, 2026

View reviewed changes

dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Feb 5, 2026

LiruiYu33 closed this Feb 5, 2026

LiruiYu33 force-pushed the main branch from a68deb1 to bbd8ba6 Compare February 5, 2026 09:50

dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Feb 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix ingestion pipeline chunk handling and tokenizer validation #12994

Fix ingestion pipeline chunk handling and tokenizer validation #12994

Uh oh!

LiruiYu33 commented Feb 4, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 4, 2026

Uh oh!

Copilot AI Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	raise ValueError("output_format=text requires a html payload (field: 'html' or 'html_result').")
	raise ValueError("output_format=html requires an html payload (field: 'html' or 'html_result').")

Fix ingestion pipeline chunk handling and tokenizer validation #12994

Fix ingestion pipeline chunk handling and tokenizer validation #12994

Uh oh!

Conversation

LiruiYu33 commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

LiruiYu33 commented Feb 4, 2026 •

edited

Loading