Add Pydantic validation for LLM responses in page_index #100
Open
adityasasidhar wants to merge 4 commits into VectifyAI:main
- Introduced Pydantic models for all LLM JSON responses to ensure type safety and structural integrity.
- Replaced manual `extract_json` calls with `parse_llm_response` and specific Pydantic models.
- Defined models: `CheckTitleAppearanceResponse`, `TocDetectorResponse`, `TocIndexResponse`, `TocTransformerResponse`, `AddPageNumberResponse`, `GenerateTocResponse`, and others.
- Updated core functions including:
  - `check_title_appearance`
  - `toc_detector_single_page`
  - `toc_index_extractor`
  - `toc_transformer`
  - `add_page_number_to_toc`
  - `generate_toc_init`
  - `single_toc_item_index_fixer`
- Implemented robust handling for list-based LLM outputs by wrapping them in root models.
- Fix typo: `tob_extractor_prompt` -> `toc_extractor_prompt`.
- Fix `parse_llm_response()` to preserve title casing.
- Add whitelist for normalizable fields (`answer`, `start_begin`, etc.).
- Standardize all validation to use the `parse_llm_response()` helper.
- Add logger parameter to validation functions for consistent error reporting.
- Update all function call sites to pass the logger parameter.
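The first commit message mentions wrapping list-based LLM outputs in root models. A minimal sketch of that idea follows, assuming Pydantic v2 (where `RootModel` is available). The field names `title` and `physical_index` are illustrative assumptions, not the fields used in the PR.

```python
from typing import List, Optional

from pydantic import BaseModel, RootModel, ValidationError


class TocIndexItem(BaseModel):
    # Field names are assumptions for illustration only.
    title: str
    physical_index: Optional[int] = None


class TocIndexResponse(RootModel[List[TocIndexItem]]):
    """Wraps a bare JSON array so it validates like any other model."""


# A well-formed list response validates item by item.
raw = '[{"title": "Introduction", "physical_index": 3}, {"title": "Methods"}]'
toc = TocIndexResponse.model_validate_json(raw)
titles = [item.title for item in toc.root]

# A response with the wrong top-level shape is rejected instead of passing silently.
try:
    TocIndexResponse.model_validate_json('{"title": "not a list"}')
    rejected = False
except ValidationError:
    rejected = True
```

The validated items are reachable through the `.root` attribute, so downstream code can iterate over typed objects rather than raw dictionaries.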
Pull Request Summary
This PR improves the reliability of `pageindex/page_index.py` by introducing Pydantic-based validation for LLM-generated JSON outputs and fixing several related correctness issues.
Previously, the file relied on the output formats described in the system prompts, with no validation of what came back. Since LLMs can return malformed or partial JSON, this approach was fragile and could lead to silent failures. This change enforces strict schemas and makes failures deterministic and easier to debug.
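As a rough illustration of the approach described above, here is a minimal sketch of an extract-then-validate helper, assuming Pydantic v2. It is not the PR's implementation: the signature of `parse_llm_response`, the extraction strategy, and the fields of `TocDetectorResponse` are all assumptions made for the example.

```python
import json
from typing import Type, TypeVar

from pydantic import BaseModel, ValidationError

T = TypeVar("T", bound=BaseModel)


class TocDetectorResponse(BaseModel):
    # Hypothetical fields; the real model in the PR may differ.
    thinking: str = ""
    toc_detected: str


def parse_llm_response(text: str, model: Type[T]) -> T:
    """Extract the JSON payload from raw LLM text and validate it against `model`.

    Raises ValueError for missing, malformed, or schema-violating JSON, so
    callers see one predictable failure mode.
    """
    # Take everything from the first opening bracket to the last closing one,
    # which skips any chatter the LLM put around the JSON.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    end = max(text.rfind("}"), text.rfind("]"))
    if not starts or end < min(starts):
        raise ValueError("no JSON payload found in LLM response")
    payload = text[min(starts):end + 1]
    try:
        return model.model_validate(json.loads(payload))
    except (json.JSONDecodeError, ValidationError) as exc:
        raise ValueError(f"invalid {model.__name__}: {exc}") from exc


good = parse_llm_response(
    'Sure, here is the result:\n{"thinking": "page lists chapters", "toc_detected": "yes"}',
    TocDetectorResponse,
)

# A response missing a required field fails loudly instead of silently.
try:
    parse_llm_response('{"thinking": "no verdict given"}', TocDetectorResponse)
    failed = False
except ValueError:
    failed = True
```

Raising a single exception type is one possible design. Returning a default value and logging the error would be another, depending on how callers are expected to recover.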
Key Changes
Pydantic Validation
- `parse_llm_response(...)` helper to extract, normalize, and validate API outputs
New Pydantic Models
The following models were introduced to enforce LLM response structure:
- `CheckTitleAppearanceResponse`
- `CheckTitleAppearanceInStartResponse`
- `TocDetectorResponse`
- `TocCompletionResponse`
- `PageIndexDetectionResponse`
- `TocIndexItem` / `TocIndexResponse` (list responses)
- `TocTransformerItem` / `TocTransformerResponse`
- `AddPageNumberItem` / `AddPageNumberResponse`
- `SingleTocFixerResponse`
- `GenerateTocItem` / `GenerateTocResponse`
Refactored Functions
Updated the following functions to use validated Pydantic responses:
- `check_title_appearance`
- `check_title_appearance_in_start`
- `toc_detector_single_page`
- `check_if_toc_extraction_is_complete`
- `check_if_toc_transformation_is_complete`
- `detect_page_index`
- `toc_index_extractor`
- `toc_transformer`
- `add_page_number_to_toc`
- `generate_toc_init`
- `generate_toc_continue`
- `single_toc_item_index_fixer`
Bug Fixes
- `toc_index_extractor` prompt variable name
- `add_page_number_to_toc`
- `extract_toc_content`
- Fixed `parse_llm_response()` normalization: title casing is now preserved (only specific fields like `answer`, `start_begin`, etc. are lowercased).
- Standardized validation: all functions now use the `parse_llm_response()` helper consistently.
- Improved error reporting: added logger parameters throughout for consistent error logging.
- Cleaned up code: removed code duplication and manual validation blocks.
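The whitelist-based normalization and the optional logger parameter mentioned above could look roughly like the sketch below. The function name `normalize_fields`, the contents of `NORMALIZABLE_FIELDS` beyond `answer` and `start_begin`, and the logging behavior are assumptions for illustration, not the PR's code.

```python
import logging
from typing import Any, Dict, Optional

# Assumed whitelist: the PR names `answer` and `start_begin`; the real list may be longer.
NORMALIZABLE_FIELDS = frozenset({"answer", "start_begin"})


def normalize_fields(
    data: Dict[str, Any], logger: Optional[logging.Logger] = None
) -> Dict[str, Any]:
    """Lowercase and strip only whitelisted string fields; leave everything else as-is."""
    normalized: Dict[str, Any] = {}
    for key, value in data.items():
        if key in NORMALIZABLE_FIELDS and isinstance(value, str):
            cleaned = value.strip().lower()
            if logger is not None and cleaned != value:
                logger.debug("normalized %s: %r -> %r", key, value, cleaned)
            normalized[key] = cleaned
        else:
            # Titles and other free text keep their original casing.
            normalized[key] = value
    return normalized


result = normalize_fields(
    {"answer": " Yes ", "title": "Chapter One: The BEGINNING"},
    logger=logging.getLogger("page_index"),
)
```

Restricting normalization to a whitelist means a yes/no style field can be compared reliably while section titles pass through untouched.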