Add Pydantic validation for LLM responses in page_index by adityasasidhar · Pull Request #100 · VectifyAI/PageIndex

adityasasidhar · 2026-02-04T15:45:23Z

Pull Request Summary

This PR improves the reliability of pageindex/page_index.py by introducing Pydantic-based validation for LLM-generated JSON outputs and fixing several related correctness issues.

Previously, the file relied on system prompt–defined output formats without validation. Since LLMs can return malformed or partial JSON, this approach was fragile and could lead to silent failures. This change enforces strict schemas and makes failures deterministic and easier to debug.

Key Changes

Pydantic Validation

Added Pydantic models to strictly validate most LLM responses
Introduced a parse_llm_response(...) helper to extract, normalize, and validate API outputs
Replaced manual JSON parsing with schema-enforced validation
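
As a rough illustration of the extract-then-validate flow described above, here is a minimal sketch assuming Pydantic v2. The signature of `parse_llm_response`, the fence-stripping step, and the field names on `TocDetectorResponse` are assumptions made for this example and may not match the code in this PR.

```python
import json
import re
from typing import Optional, Type, TypeVar

from pydantic import BaseModel, ValidationError

T = TypeVar("T", bound=BaseModel)

# Build the Markdown fence marker programmatically to keep this snippet easy to embed.
FENCE = "`" * 3
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


class TocDetectorResponse(BaseModel):
    # Field names are assumptions for illustration; the PR's real schema may differ.
    thinking: str = ""
    toc_detected: str


def parse_llm_response(raw: str, model: Type[T]) -> Optional[T]:
    """Extract JSON from raw LLM text and validate it against `model`.

    Returns None when the text is not parseable JSON or fails schema validation.
    """
    # LLMs often wrap JSON in a fenced block; unwrap it if present.
    match = FENCED_JSON.search(raw)
    payload = match.group(1) if match else raw
    try:
        data = json.loads(payload.strip())
        return model.model_validate(data)
    except (json.JSONDecodeError, ValidationError):
        return None


fenced = FENCE + 'json\n{"thinking": "has a contents page", "toc_detected": "yes"}\n' + FENCE
ok = parse_llm_response(fenced, TocDetectorResponse)
truncated = parse_llm_response('{"thinking": "cut off"', TocDetectorResponse)
incomplete = parse_llm_response('{"thinking": "no verdict"}', TocDetectorResponse)
```

In this sketch both malformed JSON and schema violations collapse to `None`; a real helper might instead log the error or re-raise so the caller can retry.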

New Pydantic Models

The following models were introduced to enforce LLM response structure:

CheckTitleAppearanceResponse
CheckTitleAppearanceInStartResponse
TocDetectorResponse
TocCompletionResponse
PageIndexDetectionResponse
TocIndexItem / TocIndexResponse (list responses)
TocTransformerItem / TocTransformerResponse
AddPageNumberItem / AddPageNumberResponse
SingleTocFixerResponse
GenerateTocItem / GenerateTocResponse
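
For the list responses, one way to pair an item model with a list wrapper is Pydantic v2's `RootModel`. The sketch below is illustrative only: the field names on `TocIndexItem` are assumed, not taken from this PR.

```python
from typing import List, Optional

from pydantic import BaseModel, RootModel


class TocIndexItem(BaseModel):
    # Field names are assumed for illustration only.
    structure: Optional[str] = None
    title: str
    physical_index: Optional[str] = None


class TocIndexResponse(RootModel[List[TocIndexItem]]):
    """Validates a bare JSON array of TOC items."""


raw_items = [
    {"structure": "1", "title": "Introduction", "physical_index": "<physical_index_5>"},
    {"structure": "1.1", "title": "Background"},
]

# The validated list is available on the `.root` attribute.
response = TocIndexResponse.model_validate(raw_items)
titles = [item.title for item in response.root]
```

Wrapping the list in a root model means a response that is a JSON array can go through the same validation path as one that is a JSON object.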

Refactored Functions

Updated the following functions to use validated Pydantic responses:

check_title_appearance
check_title_appearance_in_start
toc_detector_single_page
check_if_toc_extraction_is_complete
check_if_toc_transformation_is_complete
detect_page_index
toc_index_extractor
toc_transformer
add_page_number_to_toc
generate_toc_init
generate_toc_continue
single_toc_item_index_fixer

Bug Fixes

Fixed typo in toc_index_extractor prompt variable name
Avoided variable shadowing in add_page_number_to_toc
Corrected retry limit mismatch in extract_toc_content

and

Fixed parse_llm_response() normalization - Title casing is now preserved (only specific fields like answer, start_begin, etc. are lowercased)
Standardized validation - All functions now use the parse_llm_response() helper consistently
Improved error reporting - Added logger parameters throughout for consistent error logging
Cleaned up code - Removed code duplication and manual validation blocks
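
The whitelist-based normalization could look roughly like the following sketch. The helper name `normalize_fields` and the constant `NORMALIZABLE_FIELDS` are hypothetical, and the whitelist contains only the two fields named above; the full list in the PR is not shown here.

```python
from typing import Any, Dict

# Assumed whitelist: only the two fields named in the PR description.
NORMALIZABLE_FIELDS = frozenset({"answer", "start_begin"})


def normalize_fields(data: Dict[str, Any]) -> Dict[str, Any]:
    """Lowercase and trim whitelisted string fields; leave everything else untouched."""
    normalized: Dict[str, Any] = {}
    for key, value in data.items():
        if key in NORMALIZABLE_FIELDS and isinstance(value, str):
            normalized[key] = value.strip().lower()
        else:
            # Titles and other free text keep their original casing.
            normalized[key] = value
    return normalized


result = normalize_fields({"answer": " Yes ", "title": "Chapter 1: The Beginning"})
```

Restricting normalization to a whitelist keeps yes/no style fields comparable while leaving section titles exactly as the model returned them.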

- Introduced Pydantic models for all LLM JSON responses to ensure type safety and structural integrity.
- Replaced manual `extract_json` calls with `parse_llm_response` and specific Pydantic models.
- Defined models: `CheckTitleAppearanceResponse`, `TocDetectorResponse`, `TocIndexResponse`, `TocTransformerResponse`, `AddPageNumberResponse`, `GenerateTocResponse`, and others.
- Updated core functions including:
  - `check_title_appearance`
  - `toc_detector_single_page`
  - `toc_index_extractor`
  - `toc_transformer`
  - `add_page_number_to_toc`
  - `generate_toc_init`
  - `single_toc_item_index_fixer`
- Implemented robust handling for list-based LLM outputs by wrapping them in root models.

- Fix typo: `tob_extractor_prompt` -> `toc_extractor_prompt`.

- Fix parse_llm_response() to preserve title casing
- Add whitelist for normalizable fields (answer, start_begin, etc.)
- Standardize all validation to use parse_llm_response() helper
- Add logger parameter to validation functions for consistent error reporting
- Update all function call sites to pass logger parameter

rejojer and others added 4 commits February 4, 2026 21:04

Fix KeyError when TOC items lack page numbers

6fecaeb

fix: address imports, logic errors, and typos in page_index.py

cd1156b

- Fix typo: `tob_extractor_prompt` -> `toc_extractor_prompt`.

adityasasidhar wants to merge 4 commits into VectifyAI:main from adityasasidhar:main
