A professional document processing validation tool with multi-library support, AI-powered image captioning, and advanced chunking strategies for testing document extraction workflows.

RaviVaishnav20/doc-process-validator

Document Process Validator

A comprehensive testing and validation platform for document processing pipelines. Test and compare multiple document extraction libraries, AI-powered image captioning methods, and text chunking strategies in a professional web interface.

🚀 Features

Document Processing

  • Multi-Library Support: Test LangChain Docling, PyMuPDF4LLM, and more
  • Format Conversion: Convert documents to Markdown, Plain Text, or JSON
  • Supported Formats: PDF, DOCX, TXT, Markdown

AI-Powered Image Captioning

  • OCR Methods: Tesseract (fast), EasyOCR (multi-language)
  • LLM-Based Captioning: OpenAI GPT-4 Vision, Anthropic Claude Vision, Ollama (local)
  • Smart Image Filtering: Automatically ignores icons and small images
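
A minimal sketch of how size-based image filtering might work before captioning. The `ImageInfo` type, the threshold names, and their values are assumptions made for illustration; the project's actual thresholds live in backend/config.py and may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ImageInfo:
    """Hypothetical record describing an image found in a document."""
    name: str
    width: int
    height: int


# Assumed thresholds, in pixels. These are illustrative values only.
MIN_WIDTH = 100
MIN_HEIGHT = 100
MIN_AREA = 20_000


def is_captionable(img: ImageInfo,
                   min_width: int = MIN_WIDTH,
                   min_height: int = MIN_HEIGHT,
                   min_area: int = MIN_AREA) -> bool:
    """Return True when the image is large enough to be worth captioning."""
    return (
        img.width >= min_width
        and img.height >= min_height
        and img.width * img.height >= min_area
    )


def filter_images(images: List[ImageInfo]) -> List[ImageInfo]:
    """Drop icons, rules, and other small decorative images."""
    return [img for img in images if is_captionable(img)]


if __name__ == "__main__":
    found = [
        ImageInfo("icon.png", 32, 32),
        ImageInfo("chart.png", 800, 600),
        ImageInfo("rule.png", 1200, 4),
    ]
    for img in filter_images(found):
        print(img.name)
```

Checking both the individual dimensions and the total area is one way to reject thin separator graphics as well as small icons.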

Web Content Extraction

  • Trafilatura Integration: Extract clean content from webpages
  • Image Processing: Caption images from web sources
  • Markdown Output: Clean, structured markdown format

Text Chunking Strategies

  • 8 Chunking Methods: Sentence, Paragraph, Character, Token, Recursive, Markdown Headers, Semantic, JSON
  • Configurable Parameters: Chunk size, overlap, and method-specific options
  • Real-time Statistics: View chunk count, average length, and metadata
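
To illustrate how chunk size and overlap interact, here is a rough sketch of the Character method together with the statistics listed above. The function names and default values are assumptions, not the project's actual implementation.

```python
from typing import Dict, List


def chunk_by_characters(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text,
        # so no trailing chunk is made up purely of overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


def chunk_stats(chunks: List[str]) -> Dict[str, float]:
    """Compute chunk count and average chunk length."""
    count = len(chunks)
    average = sum(len(c) for c in chunks) / count if count else 0.0
    return {"chunk_count": count, "average_length": average}


if __name__ == "__main__":
    pieces = chunk_by_characters("abcdefghij", chunk_size=4, overlap=1)
    print(pieces)
    print(chunk_stats(pieces))
```

Each chunk starts `chunk_size - overlap` characters after the previous one, so a larger overlap produces more chunks for the same input.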

Professional UI

  • Modern Interface: Beautiful gradient design with smooth animations
  • Side-by-Side Preview: View original and processed content simultaneously
  • Progress Tracking: Real-time processing status with loading indicators
  • File Management: Download original and processed files

🛠️ Technology Stack

Backend

  • FastAPI: High-performance Python web framework
  • LangChain: Document loading and processing
  • PyMuPDF4LLM: Advanced PDF processing
  • Trafilatura: Web content extraction
  • Tesseract/EasyOCR: OCR capabilities
  • OpenAI/Anthropic/Ollama: LLM-based image captioning

Frontend

  • React: Modern UI library
  • Lucide Icons: Beautiful icon set
  • Axios: HTTP client
  • React Router: Navigation

📦 Installation

Backend Setup

cd backend
uv sync
uv run main.py

Frontend Setup

cd frontend
npm install
npm start

🎯 Use Cases

  • Testing Document Parsers: Compare output quality across different libraries
  • Validating Extraction Pipelines: Ensure your document processing workflow works correctly
  • Benchmarking Performance: Test different methods for speed and accuracy
  • Prototyping: Quickly test different approaches before production implementation
  • Research & Development: Experiment with various AI models and techniques

🔧 Configuration

  • Configure OCR providers and LLM models via UI
  • Set image size thresholds in backend/config.py
  • Add custom chunking methods in backend/utils/chunker.py
  • Extend library support in backend/utils/document_processor.py
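
One possible way to structure custom chunking methods is a small name-to-function registry, sketched below. The registry, decorator, and function names are hypothetical; check backend/utils/chunker.py for how methods are actually organized before adapting this.

```python
import re
from typing import Callable, Dict, List

Chunker = Callable[[str], List[str]]

# Hypothetical registry mapping a method name to its chunking function.
CHUNKERS: Dict[str, Chunker] = {}


def register_chunker(name: str) -> Callable[[Chunker], Chunker]:
    """Decorator that registers a chunking function under `name`."""
    def decorator(func: Chunker) -> Chunker:
        CHUNKERS[name] = func
        return func
    return decorator


@register_chunker("paragraph")
def chunk_by_paragraph(text: str) -> List[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


@register_chunker("line")
def chunk_by_line(text: str) -> List[str]:
    """Example of a custom method: one chunk per non-empty line."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def run_chunker(method: str, text: str) -> List[str]:
    """Look up a registered method by name and apply it to the text."""
    try:
        chunker = CHUNKERS[method]
    except KeyError:
        raise ValueError(f"Unknown chunking method: {method!r}") from None
    return chunker(text)


if __name__ == "__main__":
    print(run_chunker("paragraph", "First.\n\nSecond."))
    print(sorted(CHUNKERS))
```

With this layout, adding a method means writing one decorated function, and the lookup in `run_chunker` needs no changes.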

📝 API Endpoints

  • POST /process-document: Process uploaded documents
  • POST /extract-webpage: Extract web content
  • POST /chunk/{method}: Perform text chunking
  • GET /libraries: Get available processing libraries
  • GET /methods: Get available chunking methods
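
As a rough illustration of calling these endpoints from Python, the sketch below builds requests using only the standard library. The base URL, port, and JSON field names (`text`, `chunk_size`, `overlap`) are assumptions; the README does not document the request schema, so confirm it against the backend before relying on this.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Assumed local address of the backend; adjust to your setup.
BASE_URL = "http://localhost:8000"


def build_chunk_request(method: str, text: str, chunk_size: int = 500,
                        overlap: int = 50, base_url: str = BASE_URL) -> Request:
    """Build a POST /chunk/{method} request with an assumed JSON body."""
    payload = {"text": text, "chunk_size": chunk_size, "overlap": overlap}
    return Request(
        url=f"{base_url}/chunk/{quote(method, safe='')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_get_request(path: str, base_url: str = BASE_URL) -> Request:
    """Build a GET request for endpoints such as /libraries or /methods."""
    return Request(url=f"{base_url}{path}", method="GET")


if __name__ == "__main__":
    req = build_chunk_request("sentence", "Hello. World.")
    print(req.get_method(), req.full_url)
    print(build_get_request("/methods").full_url)
```

Once the backend is running, a request built this way can be sent with `urllib.request.urlopen(req)` and the response body decoded with `json.loads`.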

🌟 Key Highlights

  • ✅ No vendor lock-in - test multiple libraries side-by-side
  • ✅ Easy integration - modular design for adding new techniques
  • ✅ Production-ready - proper error handling and logging
  • ✅ Self-hosted - run locally with full control over your data
  • ✅ Extensible - add custom processors and chunking methods

📄 License

MIT License - See LICENSE file for details

🤝 Contributing

Contributions welcome! Please read CONTRIBUTING.md for guidelines.

💡 Future Enhancements

  • Batch processing support
  • Performance metrics and benchmarking
  • Export comparison reports
  • Custom chunking strategy builder
  • Docker containerization
  • API key management UI

📧 Support

For issues and questions, please open a GitHub issue.


Built for testers, developers, and researchers working with document processing pipelines.


## GitHub Topics/Tags:

document-processing pdf-extraction text-chunking ocr langchain fastapi react ai-captioning document-validation testing-tools markdown-converter web-scraping nlp machine-learning

