Document Process Validator

A testing and validation platform for document processing pipelines. Use it to test and compare document extraction libraries, AI-powered image captioning methods, and text chunking strategies side by side in a web interface.

🚀 Features

Document Processing

Multi-Library Support: Test LangChain Docling, PyMuPDF4LLM, and more
Format Conversion: Convert documents to Markdown, Plain Text, or JSON
Supported Formats: PDF, DOCX, TXT, Markdown

AI-Powered Image Captioning

OCR Methods: Tesseract (fast), EasyOCR (multi-language)
LLM-Based Captioning: OpenAI GPT-4 Vision, Anthropic Claude Vision, Ollama (local)
Smart Image Filtering: Automatically ignores icons and small images
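
The smart image filtering idea can be illustrated with a short sketch. This is not the project's actual implementation: the `ImageInfo` type, the `should_caption` and `filter_images` helpers, and the 100-pixel thresholds are all hypothetical, chosen only to show how a size threshold might keep icons and thin banners out of the captioning step.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration; the project's real values
# are configured in backend/config.py and may differ.
MIN_WIDTH = 100
MIN_HEIGHT = 100


@dataclass
class ImageInfo:
    """Minimal description of an extracted image (hypothetical type)."""
    name: str
    width: int
    height: int


def should_caption(image: ImageInfo,
                   min_width: int = MIN_WIDTH,
                   min_height: int = MIN_HEIGHT) -> bool:
    """Return True when the image is large enough in both dimensions."""
    return image.width >= min_width and image.height >= min_height


def filter_images(images: list[ImageInfo],
                  min_width: int = MIN_WIDTH,
                  min_height: int = MIN_HEIGHT) -> list[ImageInfo]:
    """Keep only the images that pass the size threshold."""
    return [img for img in images if should_caption(img, min_width, min_height)]
```

With these assumed thresholds, a 32x32 icon and a 1200x60 banner would be skipped, while an 800x600 chart would be sent on for captioning.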

Web Content Extraction

Trafilatura Integration: Extract clean content from webpages
Image Processing: Caption images from web sources
Markdown Output: Clean, structured markdown format

Text Chunking Strategies

8 Chunking Methods: Sentence, Paragraph, Character, Token, Recursive, Markdown Headers, Semantic, JSON
Configurable Parameters: Chunk size, overlap, and method-specific options
Real-time Statistics: View chunk count, average length, and metadata
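
To make the chunk size and overlap parameters concrete, here is a minimal sketch of character-based chunking with the kind of statistics listed above. It is an illustration under assumptions, not the project's code: the function names `chunk_by_characters` and `chunk_stats` and the default values are hypothetical.

```python
from statistics import mean


def chunk_by_characters(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows.

    Consecutive chunks share `overlap` characters, so the window
    advances by `chunk_size - overlap` characters each step.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def chunk_stats(chunks: list[str]) -> dict:
    """Summarize a chunk list: how many chunks and their average length."""
    lengths = [len(c) for c in chunks]
    return {
        "chunk_count": len(chunks),
        "average_length": mean(lengths) if lengths else 0,
    }
```

For example, `chunk_by_characters("abcdefghij", chunk_size=4, overlap=1)` yields three chunks of four characters each, where each chunk starts with the last character of the previous one.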

Professional UI

Modern Interface: Beautiful gradient design with smooth animations
Side-by-Side Preview: View original and processed content simultaneously
Progress Tracking: Real-time processing status with loading indicators
File Management: Download original and processed files

🛠️ Technology Stack

Backend

FastAPI: High-performance Python web framework
LangChain: Document loading and processing
PyMuPDF4LLM: Advanced PDF processing
Trafilatura: Web content extraction
Tesseract/EasyOCR: OCR capabilities
OpenAI/Anthropic/Ollama: LLM-based image captioning

Frontend

React: Modern UI library
Lucide Icons: Beautiful icon set
Axios: HTTP client
React Router: Navigation

📦 Installation

Backend Setup

cd backend
uv sync
uv run main.py

Frontend Setup

cd frontend
npm install
npm start

🎯 Use Cases

Testing Document Parsers: Compare output quality across different libraries
Validating Extraction Pipelines: Ensure your document processing workflow works correctly
Benchmarking Performance: Test different methods for speed and accuracy
Prototyping: Quickly test different approaches before production implementation
Research & Development: Experiment with various AI models and techniques

🔧 Configuration

Configure OCR providers and LLM models via the UI
Set image size thresholds in backend/config.py
Add custom chunking methods in backend/utils/chunker.py
Extend library support in backend/utils/document_processor.py
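
One way a custom chunking method might be plugged in is through a small registry, sketched below. The actual structure of backend/utils/chunker.py is not shown in this README, so everything here (the `CHUNKERS` dictionary, the `register_chunker` decorator, and the `chunk` dispatcher) is a hypothetical design, not the project's API.

```python
import re
from typing import Callable

# A chunker takes the full text and returns a list of chunks.
Chunker = Callable[[str], list[str]]

# Hypothetical registry mapping a method name to its chunker function.
CHUNKERS: dict[str, Chunker] = {}


def register_chunker(name: str) -> Callable[[Chunker], Chunker]:
    """Decorator that records a chunker under the given method name."""
    def decorator(func: Chunker) -> Chunker:
        CHUNKERS[name] = func
        return func
    return decorator


@register_chunker("paragraph")
def chunk_by_paragraph(text: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


@register_chunker("sentence")
def chunk_by_sentence(text: str) -> list[str]:
    """Naive split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk(method: str, text: str) -> list[str]:
    """Look up the named chunker and apply it to the text."""
    try:
        chunker = CHUNKERS[method]
    except KeyError:
        raise ValueError(f"Unknown chunking method: {method!r}") from None
    return chunker(text)
```

With a design like this, adding a method means writing one decorated function; the dispatcher and any endpoint that lists available methods would pick it up from the registry without further changes.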

📝 API Endpoints

POST /process-document: Process uploaded documents
POST /extract-webpage: Extract web content
POST /chunk/{method}: Perform text chunking
GET /libraries: Get available processing libraries
GET /methods: Get available chunking methods
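
The endpoints above could be called from a small client. The sketch below only builds the requests using the Python standard library. The README does not document the request schema, so the base URL `http://localhost:8000`, the JSON field names `text`, `chunk_size`, and `overlap`, and the method name `sentence` are assumptions for illustration; check the backend code for the real payload.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Assumption: the backend listens locally on port 8000.
BASE_URL = "http://localhost:8000"


def build_chunk_request(method: str, text: str, chunk_size: int = 500,
                        overlap: int = 50, base_url: str = BASE_URL) -> Request:
    """Build a POST /chunk/{method} request with a JSON body.

    The body field names are hypothetical; the real schema is not
    documented in the README.
    """
    url = f"{base_url.rstrip('/')}/chunk/{quote(method, safe='')}"
    payload = {"text": text, "chunk_size": chunk_size, "overlap": overlap}
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_list_request(resource: str, base_url: str = BASE_URL) -> Request:
    """Build a GET request for /libraries or /methods."""
    if resource not in ("libraries", "methods"):
        raise ValueError("resource must be 'libraries' or 'methods'")
    return Request(f"{base_url.rstrip('/')}/{resource}", method="GET")
```

Once the backend is running, a request built this way can be sent with `urllib.request.urlopen(build_list_request("methods"))` and the response body decoded as needed.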

🌟 Key Highlights

✅ No vendor lock-in - test multiple libraries side-by-side
✅ Easy integration - modular design for adding new techniques
✅ Production-ready - proper error handling and logging
✅ Self-hosted - run locally with full control over your data
✅ Extensible - add custom processors and chunking methods

📄 License

MIT License - See LICENSE file for details

🤝 Contributing

Contributions welcome! Please read CONTRIBUTING.md for guidelines.

💡 Future Enhancements

📧 Support

For issues and questions, please open a GitHub issue.

Built for testers, developers, and researchers working with document processing pipelines.


## GitHub Topics/Tags:

document-processing pdf-extraction text-chunking ocr langchain fastapi react ai-captioning document-validation testing-tools markdown-converter web-scraping nlp machine-learning


## Social Media Description (Twitter/LinkedIn):

🚀 Just built doc-process-validator - a professional tool to test & compare document processing libraries, AI image captioning (OCR + LLM), and chunking strategies. Perfect for validating extraction pipelines!

✨ Features: Multi-library support, 8 chunking methods, web extraction, modern UI

#DocumentProcessing #AI #OpenSource

RaviVaishnav20/doc-process-validator
