A comprehensive testing and validation platform for document processing pipelines. Test and compare multiple document extraction libraries, AI-powered image captioning methods, and text chunking strategies in a professional web interface.
- Multi-Library Support: Test LangChain Docling, PyMuPDF4LLM, and more
- Format Conversion: Convert documents to Markdown, Plain Text, or JSON
- Supported Formats: PDF, DOCX, TXT, Markdown
- OCR Methods: Tesseract (fast), EasyOCR (multi-language)
- LLM-Based Captioning: OpenAI GPT-4 Vision, Anthropic Claude Vision, Ollama (local)
- Smart Image Filtering: Automatically ignores icons and small images
- Trafilatura Integration: Extract clean content from webpages
- Image Processing: Caption images from web sources
- Markdown Output: Clean, structured markdown format
- 8 Chunking Methods: Sentence, Paragraph, Character, Token, Recursive, Markdown Headers, Semantic, JSON
- Configurable Parameters: Chunk size, overlap, and method-specific options
- Real-time Statistics: View chunk count, average length, and metadata
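
As a rough illustration of how chunk size and overlap interact, here is a minimal sketch of character-based chunking together with the chunk count and average length statistics. The function names `chunk_by_characters` and `chunk_statistics` are hypothetical and are not taken from the project's code; the actual implementation may differ.

```python
from statistics import mean


def chunk_by_characters(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def chunk_statistics(chunks: list[str]) -> dict:
    """Summarize a chunk list with its count and average chunk length."""
    return {
        "chunk_count": len(chunks),
        "average_length": mean(len(c) for c in chunks) if chunks else 0,
    }


if __name__ == "__main__":
    sample = "abcdefghij"
    pieces = chunk_by_characters(sample, chunk_size=4, overlap=1)
    print(pieces)
    print(chunk_statistics(pieces))
```

A larger overlap repeats more context between neighbouring chunks at the cost of producing more chunks for the same input.
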
- Modern Interface: Beautiful gradient design with smooth animations
- Side-by-Side Preview: View original and processed content simultaneously
- Progress Tracking: Real-time processing status with loading indicators
- File Management: Download original and processed files
- FastAPI: High-performance Python web framework
- LangChain: Document loading and processing
- PyMuPDF4LLM: Advanced PDF processing
- Trafilatura: Web content extraction
- Tesseract/EasyOCR: OCR capabilities
- OpenAI/Anthropic/Ollama: LLM-based image captioning
- React: Modern UI library
- Lucide Icons: Beautiful icon set
- Axios: HTTP client
- React Router: Navigation
cd backend
uv sync
uv run main.py

cd frontend
npm install
npm start

- Testing Document Parsers: Compare output quality across different libraries
- Validating Extraction Pipelines: Ensure your document processing workflow works correctly
- Benchmarking Performance: Test different methods for speed and accuracy
- Prototyping: Quickly test different approaches before production implementation
- Research & Development: Experiment with various AI models and techniques
- Configure OCR providers and LLM models via UI
- Set image size thresholds in `backend/config.py`
- Add custom chunking methods in `backend/utils/chunker.py`
- Extend library support in `backend/utils/document_processor.py`
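
To give a sense of what an image size threshold might do, the sketch below skips icons and other small images before captioning. The constants `MIN_IMAGE_WIDTH` and `MIN_IMAGE_HEIGHT`, the `ImageInfo` class, and both functions are assumptions made for illustration; the real names and values in `backend/config.py` may be different.

```python
from dataclasses import dataclass

# Hypothetical thresholds in pixels; not taken from the project's config file.
MIN_IMAGE_WIDTH = 100
MIN_IMAGE_HEIGHT = 100


@dataclass
class ImageInfo:
    """Minimal description of an extracted image (illustrative only)."""
    name: str
    width: int
    height: int


def should_caption(image: ImageInfo,
                   min_width: int = MIN_IMAGE_WIDTH,
                   min_height: int = MIN_IMAGE_HEIGHT) -> bool:
    """Return True when the image meets both minimum dimensions."""
    return image.width >= min_width and image.height >= min_height


def filter_images(images: list[ImageInfo]) -> list[ImageInfo]:
    """Keep only the images that are large enough to be worth captioning."""
    return [img for img in images if should_caption(img)]


if __name__ == "__main__":
    extracted = [
        ImageInfo("favicon.png", 32, 32),
        ImageInfo("architecture-diagram.png", 800, 600),
        ImageInfo("divider.png", 500, 50),
    ]
    for img in filter_images(extracted):
        print(img.name)
```

Requiring both dimensions to pass also filters out thin banners and dividers, which are wide but carry little content.
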
- POST /process-document: Process uploaded documents
- POST /extract-webpage: Extract web content
- POST /chunk/{method}: Perform text chunking
- GET /libraries: Get available processing libraries
- GET /methods: Get available chunking methods
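
The following sketch shows one way a client could build requests for the `POST /chunk/{method}` and `GET /methods` endpoints listed above, using only the Python standard library. The base URL `http://localhost:8000` and the JSON field names `text`, `chunk_size`, and `overlap` are assumptions made for this example; check the backend code for the actual request schema.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: the backend listens here; adjust to your setup.
BASE_URL = "http://localhost:8000"


def build_chunk_request(method: str, text: str,
                        chunk_size: int = 500, overlap: int = 50) -> Request:
    """Build a POST request for /chunk/{method} with a JSON body."""
    # Hypothetical field names; the real API may expect a different payload.
    payload = {"text": text, "chunk_size": chunk_size, "overlap": overlap}
    return Request(
        url=f"{BASE_URL}/chunk/{quote(method, safe='')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_methods_request() -> Request:
    """Build a GET request for the list of available chunking methods."""
    return Request(url=f"{BASE_URL}/methods", method="GET")


def send(request: Request) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires the backend to be running.
    print(send(build_methods_request()))
    print(send(build_chunk_request("sentence", "First sentence. Second sentence.")))
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload without a running server.
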
- ✅ No vendor lock-in - test multiple libraries side-by-side
- ✅ Easy integration - modular design for adding new techniques
- ✅ Production-ready - proper error handling and logging
- ✅ Self-hosted - run locally with full control over your data
- ✅ Extensible - add custom processors and chunking methods
MIT License - See LICENSE file for details
Contributions welcome! Please read CONTRIBUTING.md for guidelines.
- Batch processing support
- Performance metrics and benchmarking
- Export comparison reports
- Custom chunking strategy builder
- Docker containerization
- API key management UI
For issues and questions, please open a GitHub issue.
Built for testers, developers, and researchers working with document processing pipelines.
## GitHub Topics/Tags:
document-processing pdf-extraction text-chunking ocr langchain fastapi react ai-captioning document-validation testing-tools markdown-converter web-scraping nlp machine-learning
## Social Media Description (Twitter/LinkedIn):
🚀 Just built doc-process-validator - a professional tool to test & compare document processing libraries, AI image captioning (OCR + LLM), and chunking strategies. Perfect for validating extraction pipelines!
✨ Features: Multi-library support, 8 chunking methods, web extraction, modern UI
#DocumentProcessing #AI #OpenSource