@lesyk lesyk commented Jan 26, 2026

This pull request adds OCR support to MarkItDown, extracting text from images embedded in documents (PDF, DOCX, XLSX, PPTX) through several backends (Tesseract, EasyOCR, LLM Vision, Azure Document Intelligence). It adds a unified OCR service layer that falls back gracefully from one backend to another, detects and handles scanned PDFs automatically, and extracts image text inline in DOCX files. The README is updated with detailed usage examples for the new OCR features.
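
For reviewers, the fallback between backends described above can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: `OcrResult`, `ocr_with_fallback`, and the backend callables are hypothetical names, and it assumes a backend either raises or returns empty text when it cannot process an image.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type for illustration: a backend takes image bytes and returns text.
OcrBackend = Callable[[bytes], str]


@dataclass
class OcrResult:
    text: str
    backend: Optional[str]  # name of the backend that produced the text, or None


def ocr_with_fallback(
    image: bytes, backends: Sequence[Tuple[str, OcrBackend]]
) -> OcrResult:
    """Try each backend in order and return the first non-empty result."""
    for name, backend in backends:
        try:
            text = backend(image).strip()
        except Exception:
            # Assumption: a missing dependency or failed API call should not
            # abort the conversion, so move on to the next backend.
            continue
        if text:
            return OcrResult(text=text, backend=name)
    # No backend produced text: degrade to an empty result instead of raising.
    return OcrResult(text="", backend=None)
```

The order of `backends` sets the priority, so a cheap local engine can be listed ahead of a remote service.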

#1344

lesyk and others added 4 commits January 26, 2026 19:44
- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.
- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.
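
The positioning and base64 checks mentioned in the commits above could look something like the sketch below. It is an illustration only and not taken from `test_ocr.py`: the helper names `has_base64_image` and `appears_between` are hypothetical, and it assumes the converter output is a plain Markdown string.

```python
import re

# Matches inline data URIs such as "data:image/png;base64,...".
BASE64_IMAGE = re.compile(r"data:image/[a-zA-Z0-9.+-]+;base64,")


def has_base64_image(markdown: str) -> bool:
    """True if the output still contains an inline base64 image."""
    return BASE64_IMAGE.search(markdown) is not None


def appears_between(markdown: str, before: str, ocr_text: str, after: str) -> bool:
    """True if ocr_text occurs after `before` and ahead of `after`."""
    start = markdown.find(before)
    if start == -1:
        return False
    middle = markdown.find(ocr_text, start + len(before))
    if middle == -1:
        return False
    return markdown.find(after, middle + len(ocr_text)) != -1
```

A test would then assert `not has_base64_image(output)` and `appears_between(output, "text before", "OCR text", "text after")` for each fixture.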
@lesyk lesyk marked this pull request as ready for review January 27, 2026 10:21
@lesyk lesyk changed the title from "Add OCR test data and implement tests for various document formats" to "Add OCR service for embedded images and PDF scans" Jan 27, 2026
@lesyk lesyk changed the title from "Add OCR service for embedded images and PDF scans" to "Add OCR layer service for embedded images and PDF scans" Jan 27, 2026
@lesyk lesyk changed the title from "Add OCR layer service for embedded images and PDF scans" to "[MS] Add OCR layer service for embedded images and PDF scans" Jan 27, 2026