[MS] Add OCR layer service for embedded images and PDF scans #1541

lesyk · 2026-01-26T18:45:18Z

This pull request introduces comprehensive OCR support to MarkItDown, enabling extraction of text from images embedded in documents (PDF, DOCX, XLSX, PPTX) using multiple backends (Tesseract, EasyOCR, LLM Vision, Azure Document Intelligence). It adds a unified OCR service layer with graceful fallback between backends, automatic detection and handling of scanned PDFs, and inline extraction of image text in DOCX files. The README is updated with detailed usage examples for the new OCR features.
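The "graceful fallback between backends" described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `OcrBackend`, `OcrResult`, and `ocr_with_fallback` are hypothetical names, and the backends are stand-in stubs rather than real Tesseract or EasyOCR calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type for illustration: a backend takes image bytes and returns text.
OcrBackend = Callable[[bytes], str]


@dataclass
class OcrResult:
    text: str
    backend: Optional[str]  # name of the backend that produced the text, None if all failed


def ocr_with_fallback(
    image: bytes, backends: Sequence[Tuple[str, OcrBackend]]
) -> OcrResult:
    """Try each backend in order and return the first non-empty result."""
    for name, backend in backends:
        try:
            text = backend(image).strip()
        except Exception:
            # A missing dependency or failed API call should not abort the conversion.
            continue
        if text:
            return OcrResult(text=text, backend=name)
    # Every backend failed or returned nothing: report an empty result.
    return OcrResult(text="", backend=None)


# Stub backends (assumptions, standing in for real OCR engines).
def broken_backend(image: bytes) -> str:
    raise RuntimeError("backend not installed")


def stub_backend(image: bytes) -> str:
    return "  Invoice 42  "


result = ocr_with_fallback(
    b"fake-image-bytes",
    [("broken", broken_backend), ("stub", stub_backend)],
)
```

Returning an empty result instead of raising keeps the surrounding document conversion going when no backend is available, which is one plausible reading of "graceful" here.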

#1344

- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.

- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.
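The two test checks mentioned above (no base64 images in the output, and OCR text positioned correctly relative to surrounding text) might look something like the sketch below. The helper names `has_base64_image` and `appears_between` are hypothetical and are not taken from `test_ocr.py`; the sample Markdown is made up for illustration.

```python
import re

# Matches the start of an inline data URI such as "data:image/png;base64,....".
BASE64_IMAGE = re.compile(r"data:image/[a-zA-Z0-9.+-]+;base64,")


def has_base64_image(markdown: str) -> bool:
    """Return True if the converted output still embeds a base64 image."""
    return BASE64_IMAGE.search(markdown) is not None


def appears_between(markdown: str, before: str, ocr_text: str, after: str) -> bool:
    """Return True if ocr_text occurs after `before` and ahead of `after`."""
    start = markdown.find(before)
    if start < 0:
        return False
    ocr_pos = markdown.find(ocr_text, start + len(before))
    if ocr_pos < 0:
        return False
    return markdown.find(after, ocr_pos + len(ocr_text)) >= 0


# Made-up converter output used only to exercise the helpers.
sample = "Intro paragraph.\n\nTotal due: 42 EUR\n\nClosing paragraph."

no_inline_images = not has_base64_image(sample)
in_context = appears_between(
    sample, "Intro paragraph.", "Total due: 42 EUR", "Closing paragraph."
)
```

Checking order by string position is a simple proxy for "context preservation"; a real suite may compare against full expected outputs instead.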


lesyk and others added 4 commits January 26, 2026 19:44

Merge branch 'main' into u/vilesyk/inline_image

2e83594

Add support for scanned PDFs with full-page OCR fallback and implemen…

f4fab9b

…t tests

lesyk marked this pull request as ready for review January 27, 2026 10:21

zashed approved these changes Jan 27, 2026


lesyk changed the title ~~Add OCR test data and implement tests for various document formats~~ Add OCR service for embedded images and PDF scans Jan 27, 2026

lesyk changed the title ~~Add OCR service for embedded images and PDF scans~~ Add OCR layer service for embedded images and PDF scans Jan 27, 2026

lesyk changed the title ~~Add OCR layer service for embedded images and PDF scans~~ [MS] Add OCR layer service for embedded images and PDF scans Jan 27, 2026
