Open
Labels
enhancement: New feature or request
Description
Problem
Users need to process large documents (300-500 pages, 200-300 MB), but current OCR services impose size limits:
- Mistral OCR: max 50 MB, 1000 pages
- Most vision models struggle with very large documents
Proposed Solution
Add document chunking/splitting functionality as a pre-processing step:
Core Features
- Split large PDFs by page count (configurable, e.g., 100 pages per chunk)
- Split large PDFs by file size (e.g., max 50 MB per chunk)
- Async processing of chunks with result aggregation
- Progress tracking per chunk via WebSocket
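A minimal sketch of how the two splitting modes could plan chunk boundaries before any PDF bytes are touched. The names `PageRange`, `planChunksByPageCount`, and `planChunksBySize` are hypothetical and not part of any existing API. The size-based planner assumes per-page byte estimates are available, and because PDF pages can share resources such as fonts and images, the real size of each chunk would still need to be checked after splitting.

```typescript
// Hypothetical helpers for illustration only; not an existing API.
interface PageRange {
  start: number; // zero-based index of the first page in the chunk
  end: number;   // zero-based index one past the last page (exclusive)
}

// Split a document of `totalPages` pages into ranges of at most `maxPages` pages.
function planChunksByPageCount(totalPages: number, maxPages: number): PageRange[] {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(maxPages) || maxPages < 1) {
    throw new RangeError("maxPages must be a positive integer");
  }
  const ranges: PageRange[] = [];
  for (let start = 0; start < totalPages; start += maxPages) {
    ranges.push({ start, end: Math.min(start + maxPages, totalPages) });
  }
  return ranges;
}

// Greedily pack consecutive pages into chunks whose estimated size stays
// within `maxBytes`. A single page larger than `maxBytes` still gets its own
// chunk, since a page cannot be split further at this level.
function planChunksBySize(pageSizes: number[], maxBytes: number): PageRange[] {
  const ranges: PageRange[] = [];
  let start = 0;
  let current = 0;
  pageSizes.forEach((size, i) => {
    if (i > start && current + size > maxBytes) {
      ranges.push({ start, end: i });
      start = i;
      current = 0;
    }
    current += size;
  });
  if (start < pageSizes.length) {
    ranges.push({ start, end: pageSizes.length });
  }
  return ranges;
}
```

For a 450-page document with 100 pages per chunk, `planChunksByPageCount(450, 100)` yields five ranges, the last covering pages 400 to 449.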
API Design
const job = await client.jobs.create({
  file: largePdf,
  type: "parse",
  chunking: {
    enabled: true,
    maxPages: 100, // or
    maxSizeMb: 50,
  }
});
Implementation Notes
- Use pdf-lib or similar for splitting without re-rendering
- Queue each chunk as a sub-job
- Aggregate results maintaining page order
- Consider batch processing discount (Mistral: 50% cheaper in batch mode)
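The sub-job and aggregation steps above could look roughly like the sketch below. It assumes each sub-job reports the zero-based index of its first page and one text entry per page. `ChunkResult`, `aggregateChunkResults`, and `processChunks` are hypothetical names, and the `worker` callback stands in for whatever queue or OCR call the service actually uses. The `onProgress` callback marks where a per-chunk WebSocket update might be emitted.

```typescript
// Hypothetical shapes for illustration only; not an existing API.
interface ChunkResult {
  startPage: number; // zero-based index of the chunk's first page in the source document
  pages: string[];   // one parsed entry per page, in order within the chunk
}

// Merge chunk results into a single page-ordered list, regardless of the
// order in which the sub-jobs finished. Throws if pages are missing or overlap.
function aggregateChunkResults(results: ChunkResult[]): string[] {
  const sorted = [...results].sort((a, b) => a.startPage - b.startPage);
  const pages: string[] = [];
  for (const result of sorted) {
    if (result.startPage !== pages.length) {
      throw new Error(`gap or overlap at page ${pages.length}`);
    }
    pages.push(...result.pages);
  }
  return pages;
}

// Run one worker per chunk concurrently, report progress as each finishes,
// then aggregate the results in page order.
async function processChunks<T>(
  chunks: T[],
  worker: (chunk: T, index: number) => Promise<ChunkResult>,
  onProgress?: (done: number, total: number) => void,
): Promise<string[]> {
  let done = 0;
  const results = await Promise.all(
    chunks.map(async (chunk, index) => {
      const result = await worker(chunk, index);
      done += 1;
      onProgress?.(done, chunks.length);
      return result;
    }),
  );
  return aggregateChunkResults(results);
}
```

Validating contiguity during aggregation means a failed or dropped sub-job surfaces as an error rather than as a silently shortened document.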
Use Case
Customer inquiry about processing 300-500 page documents (~200-300 MB)