feat: Add document chunking for large PDFs (300+ pages, 200+ MB) #14

@majcheradam

Problem

Users need to process large documents (300-500 pages, 200-300 MB), but current OCR services impose size limits:

  • Mistral OCR: max 50 MB, 1000 pages
  • Most vision models struggle with very large documents

Proposed Solution

Add document chunking/splitting functionality as a pre-processing step:

Core Features

  • Split large PDFs by page count (configurable, e.g., 100 pages per chunk)
  • Split large PDFs by file size (e.g., max 50 MB per chunk)
  • Async processing of chunks with result aggregation
  • Progress tracking per chunk via WebSocket (see the sketch after this list)
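
A rough client-side sketch of the progress aggregation I have in mind. The ChunkProgress shape, its field names, and trackChunkProgress are invented for illustration; the actual WebSocket payload isn't specified yet:

// Hypothetical progress message; field names are placeholders, not a
// settled wire format.
type ChunkProgress = {
  jobId: string;
  chunkIndex: number;
  totalChunks: number;
  status: "queued" | "processing" | "done" | "failed";
};

// Fold per-chunk events into an overall completion percentage.
function trackChunkProgress(ws: WebSocket, onUpdate: (percent: number) => void): void {
  const completed = new Set<number>();
  ws.addEventListener("message", (event: MessageEvent) => {
    const msg = JSON.parse(event.data) as ChunkProgress;
    if (msg.status === "done") completed.add(msg.chunkIndex);
    onUpdate((completed.size / msg.totalChunks) * 100);
  });
}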

API Design

const job = await client.jobs.create({
  file: largePdf,
  type: "parse",
  chunking: {
    enabled: true,
    maxPages: 100,    // split by page count, or
    maxSizeMb: 50,    // by chunk size
  }
});
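
For maxSizeMb, one cheap approach is to derive a page budget from the average bytes per page; page sizes vary a lot in practice, so a real implementation would verify each chunk after saving. A minimal sketch (pagesPerChunkForSize is an illustrative name):

// Estimate how many pages fit in one chunk under a byte budget,
// assuming roughly uniform page sizes (a simplification).
function pagesPerChunkForSize(
  totalBytes: number,
  totalPages: number,
  maxSizeMb: number,
): number {
  const avgBytesPerPage = totalBytes / totalPages;
  const maxBytes = maxSizeMb * 1024 * 1024;
  return Math.max(1, Math.floor(maxBytes / avgBytesPerPage));
}

// Example: a 250 MB, 400-page PDF with maxSizeMb: 50 averages
// ~0.625 MB/page, so ~80 pages per chunk → 5 chunks.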

Implementation Notes

  • Use pdf-lib or similar to split by copying pages rather than re-rendering (first sketch after this list)
  • Queue each chunk as a sub-job
  • Aggregate results while maintaining page order (second sketch below)
  • Consider the batch-processing discount (Mistral is 50% cheaper in batch mode)
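
Rough sketch of the pdf-lib approach, copying page objects instead of re-rendering (splitByPages is a made-up name). One caveat: resources shared across pages (fonts, images) get copied into every chunk that uses them, so the chunk sizes can sum to more than the original file:

import { PDFDocument } from "pdf-lib";

// Split a PDF into chunks of at most maxPages pages each.
// copyPages() transfers page objects between documents without rasterizing.
async function splitByPages(bytes: Uint8Array, maxPages: number): Promise<Uint8Array[]> {
  const src = await PDFDocument.load(bytes);
  const total = src.getPageCount();
  const chunks: Uint8Array[] = [];

  for (let start = 0; start < total; start += maxPages) {
    const indices: number[] = [];
    for (let i = start; i < Math.min(start + maxPages, total); i++) indices.push(i);

    const out = await PDFDocument.create();
    const pages = await out.copyPages(src, indices);
    for (const page of pages) out.addPage(page);
    chunks.push(await out.save());
  }
  return chunks;
}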
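
And the fan-out/fan-in for sub-jobs, reassembling pages in chunk order regardless of completion order. ChunkResult, runOcr, and processChunks are placeholders for whatever the job queue ends up providing:

// Hypothetical sub-job result; the real job/queue types don't exist yet.
type ChunkResult = { chunkIndex: number; pages: string[] };

// Run all chunks concurrently, then stitch pages back into document order.
async function processChunks(
  chunks: Uint8Array[],
  runOcr: (chunk: Uint8Array) => Promise<string[]>,
): Promise<string[]> {
  const results: ChunkResult[] = await Promise.all(
    chunks.map(async (chunk, chunkIndex) => ({
      chunkIndex,
      pages: await runOcr(chunk),
    })),
  );
  // Promise.all preserves input order; sorting just makes the intent
  // explicit once results come from a real queue instead.
  results.sort((a, b) => a.chunkIndex - b.chunkIndex);
  return results.flatMap((r) => r.pages);
}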

Use Case

Customer inquiry about processing 300-500 page documents (~200-300 MB)

Labels: enhancement (New feature or request)
