Turn PDFs into structured data at scale. Powered by frontier open-weight OCR models.
- Best-in-class OCR - PaddleOCR-VL-1.5 0.9B for accurate text extraction
- Structured extraction - Define schemas, get JSON back
- Built for scale - Queue-based scaling using BullMQ
- Real-time updates - WebSocket notifications for job progress
- Self-hostable - Run on your own infrastructure; see the Self-Hosting Guide
NOTE: The TypeScript SDK is currently moving to ocrbase-typescript.
- OpenAPI UI: https://api.ocrbase.dev/openapi
- OpenAPI JSON: https://api.ocrbase.dev/openapi/json
# Parse a document
curl -X POST https://api.ocrbase.dev/v1/parse \
-H "Authorization: Bearer sk_xxx" \
-F "file=@document.pdf"
# Extract with schema
curl -X POST https://api.ocrbase.dev/v1/extract \
-H "Authorization: Bearer sk_xxx" \
-F "file=@invoice.pdf" \
-F "schemaId=inv_schema_123"
NOTE: Jobs are processed asynchronously.
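The two requests above can be wrapped in small shell functions so the API key is not hard-coded in each command. This is a minimal sketch, not part of ocrbase: the function names and the `OCRBASE_URL` and `OCRBASE_API_KEY` environment variables are assumptions made for illustration. Only the endpoints and form fields come from the examples above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; function and variable names are illustrative only.

# Base URL of the API; override it for a self-hosted instance.
OCRBASE_URL="${OCRBASE_URL:-https://api.ocrbase.dev}"

# ocrbase_parse FILE: upload a document to /v1/parse.
ocrbase_parse() {
  local file="$1"
  # ${VAR:?msg} stops with an error message if the key is not set.
  curl -sS -X POST "${OCRBASE_URL}/v1/parse" \
    -H "Authorization: Bearer ${OCRBASE_API_KEY:?set OCRBASE_API_KEY first}" \
    -F "file=@${file}"
}

# ocrbase_extract FILE SCHEMA_ID: upload a document to /v1/extract
# together with the ID of a previously defined schema.
ocrbase_extract() {
  local file="$1" schema_id="$2"
  curl -sS -X POST "${OCRBASE_URL}/v1/extract" \
    -H "Authorization: Bearer ${OCRBASE_API_KEY:?set OCRBASE_API_KEY first}" \
    -F "file=@${file}" \
    -F "schemaId=${schema_id}"
}
```

After `export OCRBASE_API_KEY=sk_xxx`, a call looks like `ocrbase_parse document.pdf` or `ocrbase_extract invoice.pdf inv_schema_123`.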
# Subscribe to job status updates
wscat -c "wss://api.ocrbase.dev/v1/realtime?job_id=job_xxx" \
-H "Authorization: Bearer sk_xxx"
Health checks:
- GET /v1/health/live
- GET /v1/health/ready
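The readiness endpoint `GET /v1/health/ready` can gate a deployment script or a batch run until the service accepts traffic. The sketch below assumes a healthy endpoint answers with a 2xx status, which is conventional but not stated here. The `probe` and `wait_until_ready` function names and the `OCRBASE_URL` variable are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; function and variable names are illustrative only.

# Base URL of the API; override it for a self-hosted instance.
OCRBASE_URL="${OCRBASE_URL:-https://api.ocrbase.dev}"

# probe PATH: succeed only if the endpoint answers with a 2xx status
# (assumed meaning of "healthy"). --fail makes curl exit non-zero on
# HTTP errors; --max-time bounds each attempt to 5 seconds.
probe() {
  curl --fail --silent --output /dev/null --max-time 5 "${OCRBASE_URL}$1"
}

# wait_until_ready [ATTEMPTS] [DELAY_SECONDS]: poll the readiness endpoint.
# Returns 0 as soon as one probe succeeds, 1 if every attempt fails.
wait_until_ready() {
  local attempts="${1:-30}" delay="${2:-2}" i
  for ((i = 1; i <= attempts; i++)); do
    if probe /v1/health/ready; then
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

A script could then run `wait_until_ready 30 2 || exit 1` before submitting documents. The same `probe` function works for `/v1/health/live`.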
Best practice: Parse documents with ocrbase before sending to LLMs. Raw PDF binary wastes tokens and produces poor results.
See the Self-Hosting Guide for deployment instructions.
Requirements: Docker, Bun
MIT - See LICENSE for details.
For API access, on-premise deployment, or questions: adammajcher20@gmail.com