ocrbase

Turn PDFs into structured data at scale. Powered by frontier open-weight OCR models.

Features

Best-in-class OCR - PaddleOCR-VL-1.5 0.9B for accurate text extraction
Structured extraction - Define schemas, get JSON back
Built for scale - Queue-based scaling using BullMQ
Real-time updates - WebSocket notifications for job progress
Self-hostable - Run on your own infrastructure using Self-Hosting Guide

SDK

NOTE: TS SDK is currently moving to ocrbase-typescript

API Docs

OpenAPI UI: https://api.ocrbase.dev/openapi
OpenAPI JSON: https://api.ocrbase.dev/openapi/json

API Usage

# Parse a document
curl -X POST https://api.ocrbase.dev/v1/parse \
  -H "Authorization: Bearer sk_xxx" \
  -F "file=@document.pdf"

# Extract with schema
curl -X POST https://api.ocrbase.dev/v1/extract \
  -H "Authorization: Bearer sk_xxx" \
  -F "file=@invoice.pdf" \
  -F "schemaId=inv_schema_123"

NOTE: Jobs are processed asynchronously.

Realtime Updates

# Subscribe to job status updates
wscat -c "wss://api.ocrbase.dev/v1/realtime?job_id=job_xxx" \
  -H "Authorization: Bearer sk_xxx"

Health Checks

GET /v1/health/live
GET /v1/health/ready

LLM Integration

Best practice: Parse documents with ocrbase before sending to LLMs. Raw PDF binary wastes tokens and produces poor results.

Self-Hosting

See Self-Hosting Guide for deployment instructions.

Requirements: Docker, Bun

Architecture

License

MIT - See LICENSE for details.

Contact

For API access, on-premise deployment, or questions: adammajcher20@gmail.com

Name		Name	Last commit message	Last commit date
Latest commit History 77 Commits
.claude		.claude
.codex		.codex
.github/workflows		.github/workflows
.vscode		.vscode
apps/server		apps/server
docker/paddleocr		docker/paddleocr
docs		docs
examples		examples
packages		packages
spec		spec
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
.oxfmtrc.jsonc		.oxfmtrc.jsonc
.oxlintrc.json		.oxlintrc.json
LICENSE		LICENSE
README.md		README.md
bun.lock		bun.lock
docker-compose.prod.yml		docker-compose.prod.yml
docker-compose.yml		docker-compose.yml
lefthook.yml		lefthook.yml
package.json		package.json
tsconfig.json		tsconfig.json
turbo.json		turbo.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ocrbase

Features

SDK

API Docs

API Usage

Realtime Updates

Health Checks

LLM Integration

Self-Hosting

Architecture

License

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 4

Uh oh!

Languages

License

ocrbase-hq/ocrbase

Folders and files

Latest commit

History

Repository files navigation

ocrbase

Features

SDK

API Docs

API Usage

Realtime Updates

Health Checks

LLM Integration

Self-Hosting

Architecture

License

Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 4

Uh oh!

Languages

Packages