PDF RAG Processor

A production-ready microservice that handles PDF text extraction and token-aware chunking for RAG (Retrieval-Augmented Generation) workflows.

Features

FastAPI Core: High-performance asynchronous endpoints.
Precision Extraction: Clean text extraction from PDFs using pypdf.
Token-Aware Chunking: Uses tiktoken to count tokens so that each chunk stays within the requested max_tokens limit.
Production Infrastructure: Standardized Makefile, Dockerfile, and CI/CD.

Prerequisites

Python: 3.10+
UV: Fast Python package manager
Make: Build automation tool
Docker: For containerized deployment

Usage

1. Setup & Installation

make setup

2. Run Development Server

make dev

The API will be available at http://localhost:8000. Access /docs for Swagger UI.

3. API Scenarios

Scenario: Extract Text from PDF

Request: POST /extract (multipart file upload)

Output:

{
  "full_text": "Extracted document content...",
  "total_pages": 5,
  "filename": "sample.pdf"
}
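
For orientation, here is a minimal sketch of how a response with this shape could be assembled using pypdf. It is not the service's actual code: the function names `read_pdf_pages` and `build_extract_response` are hypothetical, and joining pages with a newline is an assumption.

```python
import io
from typing import Dict, List, Union


def read_pdf_pages(pdf_bytes: bytes) -> List[str]:
    """Return the extracted text of each page (hypothetical helper)."""
    # Imported lazily so the rest of the sketch runs without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() may yield an empty string for image-only (scanned) pages.
    return [page.extract_text() or "" for page in reader.pages]


def build_extract_response(
    page_texts: List[str], filename: str
) -> Dict[str, Union[str, int]]:
    """Assemble a payload shaped like the sample output above."""
    return {
        # Assumption: pages are joined with a single newline.
        "full_text": "\n".join(text.strip() for text in page_texts),
        "total_pages": len(page_texts),
        "filename": filename,
    }


# Usage with pre-extracted page texts (no PDF file needed):
response = build_extract_response(["Page one. ", "Page two."], "sample.pdf")
```

With a real file, the two helpers would be combined as `build_extract_response(read_pdf_pages(data), "sample.pdf")`, where `data` holds the uploaded bytes.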

Scenario: Generate LLM-Ready Chunks

Request: POST /chunk

{
  "text": "Long document text...",
  "max_tokens": 1000
}

Output:

{
  "chunks": ["Part 1...", "Part 2..."],
  "total_chunks": 2,
  "total_tokens": 1850
}
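
The sample above is consistent with fixed-size token windows: 1850 tokens with max_tokens set to 1000 would give one chunk of 1000 tokens and one of 850. Whether the service splits this way is an assumption. The sketch below illustrates that approach only; `chunk_by_tokens`, `tiktoken_codec`, and the whitespace codec are hypothetical names, and the `cl100k_base` encoding is an assumed default.

```python
from typing import Callable, Dict, List, Sequence, Tuple, Union


def chunk_by_tokens(
    text: str,
    max_tokens: int,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
) -> Dict[str, Union[List[str], int]]:
    """Split text into consecutive windows of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = encode(text)
    chunks = [
        decode(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]
    return {
        "chunks": chunks,
        "total_chunks": len(chunks),
        "total_tokens": len(tokens),
    }


def tiktoken_codec(encoding_name: str = "cl100k_base") -> Tuple[Callable, Callable]:
    """Return (encode, decode) backed by tiktoken; the encoding is an assumption."""
    import tiktoken  # lazy import: only needed when this codec is requested

    enc = tiktoken.get_encoding(encoding_name)
    # Note: a window boundary can fall inside a multi-byte character.
    return enc.encode, enc.decode


# Stand-in codec so the sketch runs without tiktoken: one token per word.
def whitespace_encode(text: str) -> List[str]:
    return text.split()


def whitespace_decode(tokens: Sequence[str]) -> str:
    return " ".join(tokens)


result = chunk_by_tokens("a b c d e", 2, whitespace_encode, whitespace_decode)
```

To count real model tokens instead of words, pass the pair returned by `tiktoken_codec()` as `encode` and `decode`. Fixed windows can cut sentences in half; splitting on sentence or paragraph boundaries first is a common refinement.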

Roadmap

Initial FastAPI modularization.
Token-aware chunking logic.
Support for OCR (Optical Character Recognition) for scanned PDFs.
Multi-format support (DOCX, HTML).

Development

Linting: make lint
Testing: make test
Container: make up
