A production-ready microservice designed to handle the heavy lifting of PDF text extraction and token-aware chunking for RAG (Retrieval-Augmented Generation) workflows.
- FastAPI Core: High-performance asynchronous endpoints.
- Precision Extraction: Clean text extraction from PDFs using pypdf.
- Token-Aware Chunking: Uses tiktoken to keep each chunk within the requested token limit, so chunks fit within LLM context windows.
- Production Infrastructure: Standardized Makefile, Dockerfile, and CI/CD.
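
The token-aware chunking named in the list above can be sketched roughly as follows. This is a minimal illustration, not the service's actual implementation: the function names (`chunk_tokens`, `chunk_text`, `toy_encode`, `toy_decode`) are hypothetical, and a stand-in tokenizer (one token per character) is used so the example runs without third-party packages. With tiktoken installed, the `encode` and `decode` arguments could instead be the `encode` and `decode` methods of `tiktoken.get_encoding("cl100k_base")`; the choice of encoding is an assumption.

```python
from typing import Callable, List, Sequence


def chunk_tokens(tokens: Sequence[int], max_tokens: int) -> List[List[int]]:
    """Split a token sequence into consecutive windows of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be a positive integer")
    return [list(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]


def chunk_text(
    text: str,
    max_tokens: int,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> dict:
    """Chunk text by token count and return a dict shaped like the /chunk output."""
    tokens = encode(text)
    windows = chunk_tokens(tokens, max_tokens)
    return {
        "chunks": [decode(window) for window in windows],
        "total_chunks": len(windows),
        "total_tokens": len(tokens),
    }


# Stand-in tokenizer (assumption): one "token" per Unicode code point.
def toy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def toy_decode(tokens: List[int]) -> str:
    return "".join(chr(t) for t in tokens)


result = chunk_text("a" * 1850, max_tokens=1000, encode=toy_encode, decode=toy_decode)
print(result["total_chunks"], result["total_tokens"])  # prints: 2 1850
```

A hard cut every `max_tokens` tokens can split a word or sentence across two chunks; a real service may prefer to cut at paragraph or sentence boundaries while still respecting the limit.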
- Python: 3.10+
- UV: Fast Python package manager
- Make: Build automation tool
- Docker: For containerized deployment
make setup
make dev
The API will be available at http://localhost:8000. Access /docs for Swagger UI.
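
Once the service is running, a client could prepare requests for the two endpoints documented below along these lines. This is a sketch using only the standard library, under stated assumptions: the helper names (`build_chunk_request`, `build_extract_request`, `send`) are hypothetical, and the multipart form field name `"file"` is a guess that should be checked against the schema shown at /docs.

```python
import json
import uuid
import urllib.request

BASE_URL = "http://localhost:8000"


def build_chunk_request(text, max_tokens, base_url=BASE_URL):
    """Build a JSON POST for /chunk with the documented request fields."""
    body = json.dumps({"text": text, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chunk",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_extract_request(filename, pdf_bytes, field_name="file", base_url=BASE_URL):
    """Build a multipart/form-data POST for /extract.

    field_name="file" is an assumption, not confirmed by the README.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/extract",
        data=head + pdf_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON reply (needs a running service)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


chunk_req = build_chunk_request("Long document text...", 1000)
print(chunk_req.get_method(), chunk_req.full_url)
```

Building the request separately from sending it keeps the sketch usable without a live server; calling `send(chunk_req)` performs the actual HTTP call.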
Request: POST /extract (Multipart File)
Output:
{
  "full_text": "Extracted document content...",
  "total_pages": 5,
  "filename": "sample.pdf"
}

Request: POST /chunk
{
  "text": "Long document text...",
  "max_tokens": 1000
}
Output:
{
  "chunks": ["Part 1...", "Part 2..."],
  "total_chunks": 2,
  "total_tokens": 1850
}

- Initial FastAPI modularization.
- Token-aware chunking logic.
- Support for OCR (Optical Character Recognition) for scanned PDFs.
- Multi-format support (DOCX, HTML).
- Linting: make lint
- Testing: make test
- Container: make up