A production-grade backend foundation for Retrieval-Augmented Generation (RAG)
A backend-focused system that lets authenticated users upload documents and retrieve relevant content through a RAG architecture.
This project emphasizes clean system boundaries, deterministic pipelines, asynchronous processing, and cost-aware vector workflows. Ingestion, retrieval, and future LLM layers are intentionally separated so each can evolve independently.
Document ingestion and semantic retrieval are fully implemented and operational. LLM-based answer generation is intentionally deferred.
- User authentication via Clerk
- Secure document upload
- Asynchronous processing using a queue and background workers
- Multi-format document support (PDF, HTML, DOCX, TXT)
- Robust document cleaning and normalization
- Exact and near-duplicate content detection
- Idempotent ingestion using content fingerprints
- Deterministic, resume-safe chunking
- Vector embedding generation
- Vector storage and search in Pinecone
- Normalized, validated semantic retrieval pipeline
- Modular, production-grade backend architecture
The ingestion pipeline explicitly implements the following stages:
Raw Document
→ Text Extraction (format-aware)
→ Boilerplate Removal
→ Text Normalization
→ Exact Deduplication
→ Chunking (deterministic, resume-safe)
→ Near-Duplicate Detection
→ Metadata Enrichment
→ Quality Validation (guardrails + skip semantics)
→ Embeddings (LangChain-managed batching)
→ Batched Vector Indexing (retry-safe, observable)
This ensures high-quality, deduplicated, and retrieval-ready data while preventing runaway embedding costs.
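For illustration, here is a minimal sketch of the fingerprint check behind exact deduplication and idempotent ingestion. It assumes a shared ioredis client and a hypothetical key layout; the real helper and key names live inside the ingestion pipeline.

```js
import { createHash } from "node:crypto";

// Hypothetical helper: returns true if this content was already ingested.
// `redis` is the shared ioredis client; `normalizedText` has already passed
// through the cleaning and normalization stages.
export async function isDuplicateDocument(redis, userId, normalizedText) {
  const fingerprint = createHash("sha256").update(normalizedText).digest("hex");

  // SET ... NX only succeeds if the key does not exist yet, so a concurrent
  // or re-queued job sees null and can skip the document entirely.
  const firstSeen = await redis.set(
    `ingest:fp:${userId}:${fingerprint}`, // illustrative key layout
    Date.now().toString(),
    "NX"
  );

  return firstSeen === null; // null => key already existed => duplicate
}
```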
The retrieval pipeline is designed to produce stable, LLM-ready context:
User Query
→ Query Normalization & Deduplication
→ Query Embedding
→ Centralized Pinecone Search
→ Score-Based Reranking
→ Result Quality Validation
→ Clean Retrieval Output
Retrieval output is intentionally structured to support future LLM integration without refactors.
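A rough sketch of that retrieval path, assuming LangChain's HuggingFace Inference embeddings and the Pinecone JS client. The function name, score threshold, and metadata shape are illustrative, and a simple score filter stands in for the full reranking and validation stages.

```js
import { Pinecone } from "@pinecone-database/pinecone";
import { HuggingFaceInferenceEmbeddings } from "@langchain/community/embeddings/hf";

const embeddings = new HuggingFaceInferenceEmbeddings({
  apiKey: process.env.HUGGINGFACEHUB_API_KEY,
  model: process.env.EMBEDDING_MODEL, // pin the embedding model explicitly
});
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pinecone.index(process.env.PINECONE_INDEX);

// Hypothetical retrieval function: embed the normalized query, search the
// user's namespace, and drop low-scoring matches before returning context.
export async function retrieve(userId, query, { topK = 5, minScore = 0.5 } = {}) {
  const vector = await embeddings.embedQuery(query.trim().toLowerCase());

  const { matches = [] } = await index.namespace(userId).query({
    vector,
    topK,
    includeMetadata: true,
  });

  return matches
    .filter((m) => (m.score ?? 0) >= minScore)
    .map((m) => ({ text: m.metadata?.text, score: m.score }));
}
```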
The system enforces explicit safeguards to ensure predictable behavior:
- Hard chunk limits to prevent runaway documents
- Low-information chunk filtering
- Skip semantics for invalid documents (non-retryable)
- Idempotent ingestion using Redis-backed content fingerprints
- Batched Pinecone indexing with isolated retries
- Namespace isolation to prevent cross-user data leakage
- Retrieval result validation to avoid low-signal context

Invalid documents are intentionally skipped, not failed.
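A minimal sketch of how such guardrails and skip semantics can be expressed; the limits and the `NonRetryableDocumentError` class are illustrative assumptions, not the exact implementation.

```js
// Illustrative guardrail constants; the real limits live in config.
const MAX_CHUNKS_PER_DOC = 2000;
const MIN_CHARS_PER_CHUNK = 40;

// Hypothetical error type the worker treats as "skip, do not retry".
export class NonRetryableDocumentError extends Error {}

export function validateChunks(chunks) {
  if (chunks.length === 0) {
    throw new NonRetryableDocumentError("No extractable text");
  }
  if (chunks.length > MAX_CHUNKS_PER_DOC) {
    throw new NonRetryableDocumentError("Document exceeds hard chunk limit");
  }

  // Low-information chunks are dropped rather than embedded,
  // which keeps embedding spend bounded.
  return chunks.filter((c) => c.text.trim().length >= MIN_CHARS_PER_CHUNK);
}
```

In the worker, an error of this type would be caught and the document marked as skipped rather than re-enqueued.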
Client (Next.js + Clerk)
↓
Express API
├─ Auth
├─ Upload
└─ Retrieval
↓
Valkey / Redis (persistent)
├─ BullMQ (job queue)
├─ Fingerprint store (idempotent ingestion)
└─ Shared Redis client
↓
Background Worker (ingestion only)
├─ Extraction
├─ Cleaning & Normalization
├─ Deduplication
├─ Chunking
├─ Validation
├─ Embeddings
└─ Pinecone Indexing
Express API (retrieval path)
↓
Retrieval Pipeline
├─ Query normalization
├─ Query embedding
├─ Pinecone search
├─ Reranking
└─ Result validation
- Authentication enforced at the API boundary
- All heavy processing is offloaded to background workers
- Pipelines are deterministic and retry-safe
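As a sketch of that boundary, an upload route that authenticates, accepts the file, and only enqueues work, leaving all heavy processing to the worker. It assumes the `@clerk/express` helpers and hypothetical module paths; the real route lives under `routes/`.

```js
import express from "express";
import multer from "multer";
import { Queue } from "bullmq";
import { requireAuth, getAuth } from "@clerk/express"; // assumes Clerk's Express middleware is configured on the app
import { redis } from "../common/redisClient.js"; // hypothetical shared client module

const upload = multer({ dest: "uploads/" });
const ingestionQueue = new Queue("ingestion", { connection: redis });

export const uploadRouter = express.Router();

uploadRouter.post("/upload", requireAuth(), upload.single("file"), async (req, res) => {
  // The API only validates and enqueues; extraction, chunking, and embedding
  // all happen in the background worker.
  const job = await ingestionQueue.add("ingest-document", {
    userId: getAuth(req).userId,
    filePath: req.file.path,
    originalName: req.file.originalname,
    mimeType: req.file.mimetype,
  });

  res.status(202).json({ jobId: job.id });
});
```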
- Next.js
- TypeScript
- Clerk Authentication
- Node.js
- Express
- Multer
- BullMQ (queue + worker)
- Valkey / Redis
- LangChain
- HuggingFace Inference embeddings
- Pinecone (vector database)
- Docker
- Docker Compose
pdf_rag_assistant
├── server
│ ├── app.js # API server
│ ├── worker.js # Background worker
│ ├── common/ # Shared cross-pipeline logic
│ │ ├── embeddings/
│ │ ├── vectorStore/
│ │ ├── normalize/
│ │ ├── quality/
│ │ └── context/
│ ├── ingestion/
│ │ ├── extract/
│ │ ├── clean/
│ │ ├── dedupe/
│ │ ├── chunk/
│ │ ├── enrich/
│ │ ├── index/
│ │ └── pipeline.js
│ ├── retrieval/
│ │ ├── parse/
│ │ ├── embed/
│ │ ├── search/
│ │ ├── rerank/
│ │ └── pipeline.js
│ ├── routes/
│ ├── utils/
│ └── config/
├── client/web
├── scripts
└── docker-compose.yml

- Node.js (v18+)
- Docker
Create environment files using the provided .env.example templates.
A single Redis connection is shared by queues, idempotency, and batching utilities.
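A minimal sketch of such a shared client module, assuming ioredis (which BullMQ uses under the hood); the path and option values are illustrative.

```js
// common/redisClient.js (illustrative path) — one client instance reused by
// BullMQ, the fingerprint store, and batching utilities.
import IORedis from "ioredis";

export const redis = new IORedis(process.env.REDIS_URL, {
  // BullMQ requires this so blocking commands are never silently retried.
  maxRetriesPerRequest: null,
});
```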
Redis / Valkey is used for:
- Job queueing (BullMQ)
- Persistent ingestion fingerprints
- Resume-safe processing

Redis persistence **must be enabled** to preserve idempotency across restarts.
Supported persistence modes:
- RDB snapshots (recommended)
- AOF (append-only file)
When using Docker, a volume must be mounted to persist data.
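An illustrative docker-compose fragment showing one way to enable snapshots and mount a volume; the service name, image tag, and paths are assumptions, not the project's actual compose file.

```yaml
services:
  valkey:
    image: valkey/valkey:8
    # --save 60 1 enables RDB snapshots (at least one write every 60s);
    # use --appendonly yes for AOF instead.
    command: ["valkey-server", "--save", "60", "1"]
    volumes:
      - valkey-data:/data   # without a volume, fingerprints vanish on restart

volumes:
  valkey-data:
```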
From the project root, run:
npm install
npm run dev

This will:
- Start Redis / Valkey
- Start the backend API
- Start the background worker
- Start the frontend application
- Workers consume jobs from a shared queue
- Concurrency is intentionally limited
- Files are read using normalized absolute paths
- Extraction uses buffer-based loading for cross-platform safety
- Batch-level retries occur only where failures happen
- Invalid documents are skipped, not retried
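A simplified worker sketch showing shared-queue consumption with capped concurrency, assuming BullMQ and a hypothetical `runIngestionPipeline` export from `ingestion/pipeline.js`.

```js
import { Worker } from "bullmq";
import { redis } from "./common/redisClient.js"; // hypothetical shared client
import { runIngestionPipeline } from "./ingestion/pipeline.js";

// Concurrency is capped so embedding and indexing load stays predictable.
const worker = new Worker(
  "ingestion",
  async (job) => {
    // Each job carries the normalized absolute path written by the upload route.
    return runIngestionPipeline(job.data);
  },
  { connection: redis, concurrency: 2 }
);

worker.on("failed", (job, err) => {
  console.error(`job ${job?.id} failed: ${err.message}`);
});
```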
| Feature | Status |
|---|---|
| Authentication & uploads | ✅ |
| Asynchronous ingestion | ✅ |
| Multi-format extraction | ✅ |
| Cleaning & deduplication | ✅ |
| Embeddings & Pinecone storage | ✅ |
| Retrieval API | ✅ |
| LLM answer generation | ❌ |
- Clear separation of concerns (API, worker, pipelines)
- Deterministic, retry-safe ingestion and retrieval
- Infrastructure decoupled from business logic
- Persistent state is a correctness requirement for idempotency
- Prefer no ingestion over incorrect ingestion
- LLM-ready but not LLM-dependent
- Single shared Redis client to prevent connection sprawl
- Batching applied only at network boundaries
- Embeddings batched internally by LangChain
- Pinecone indexing batched at the application level
- Each batch logged with start / success / failure states
- Retries scoped to failing batches, not entire documents
- Failures propagate cleanly for queue-level retries
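A rough sketch of batch-scoped retries with per-batch logging, assuming the Pinecone JS client; batch size, attempt count, and logger shape are illustrative.

```js
// Hypothetical batch size and retry count; the real values are configuration.
const BATCH_SIZE = 100;
const MAX_ATTEMPTS = 3;

export async function indexVectors(index, namespace, vectors, log = console) {
  for (let start = 0; start < vectors.length; start += BATCH_SIZE) {
    const batch = vectors.slice(start, start + BATCH_SIZE);
    const batchNo = start / BATCH_SIZE + 1;

    for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      try {
        log.info(`batch ${batchNo}: upserting ${batch.length} vectors`);
        await index.namespace(namespace).upsert(batch);
        log.info(`batch ${batchNo}: success`);
        break;
      } catch (err) {
        log.warn(`batch ${batchNo}: attempt ${attempt} failed: ${err.message}`);
        // Only the failing batch is retried; after the last attempt the error
        // propagates so BullMQ can apply queue-level retry policy.
        if (attempt === MAX_ATTEMPTS) throw err;
      }
    }
  }
}
```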
- LLM-based answer generation
- Context window management
- Source citations
- Hybrid retrieval (keyword + vector)
- Multi-document reasoning
This project provides a production-grade ingestion and retrieval foundation for RAG systems.
The pipelines are deterministic, idempotent, cost-aware, and designed to run safely at scale without duplicate data or unbounded embedding costs.