A production-grade backend foundation for Retrieval-Augmented Generation (RAG)
A backend-focused system that lets authenticated users upload documents and retrieve relevant content through a RAG architecture.
This project emphasizes clean system boundaries, deterministic pipelines, asynchronous processing, and cost-aware vector workflows. Ingestion, retrieval, and future LLM layers are intentionally separated so each can evolve independently.
Document ingestion and semantic retrieval are fully implemented and operational. LLM-based answer generation is intentionally deferred.
- User authentication via Clerk
- Secure document upload
- Asynchronous processing using a queue and background workers
- Multi-format document support (PDF, HTML, DOCX, TXT)
- Robust document cleaning and normalization
- Exact and near-duplicate content detection
- Idempotent ingestion using content fingerprints
- Deterministic, resume-safe chunking
- Vector embedding generation
- Vector storage and search in Pinecone
- Normalized, validated semantic retrieval pipeline
- Modular, production-grade backend architecture
The ingestion pipeline explicitly implements the following stages:
Raw Document
→ Text Extraction (format-aware)
→ Boilerplate Removal
→ Text Normalization
→ Exact Deduplication
→ Chunking (deterministic, resume-safe)
→ Near-Duplicate Detection
→ Metadata Enrichment
→ Quality Validation (guardrails + skip semantics)
→ Embeddings (LangChain-managed batching)
→ Batched Vector Indexing (retry-safe, observable)
This ensures high-quality, deduplicated, and retrieval-ready data while preventing runaway embedding costs.
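For illustration, here is a minimal sketch of the fingerprint check behind exact deduplication and idempotent ingestion. It assumes a shared ioredis client and a hypothetical key layout; the real helper and key names live inside the ingestion pipeline.

```js
import { createHash } from "node:crypto";

// Hypothetical helper: returns true if this content was already ingested.
// `redis` is the shared ioredis client; `normalizedText` has already passed
// through the cleaning and normalization stages.
export async function isDuplicateDocument(redis, userId, normalizedText) {
  const fingerprint = createHash("sha256").update(normalizedText).digest("hex");

  // SET ... NX only succeeds if the key does not exist yet, so a concurrent
  // or re-queued job sees null and can skip the document entirely.
  const firstSeen = await redis.set(
    `ingest:fp:${userId}:${fingerprint}`, // illustrative key layout
    Date.now().toString(),
    "NX"
  );

  return firstSeen === null; // null => key already existed => duplicate
}
```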
The retrieval pipeline is designed to produce stable, LLM-ready context:
User Query
→ Query Normalization & Deduplication
→ Query Embedding
→ Centralized Pinecone Search
→ Score-Based Reranking
→ Result Quality Validation
→ Clean Retrieval Output
Retrieval output is intentionally structured to support future LLM integration without refactors.
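A rough sketch of that retrieval path, assuming LangChain's HuggingFace Inference embeddings and the Pinecone JS client. The function name, score threshold, and metadata shape are illustrative, and a simple score filter stands in for the full reranking and validation stages.

```js
import { Pinecone } from "@pinecone-database/pinecone";
import { HuggingFaceInferenceEmbeddings } from "@langchain/community/embeddings/hf";

const embeddings = new HuggingFaceInferenceEmbeddings({
  apiKey: process.env.HUGGINGFACEHUB_API_KEY,
  model: process.env.EMBEDDING_MODEL, // pin the embedding model explicitly
});
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pinecone.index(process.env.PINECONE_INDEX);

// Hypothetical retrieval function: embed the normalized query, search the
// user's namespace, and drop low-scoring matches before returning context.
export async function retrieve(userId, query, { topK = 5, minScore = 0.5 } = {}) {
  const vector = await embeddings.embedQuery(query.trim().toLowerCase());

  const { matches = [] } = await index.namespace(userId).query({
    vector,
    topK,
    includeMetadata: true,
  });

  return matches
    .filter((m) => (m.score ?? 0) >= minScore)
    .map((m) => ({ text: m.metadata?.text, score: m.score }));
}
```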
The system enforces explicit safeguards to ensure predictable behavior:
- Hard chunk limits to prevent runaway documents
- Low-information chunk filtering
- Skip semantics for invalid documents (non-retryable)
- Idempotent ingestion using Redis-backed content fingerprints
- Batched Pinecone indexing with isolated retries
- Namespace isolation to prevent cross-user data leakage
- Retrieval result validation to avoid low-signal context

Invalid documents are intentionally skipped, not failed.
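A minimal sketch of how such guardrails and skip semantics can be expressed; the limits and the `NonRetryableDocumentError` class are illustrative assumptions, not the exact implementation.

```js
// Illustrative guardrail constants; the real limits live in config.
const MAX_CHUNKS_PER_DOC = 2000;
const MIN_CHARS_PER_CHUNK = 40;

// Hypothetical error type the worker treats as "skip, do not retry".
export class NonRetryableDocumentError extends Error {}

export function validateChunks(chunks) {
  if (chunks.length === 0) {
    throw new NonRetryableDocumentError("No extractable text");
  }
  if (chunks.length > MAX_CHUNKS_PER_DOC) {
    throw new NonRetryableDocumentError("Document exceeds hard chunk limit");
  }

  // Low-information chunks are dropped rather than embedded,
  // which keeps embedding spend bounded.
  return chunks.filter((c) => c.text.trim().length >= MIN_CHARS_PER_CHUNK);
}
```

In the worker, an error of this type would be caught and the document marked as skipped rather than re-enqueued.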
Client (Next.js + Clerk)
↓
Express API
├─ Auth
├─ Upload
└─ Retrieval
↓
Valkey / Redis (persistent)
├─ BullMQ (job queue)
├─ Fingerprint store (idempotent ingestion)
└─ Shared Redis client
↓
Background Worker (ingestion only)
├─ Extraction
├─ Cleaning & Normalization
├─ Deduplication
├─ Chunking
├─ Validation
├─ Embeddings
└─ Pinecone Indexing
Express API (retrieval path)
↓
Retrieval Pipeline
├─ Query normalization
├─ Query embedding
├─ Pinecone search
├─ Reranking
└─ Result validation
- Authentication enforced at the API boundary
- All heavy processing is offloaded to background workers
- Pipelines are deterministic and retry-safe
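As a sketch of that boundary, an upload route that authenticates, accepts the file, and only enqueues work, leaving all heavy processing to the worker. It assumes the `@clerk/express` helpers and hypothetical module paths; the real route lives under `routes/`.

```js
import express from "express";
import multer from "multer";
import { Queue } from "bullmq";
import { requireAuth, getAuth } from "@clerk/express"; // assumes Clerk's Express middleware is configured on the app
import { redis } from "../common/redisClient.js"; // hypothetical shared client module

const upload = multer({ dest: "uploads/" });
const ingestionQueue = new Queue("ingestion", { connection: redis });

export const uploadRouter = express.Router();

uploadRouter.post("/upload", requireAuth(), upload.single("file"), async (req, res) => {
  // The API only validates and enqueues; extraction, chunking, and embedding
  // all happen in the background worker.
  const job = await ingestionQueue.add("ingest-document", {
    userId: getAuth(req).userId,
    filePath: req.file.path,
    originalName: req.file.originalname,
    mimeType: req.file.mimetype,
  });

  res.status(202).json({ jobId: job.id });
});
```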
- Next.js
- TypeScript
- Clerk Authentication
- Node.js
- Express
- Multer
- BullMQ (queue + worker)
- Valkey / Redis
- LangChain
- HuggingFace Inference embeddings
- Pinecone (vector database)
- Docker
- Docker Compose
pdf_rag_assistant
├── server
│ ├── app.js # API server
│ ├── worker.js # Background worker
│ ├── common/ # Shared cross-pipeline logic
│ │ ├── embeddings/
│ │ ├── vectorStore/
│ │ ├── normalize/
│ │ ├── quality/
│ │ └── context/
│ ├── ingestion/
│ │ ├── extract/
│ │ ├── clean/
│ │ ├── dedupe/
│ │ ├── chunk/
│ │ ├── enrich/
│ │ ├── index/
│ │ └── pipeline.js
│ ├── retrieval/
│ │ ├── parse/
│ │ ├── embed/
│ │ ├── search/
│ │ ├── rerank/
│ │ └── pipeline.js
│ ├── routes/
│ ├── utils/
│ └── config/
├── client/web
├── scripts
└── docker-compose.yml

- Node.js (v18+)
- Docker
Create environment files using the provided .env.example templates.
A single Redis connection is shared by queues, idempotency, and batching utilities.
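A minimal sketch of such a shared client module, assuming ioredis (which BullMQ uses under the hood); the path and option values are illustrative.

```js
// common/redisClient.js (illustrative path) — one client instance reused by
// BullMQ, the fingerprint store, and batching utilities.
import IORedis from "ioredis";

export const redis = new IORedis(process.env.REDIS_URL, {
  // BullMQ requires this so blocking commands are never silently retried.
  maxRetriesPerRequest: null,
});
```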
Redis / Valkey is used for:
- Job queueing (BullMQ)
- Persistent ingestion fingerprints
- Resume-safe processing

Redis persistence **must be enabled** to preserve idempotency across restarts.
Supported persistence modes:
- RDB snapshots (recommended)
- AOF (append-only file)
When using Docker, a volume must be mounted to persist data.
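An illustrative docker-compose fragment showing one way to enable snapshots and mount a volume; the service name, image tag, and paths are assumptions, not the project's actual compose file.

```yaml
services:
  valkey:
    image: valkey/valkey:8
    # --save 60 1 enables RDB snapshots (at least one write every 60s);
    # use --appendonly yes for AOF instead.
    command: ["valkey-server", "--save", "60", "1"]
    volumes:
      - valkey-data:/data   # without a volume, fingerprints vanish on restart

volumes:
  valkey-data:
```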
From the project root, run:
npm install
npm run dev

This will:
- Start Redis / Valkey
- Start the backend API
- Start the background worker
- Start the frontend application
- Workers consume jobs from a shared queue
- Concurrency is intentionally limited
- Files are read using normalized absolute paths
- Extraction uses buffer-based loading for cross-platform safety
- Batch-level retries occur only where failures happen
- Invalid documents are skipped, not retried
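A simplified worker sketch showing shared-queue consumption with capped concurrency, assuming BullMQ and a hypothetical `runIngestionPipeline` export from `ingestion/pipeline.js`.

```js
import { Worker } from "bullmq";
import { redis } from "./common/redisClient.js"; // hypothetical shared client
import { runIngestionPipeline } from "./ingestion/pipeline.js";

// Concurrency is capped so embedding and indexing load stays predictable.
const worker = new Worker(
  "ingestion",
  async (job) => {
    // Each job carries the normalized absolute path written by the upload route.
    return runIngestionPipeline(job.data);
  },
  { connection: redis, concurrency: 2 }
);

worker.on("failed", (job, err) => {
  console.error(`job ${job?.id} failed: ${err.message}`);
});
```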
| Feature | Status |
|---|---|
| Authentication & uploads | ✅ |
| Asynchronous ingestion | ✅ |
| Multi-format extraction | ✅ |
| Cleaning & deduplication | ✅ |
| Embeddings & Pinecone storage | ✅ |
| Retrieval API | ✅ |
| LLM answer generation | ❌ |
- Clear separation of concerns (API, worker, pipelines)
- Deterministic, retry-safe ingestion and retrieval
- Infrastructure decoupled from business logic
- Persistent state is a correctness requirement for idempotency
- Prefer no ingestion over incorrect ingestion
- LLM-ready but not LLM-dependent
- Single shared Redis client to prevent connection sprawl
- Batching applied only at network boundaries
- Embeddings batched internally by LangChain
- Pinecone indexing batched at the application level
- Each batch logged with start / success / failure states
- Retries scoped to failing batches, not entire documents
- Failures propagate cleanly for queue-level retries
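A rough sketch of batch-scoped retries with per-batch logging, assuming the Pinecone JS client; batch size, attempt count, and logger shape are illustrative.

```js
// Hypothetical batch size and retry count; the real values are configuration.
const BATCH_SIZE = 100;
const MAX_ATTEMPTS = 3;

export async function indexVectors(index, namespace, vectors, log = console) {
  for (let start = 0; start < vectors.length; start += BATCH_SIZE) {
    const batch = vectors.slice(start, start + BATCH_SIZE);
    const batchNo = start / BATCH_SIZE + 1;

    for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      try {
        log.info(`batch ${batchNo}: upserting ${batch.length} vectors`);
        await index.namespace(namespace).upsert(batch);
        log.info(`batch ${batchNo}: success`);
        break;
      } catch (err) {
        log.warn(`batch ${batchNo}: attempt ${attempt} failed: ${err.message}`);
        // Only the failing batch is retried; after the last attempt the error
        // propagates so BullMQ can apply queue-level retry policy.
        if (attempt === MAX_ATTEMPTS) throw err;
      }
    }
  }
}
```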
- LLM-based answer generation
- Context window management
- Source citations
- Hybrid retrieval (keyword + vector)
- Multi-document reasoning
This project provides a production-grade ingestion and retrieval foundation for RAG systems.
The pipelines are deterministic, idempotent, cost-aware, and designed to run safely at scale without duplicate data or unbounded embedding costs.