PDF RAG Assistant

A production-grade backend foundation for Retrieval-Augmented Generation (RAG)

A backend-focused system that allows authenticated users to upload documents and retrieve relevant content using a RAG architecture.

This project emphasizes clean system boundaries, deterministic pipelines, asynchronous processing, and cost-aware vector workflows. Ingestion, retrieval, and future LLM layers are intentionally separated so each can evolve independently.

Document ingestion and semantic retrieval are fully implemented and operational. LLM-based answer generation is intentionally deferred.

🚀 Key Features

  • User authentication via Clerk
  • Secure document upload
  • Asynchronous processing using a queue and background workers
  • Multi-format document support (PDF, HTML, DOCX, TXT)
  • Robust document cleaning and normalization
  • Exact and near-duplicate content detection
  • Idempotent ingestion using content fingerprints
  • Deterministic, resume-safe chunking
  • Vector embeddings generation
  • Vector storage and search in Pinecone
  • Normalized, validated semantic retrieval pipeline
  • Modular, production-grade backend architecture

⚠️ LLM-based answer generation is intentionally not implemented yet. This repository focuses on ingestion and retrieval correctness first.
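
For orientation, here is a minimal client-side sketch of how the upload and retrieval endpoints might be called. The route paths, response shapes, and the way the Clerk session token is obtained are illustrative assumptions, not the repository's actual API contract.

// Hypothetical usage sketch: route paths and response shapes are assumptions.
async function uploadDocument(file, clerkToken) {
  const form = new FormData();
  form.append("file", file); // Multer parses the multipart body server-side

  const res = await fetch("/api/documents", {
    method: "POST",
    headers: { Authorization: `Bearer ${clerkToken}` },
    body: form,
  });
  return res.json(); // e.g. { documentId, status: "queued" }
}

async function retrieve(query, clerkToken) {
  const res = await fetch(`/api/retrieve?query=${encodeURIComponent(query)}`, {
    headers: { Authorization: `Bearer ${clerkToken}` },
  });
  return res.json(); // normalized, validated chunks ready for a future LLM prompt
}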

Ingestion Pipeline

The ingestion pipeline explicitly implements the following stages:

Raw Document
 → Text Extraction (format-aware)
 → Boilerplate Removal
 → Text Normalization
 → Exact Deduplication
 → Chunking (deterministic, resume-safe)
 → Near-Duplicate Detection
 → Metadata Enrichment
 → Quality Validation (guardrails + skip semantics)
 → Embeddings (LangChain-managed batching)
 → Batched Vector Indexing (retry-safe, observable)

This ensures high-quality, deduplicated, and retrieval-ready data while preventing runaway embedding costs.
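
As a rough illustration of how these stages compose, the sketch below strings them together in the spirit of server/ingestion/pipeline.js. All helper names, the chunk-limit constant, and the return shapes are hypothetical stand-ins, not the actual module exports.

// Illustrative composition of the ingestion stages; all helper names are hypothetical.
async function runIngestionPipeline(rawDocument) {
  const text = await extractText(rawDocument);               // format-aware extraction
  const cleaned = normalizeText(removeBoilerplate(text));

  // Exact deduplication via the shared Redis fingerprint store (see Guardrails).
  if (await isExactDuplicate(cleaned)) {
    return { status: "skipped", reason: "exact-duplicate" }; // skip, don't fail
  }

  const chunks = chunkDeterministically(cleaned, { maxChunks: MAX_CHUNKS_PER_DOC });
  const unique = await dropNearDuplicates(chunks);
  const enriched = unique.map((chunk) => enrichMetadata(chunk, rawDocument));

  const valid = enriched.filter(passesQualityGuardrails);    // low-information chunks dropped
  if (valid.length === 0) {
    return { status: "skipped", reason: "no-valid-chunks" }; // non-retryable skip
  }

  const vectors = await embedChunks(valid);                  // LangChain batches internally
  await indexInBatches(vectors);                             // batched, retry-safe Pinecone upserts
  return { status: "indexed", chunks: valid.length };
}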

Retrieval Pipeline

The retrieval pipeline is designed to produce stable, LLM-ready context:

User Query
 → Query Normalization & Deduplication
 → Query Embedding
 → Centralized Pinecone Search
 → Score-Based Reranking
 → Result Quality Validation
 → Clean Retrieval Output

Retrieval output is intentionally structured to support future LLM integration without refactors.
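
A sketch of the retrieval flow, assuming LangChain's HuggingFaceInferenceEmbeddings for query embedding and the official Pinecone client for search. The normalization helper, score threshold, environment variable names, and output shape are illustrative assumptions standing in for the modules under server/retrieval/.

import { HuggingFaceInferenceEmbeddings } from "@langchain/community/embeddings/hf";
import { Pinecone } from "@pinecone-database/pinecone";

// Hypothetical stand-in for the retrieval/parse module.
const normalizeQuery = (q) => q.trim().toLowerCase().replace(/\s+/g, " ");

async function retrieveContext(rawQuery, userId) {
  const query = normalizeQuery(rawQuery);

  const embeddings = new HuggingFaceInferenceEmbeddings({
    apiKey: process.env.HUGGINGFACE_API_KEY,        // assumed variable name
  });
  const vector = await embeddings.embedQuery(query);

  const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
  const index = pc.index(process.env.PINECONE_INDEX); // assumed variable name

  // Namespace isolation: each user's vectors live in their own namespace.
  const result = await index.namespace(userId).query({
    vector,
    topK: 10,
    includeMetadata: true,
  });

  // Score-based reranking and quality validation before returning clean output.
  return result.matches
    .sort((a, b) => b.score - a.score)
    .filter((m) => m.score >= 0.5)                  // illustrative threshold
    .map((m) => ({ text: m.metadata?.text, score: m.score, source: m.metadata?.source }));
}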

🛡 Guardrails & Cost Control

The system enforces explicit safeguards to ensure predictable behavior:

  • Hard chunk limits to prevent runaway documents
  • Low-information chunk filtering
  • Skip semantics for invalid documents (non-retryable)
  • Idempotent ingestion using Redis-backed content fingerprints
  • Batched Pinecone indexing with isolated retries
  • Namespace isolation to prevent cross-user data leakage
  • Retrieval result validation to avoid low-signal context

Invalid documents are intentionally skipped, not failed.
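
A minimal sketch of the fingerprint-based idempotency check, assuming an ioredis client and a SHA-256 hash of the normalized content. The key prefix and environment variable name are illustrative, not the exact implementation.

import { createHash } from "node:crypto";
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL);     // shared client (assumed env name)

// Returns true if this content is new and has now been claimed; false if it was
// already ingested. SET ... NX makes the check-and-set atomic, so two concurrent
// workers cannot both claim the same document.
async function claimFingerprint(normalizedText) {
  const fingerprint = createHash("sha256").update(normalizedText).digest("hex");
  const key = `ingest:fingerprint:${fingerprint}`;  // illustrative key prefix
  const stored = await redis.set(key, Date.now().toString(), "NX");
  return stored === "OK";                           // null => already ingested => skip, don't fail
}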

🏗 Architecture Overview

Client (Next.js + Clerk)
        ↓
Express API
  ├─ Auth
  ├─ Upload
  └─ Retrieval
        ↓
Valkey / Redis (persistent)
  ├─ BullMQ (job queue)
  ├─ Fingerprint store (idempotent ingestion)
  └─ Shared Redis client
        ↓
Background Worker (ingestion only)
  ├─ Extraction
  ├─ Cleaning & Normalization
  ├─ Deduplication
  ├─ Chunking
  ├─ Validation
  ├─ Embeddings
  └─ Pinecone Indexing

Express API (retrieval path)
        ↓
Retrieval Pipeline
  ├─ Query normalization
  ├─ Query embedding
  ├─ Pinecone search
  ├─ Reranking
  └─ Result validation

  • Authentication enforced at the API boundary
  • All heavy processing is offloaded to background workers
  • Pipelines are deterministic and retry-safe
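
To make the queue boundary concrete, here is a sketch of how the API side might enqueue ingestion jobs over a single shared ioredis connection with BullMQ. The queue name, job name, and retry settings are assumptions for illustration.

import IORedis from "ioredis";
import { Queue } from "bullmq";

// One shared connection for BullMQ, fingerprints, and batching utilities.
// BullMQ requires maxRetriesPerRequest: null on connections used by workers.
const connection = new IORedis(process.env.REDIS_URL, { maxRetriesPerRequest: null });

const ingestionQueue = new Queue("ingestion", { connection }); // assumed queue name

// API side: enqueue after the upload is accepted; heavy work happens in the worker.
export async function enqueueIngestion({ userId, filePath, originalName }) {
  await ingestionQueue.add("ingest-document", { userId, filePath, originalName }, {
    attempts: 3,                                     // queue-level retries
    backoff: { type: "exponential", delay: 5000 },
  });
}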

Tech Stack

Frontend

  • Next.js
  • TypeScript
  • Clerk Authentication

Backend

  • Node.js
  • Express
  • Multer
  • BullMQ (queue + worker)
  • Valkey / Redis

AI / Vector Infrastructure

  • LangChain
  • HuggingFace Inference embeddings
  • Pinecone (vector database)

Infrastructure

  • Docker
  • Docker Compose

📁 Project Structure

pdf_rag_assistant
├── server
│   ├── app.js                  # API server
│   ├── worker.js               # Background worker
│   ├── common/                 # Shared cross-pipeline logic
│   │   ├── embeddings/
│   │   ├── vectorStore/
│   │   ├── normalize/
│   │   ├── quality/
│   │   └── context/
│   ├── ingestion/
│   │   ├── extract/
│   │   ├── clean/
│   │   ├── dedupe/
│   │   ├── chunk/
│   │   ├── enrich/
│   │   ├── index/
│   │   └── pipeline.js
│   ├── retrieval/
│   │   ├── parse/
│   │   ├── embed/
│   │   ├── search/
│   │   ├── rerank/
│   │   └── pipeline.js
│   ├── routes/
│   ├── utils/
│   └── config/
├── client/web
├── scripts
└── docker-compose.yml

Local Setup

Prerequisites

  • Node.js (v18+)
  • Docker

Environment Variables

Create environment files using the provided .env.example templates. A single Redis connection is shared by queues, idempotency, and batching utilities.

Redis / Valkey Persistence

Redis / Valkey is used for:

  • Job queueing (BullMQ)
  • Persistent ingestion fingerprints
  • Resume-safe processing

Redis persistence must be enabled to preserve idempotency across restarts.

Supported persistence modes:

  • RDB snapshots (recommended)
  • AOF (append-only file)

When using Docker, a volume must be mounted to persist data.

Running the Application

From the project root, run:

npm install
npm run dev

This will:

  • Start Redis / Valkey
  • Start the backend API
  • Start the background worker
  • Start the frontend application

🔄 Background Worker Design

  • Workers consume jobs from a shared queue
  • Concurrency is intentionally limited
  • Files are read using normalized absolute paths
  • Extraction uses buffer-based loading for cross-platform safety
  • Batch-level retries occur only where failures happen
  • Invalid documents are skipped, not retried
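
Under the same assumptions as the earlier queue sketch, the worker side might look like the following: a BullMQ Worker with capped concurrency, absolute-path normalization, and buffer-based file loading. runIngestionPipeline is the hypothetical orchestrator sketched in the ingestion section.

import path from "node:path";
import { readFile } from "node:fs/promises";
import IORedis from "ioredis";
import { Worker } from "bullmq";

const connection = new IORedis(process.env.REDIS_URL, { maxRetriesPerRequest: null });

const worker = new Worker(
  "ingestion",                                       // must match the queue name
  async (job) => {
    const { userId, filePath, originalName } = job.data;

    // Normalized absolute path + buffer-based loading for cross-platform safety.
    const absolutePath = path.resolve(filePath);
    const buffer = await readFile(absolutePath);

    // "skipped" results resolve normally; only thrown errors trigger queue-level retries.
    return runIngestionPipeline({ buffer, originalName, userId });
  },
  { connection, concurrency: 2 }                     // intentionally limited
);

worker.on("failed", (job, err) => {
  console.error(`Ingestion job ${job?.id} failed:`, err.message);
});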

📌 Current Project Status

Feature                          Status
Authentication & uploads         ✅ Implemented
Asynchronous ingestion           ✅ Implemented
Multi-format extraction          ✅ Implemented
Cleaning & deduplication         ✅ Implemented
Embeddings & Pinecone storage    ✅ Implemented
Retrieval API                    ✅ Implemented
LLM answer generation            🔜 Planned

🧭 Design Philosophy

  • Clear separation of concerns (API, worker, pipelines)
  • Deterministic, retry-safe ingestion and retrieval
  • Infrastructure decoupled from business logic
  • Idempotency backed by persistent state, treated as a correctness requirement
  • Prefer no ingestion over incorrect ingestion
  • LLM-ready but not LLM-dependent
  • Single shared Redis client to prevent connection sprawl

Batching & Retry Strategy

  • Batching applied only at network boundaries
  • Embeddings batched internally by LangChain
  • Pinecone indexing batched at the application level
  • Each batch logged with start / success / failure states
  • Retries scoped to failing batches, not entire documents
  • Failures propagate cleanly for queue-level retries
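
A sketch of the application-level batching for Pinecone indexing, with per-batch start/success/failure logging and retries scoped to the failing batch. The batch size, retry count, and backoff are illustrative values, not the repository's actual configuration.

// Batched, retry-safe Pinecone indexing; batch size and retry count are illustrative.
async function upsertInBatches(index, namespace, vectors, { batchSize = 100, maxRetries = 3 } = {}) {
  for (let start = 0; start < vectors.length; start += batchSize) {
    const batch = vectors.slice(start, start + batchSize);
    const label = `batch ${start / batchSize + 1} (${batch.length} vectors)`;
    console.log(`[index] start ${label}`);

    let attempt = 0;
    for (;;) {
      try {
        await index.namespace(namespace).upsert(batch); // [{ id, values, metadata }]
        console.log(`[index] success ${label}`);
        break;
      } catch (err) {
        attempt += 1;
        console.warn(`[index] failure ${label}, attempt ${attempt}: ${err.message}`);
        if (attempt >= maxRetries) throw err;           // propagate for queue-level retry
        await new Promise((r) => setTimeout(r, 1000 * attempt)); // simple backoff
      }
    }
  }
}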

🔮 Planned Enhancements

  • LLM-based answer generation
  • Context window management
  • Source citations
  • Hybrid retrieval (keyword + vector)
  • Multi-document reasoning

📎 Summary

This project provides a production-grade ingestion and retrieval foundation for RAG systems.

The pipelines are deterministic, idempotent, cost-aware, and designed to run safely at scale without duplicate data or unbounded embedding costs.