
To build a Retrieval-Augmented Generation (RAG) system from scratch, you first create an ingestion pipeline that loads and processes your data into a vector store (chunking documents, generating embeddings, and storing them). You then build a retrieval pipeline that embeds a user's query, searches the vector store for the most relevant chunks, and passes those chunks, together with the query, to a generation model that produces the final answer.


🧠 Build Retrieval Augmented Generation (RAG) from Scratch

📌 Objective

This project demonstrates how to implement a multi-modal Retrieval Augmented Generation (RAG) system from scratch using:

  • PHI-3 Vision Model (for generation)
  • Jina-CLIP-V1 (for text + image embeddings)
  • ChromaDB (for vector database storage and retrieval)

We apply this system to the research paper "Attention is All You Need" to retrieve content and generate responses.


⚙️ Components

  1. 📄 Document Ingestion

    • Load the PDF using fitz (PyMuPDF).
    • Extract both text and images from the paper.
    • Split the extracted content into chunks ready for embedding (see the sketches after this list).
  2. 🔍 Embeddings Generation

    • Use Jina-CLIP-V1 from Hugging Face transformers.
    • Encode both text passages and images into embeddings.
  3. 🗄️ Vector Database (ChromaDB)

    • Store embeddings in ChromaDB.
    • Each entry contains:
      • id
      • text
      • image (optional)
      • embedding
  4. 📥 Retrieval Mechanism

    • On query, generate query embedding using Jina-CLIP-V1.
    • Retrieve top-k similar document segments from ChromaDB.
  5. 🤖 Generation Model (PHI-3 Vision)

    • Pass retrieved context to PHI-3 Vision model.
    • Generate context-aware text output (answers, summaries, explanations).

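The five components above can be sketched in a little Python. The snippets below are minimal illustrations under stated assumptions, not the repository's actual code: the file paths, chunk size, collection name, prompt template, and generation settings are all invented for the example, and both models are loaded with trust_remote_code=True as their Hugging Face model cards require.

Ingestion, embedding, and storage (components 1-3):

```python
# ingest_sketch.py -- illustrative only; paths, chunk size and collection name are assumptions
import io

import chromadb
import fitz  # PyMuPDF
from PIL import Image
from transformers import AutoModel

# Jina-CLIP-V1 exposes encode_text / encode_image helpers via trust_remote_code
clip = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

doc = fitz.open("attention_is_all_you_need.pdf")   # assumed local copy of the paper

texts, images = [], []
for page in doc:
    # 1. Document ingestion: plain text, split into fixed-size chunks (size is arbitrary here)
    page_text = page.get_text()
    texts += [page_text[i:i + 1000] for i in range(0, len(page_text), 1000)]

    # ...and every embedded image on the page, decoded into a PIL object
    for xref, *_ in page.get_images(full=True):
        img_bytes = doc.extract_image(xref)["image"]
        images.append(Image.open(io.BytesIO(img_bytes)).convert("RGB"))

# 2. Embeddings: both modalities end up in the same vector space
text_embs = clip.encode_text(texts)      # numpy array, one row per chunk
image_embs = clip.encode_image(images)   # accepts PIL images (or paths/URLs)

# 3. Vector store: one ChromaDB entry per chunk/image, with id, text and embedding
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("attention_paper")

collection.add(
    ids=[f"text_{i}" for i in range(len(texts))],
    documents=texts,
    embeddings=[e.tolist() for e in text_embs],
    metadatas=[{"type": "text"}] * len(texts),
)
collection.add(
    ids=[f"image_{i}" for i in range(len(images))],
    documents=[f"figure {i} from the paper" for i in range(len(images))],  # placeholder text
    embeddings=[e.tolist() for e in image_embs],
    metadatas=[{"type": "image"}] * len(images),
)
```

Retrieval and generation (components 4-5):

```python
# rag_sketch.py -- illustrative only; the prompt template and generation settings are made up
import chromadb
from transformers import AutoModel, AutoModelForCausalLM, AutoProcessor

clip = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("attention_paper")

query = "Explain the concept of self-attention"

# 4. Retrieval: embed the query with the same model used at ingestion, then do a top-k search
query_emb = clip.encode_text([query])[0].tolist()
results = collection.query(query_embeddings=[query_emb], n_results=3)
context = "\n\n".join(results["documents"][0])

# 5. Generation: hand the retrieved context plus the question to PHI-3 Vision
model_id = "microsoft/Phi-3-vision-128k-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, _attn_implementation="eager"  # eager avoids flash-attn
)

prompt = (
    "<|user|>\nAnswer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}\n<|end|>\n<|assistant|>\n"
)
# Text-only prompt here for simplicity; retrieved images could be passed instead of None
inputs = processor(prompt, images=None, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
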
🖼️ Workflow Diagram

🔹 High-Level Architecture

```mermaid
flowchart TD
    A[User Query] --> B[Encode Query with Jina-CLIP-V1]
    B --> C[ChromaDB Retrieval]
    C -->|Top-k Relevant Segments| D[PHI-3 Vision Model]
    D --> E[Generated Output]
```
🔹 Multi-Modal RAG Pipeline

```mermaid
flowchart LR
    P[PDF Ingestion] --> E1[Text Extraction]
    P --> E2[Image Extraction]
    E1 --> G1[Text Embeddings - Jina CLIP]
    E2 --> G2[Image Embeddings - Jina CLIP]
    G1 --> DB[ChromaDB Vector Store]
    G2 --> DB
    Q[User Query] --> Q1[Query Embedding]
    Q1 --> DB
    DB --> R[Retrieve Relevant Chunks]
    R --> M[PHI-3 Vision Model]
    M --> O[Generated Answer]
```
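Under the hood, the "Retrieve Relevant Chunks" step is nearest-neighbour search in embedding space. Here is a toy numpy illustration of top-k retrieval by cosine similarity (ChromaDB handles this for you internally via an HNSW index, ranking by L2 distance by default, but the idea is the same):

```python
import numpy as np

def top_k_by_cosine(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k document embeddings most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity of every chunk vs. the query
    return np.argsort(scores)[::-1][:k]  # highest-scoring chunks first

# Tiny fake corpus: 5 chunks embedded in a 4-dimensional space
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(5, 4))
query_emb = doc_embs[2] + 0.05 * rng.normal(size=4)   # a query close to chunk 2

print(top_k_by_cosine(query_emb, doc_embs))           # chunk 2 should rank first
```
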
🛠️ Installation

📋 Requirements

The project is restricted to the following libraries:

  • torch
  • chromadb
  • numpy
  • io
  • fitz (PyMuPDF)
  • requests
  • PIL
  • transformers

📥 Setup

```bash
# Clone repo
git clone https://github.com/your-username/RAG-from-scratch.git
cd RAG-from-scratch

# Create virtual env
python3 -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

# Install allowed libraries
pip install torch chromadb numpy pymupdf requests pillow transformers
```
▶️ Running the Project

Ingest the Document

```bash
python ingest.py
```

This script:

  • Extracts text and images from the PDF
  • Creates embeddings using Jina-CLIP-V1
  • Stores them in ChromaDB
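After ingestion, you can sanity-check what landed in the vector store. A minimal sketch, assuming the collection is persisted under ./chroma_db with the name attention_paper (both are assumptions, not necessarily what ingest.py uses):

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")   # assumed persist location
collection = client.get_collection("attention_paper")    # assumed collection name

print("stored entries:", collection.count())
print(collection.peek(limit=2))   # a couple of ids, documents and embeddings
```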

Run the RAG System

```bash
python rag.py --query "Explain the concept of self-attention"
```
Expected Output

```text
🔍 Retrieved Context:
"Self-attention allows the model to weigh different parts of the input sequence..."

🤖 Generated Answer:
"Self-attention is a mechanism where each word in a sequence can attend to every other word..."
```