A fully offline Retrieval-Augmented Generation (RAG) system built in Python. This project allows you to ingest documents, generate and persist embeddings, and query them locally using an LLM — without any internet connection once models are installed.
The system is optimized for performance by persisting document embeddings and caching queries and responses in memory.
- 🔒 100% Offline RAG (after model setup)
- 📄 Document ingestion and chunking
- 🧠 Embedding generation using nomic-embed-text
- 🗄️ Vector search powered by FAISS
- 💾 Persistent embeddings stored as .npy files (no re-embedding on restart)
- ⚡ In-memory caching for:
  - Query embeddings
  - Model responses
- 🖥️ Simple command-line interface (CLI)
- 🧩 Modular and easy to extend
- LLM Runtime: Ollama
- LLM: llama3.2
- Embedding Model: nomic-embed-text
- Vector Database: FAISS
- Language: Python (3.8+)
- Documents are ingested from disk
- Documents are chunked into smaller text segments
- Chunks are embedded using nomic-embed-text
- Embeddings are:
  - Indexed in FAISS
  - Persisted to disk as .npy files
- User queries (via CLI) are:
  - Embedded
  - Cached in memory
- Relevant chunks are retrieved via FAISS similarity search
- Retrieved context is passed to llama3.2
- Final responses are:
  - Returned to the user
  - Cached in memory for faster repeat queries
⚠️ Note: Only document embeddings are persisted. Query and response caches are in-memory only for now.
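
To make the flow above concrete, here is a minimal sketch of the ingest → embed → index → query loop. It assumes the `ollama` Python client and `faiss-cpu` packages; the function names (`chunk`, `embed`, `build_index`, `answer`) and the `embeddings.npy` path are illustrative only and do not reflect how `app.py` is actually organized.

```python
"""Minimal sketch of the offline RAG loop (illustrative; not the actual app.py)."""
import os
from typing import List

import faiss
import numpy as np
import ollama

EMB_PATH = "embeddings.npy"  # hypothetical location for the persisted embeddings


def chunk(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split raw text into overlapping fixed-size character segments."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def embed(text: str) -> np.ndarray:
    """Embed one string with nomic-embed-text via the local Ollama server."""
    result = ollama.embeddings(model="nomic-embed-text", prompt=text)
    return np.array(result["embedding"], dtype="float32")


def build_index(chunks: List[str]) -> faiss.IndexFlatL2:
    """Load persisted embeddings if present; otherwise embed the chunks and save them."""
    if os.path.exists(EMB_PATH):
        vectors = np.load(EMB_PATH)
    else:
        vectors = np.stack([embed(c) for c in chunks])
        np.save(EMB_PATH, vectors)
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    return index


def answer(query: str, chunks: List[str], index: faiss.IndexFlatL2, k: int = 3) -> str:
    """Retrieve the top-k chunks and pass them to llama3.2 as context."""
    _, ids = index.search(embed(query).reshape(1, -1), k)
    context = "\n\n".join(chunks[i] for i in ids[0])
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    return response["message"]["content"]
```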
Download and install Ollama from https://ollama.com.
Once Ollama is installed, pull the required models:

```bash
ollama pull llama3.2
ollama pull nomic-embed-text
```

These models are stored locally and used fully offline.
```bash
git clone https://github.com/miracletim/faiss-rag-offline.git
cd faiss-rag-offline
```

Ensure you have Python 3.8 or higher, then run:

```bash
pip install -r requirements.txt
```

Simply run the app entry point:

```bash
python app.py
```

The system is self-guided and will:
- Inform you if required folders, files, or models are missing
- Guide you through the setup if something is not configured correctly
- Document embeddings are saved as .npy files
- On subsequent runs, embeddings are loaded from disk instead of recomputed
- This significantly improves startup and query performance
| Cached Item | Storage | Persisted |
|---|---|---|
| Document embeddings | Disk (.npy) | ✅ Yes |
| Query embeddings | Memory | ❌ No |
| LLM responses | Memory | ❌ No |
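
For illustration, the in-memory caches can be as simple as dictionaries keyed by the raw query string; the helper names below are hypothetical and not taken from the project code.

```python
"""Illustrative in-memory caches for query embeddings and LLM responses (not persisted)."""
from typing import Callable, Dict

import numpy as np

_embedding_cache: Dict[str, np.ndarray] = {}
_response_cache: Dict[str, str] = {}


def cached_query_embedding(query: str, embed: Callable[[str], np.ndarray]) -> np.ndarray:
    """Return the cached embedding for a query, computing it only on a cache miss."""
    if query not in _embedding_cache:
        _embedding_cache[query] = embed(query)
    return _embedding_cache[query]


def cached_response(query: str, generate: Callable[[str], str]) -> str:
    """Return the cached LLM response for a repeated query, calling generate() on a miss."""
    if query not in _response_cache:
        _response_cache[query] = generate(query)
    return _response_cache[query]
```

Because these dictionaries live in process memory, they are empty again on every restart, which is why only the .npy document embeddings survive across runs.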
- Python 3.8+
- Ollama (installed locally)
- llama3.2 model
- nomic-embed-text model
All Python dependencies are listed in requirements.txt.
- Persist query & response cache
- Support for multiple embedding files
- Configurable chunk sizes
- Streaming responses
- Optional UI (web or desktop)
Contributions, ideas, and improvements are welcome. Feel free to fork the repo and submit a pull request.
MIT
Miracle Timothy - Full Stack Developer | AI Systems Builder
"Offline-first AI systems are not a limitation — they are a design choice." 🚀