Yoga Assistant Knowledge RAG

An end-to-end Retrieval-Augmented Generation (RAG) system and conversational agent for yoga knowledge. Provides context-aware answers and guidance on yoga poses, breathing techniques (pranayama), and sequencing.

Screenshot: Yoga Assistant UI (assets/yoga_assistant_ui.png)

This project demonstrates full-stack RAG system design with rigorous retrieval evaluation, LLM model benchmarking, monitoring, and production deployment.

🎯 Project Status

Status: ✅ Complete

✨ Features & Completion Summary

Core Features

  • Hybrid search (BM25 + vector, alpha=0.4)
  • LLM-powered RAG pipeline
  • Streamlit conversational UI
  • PostgreSQL logging + conversation history
  • Feedback system (thumbs up/down)
  • Grafana dashboard (7 visualizations)
  • Docker Compose deployment
  • Embedding cache for fast startup
  • 202-pose yoga dataset with full metadata

Completed Work

  • Hybrid search integrated in production
  • RAG pipeline implemented end-to-end
  • LLM model evaluation + prompt benchmarking
  • Query rewriting evaluation (rejected)
  • Document re-ranking evaluation (rejected)
  • Full documentation, notebooks, experiments
  • Monitoring setup with Grafana
  • Docker stack complete

📊 Final Retrieval Results

Best Configuration: Hybrid Weighted Product (alpha=0.4)

| Approach | Hit Rate | MRR | Status |
|---|---|---|---|
| BM25 Text Search | 76.0% | 53.5% | ✅ Baseline |
| Vector Search | 69.3% | 58.3% | ✅ Baseline |
| Hybrid Weighted Product | 76.0% | 66.0% | ✅ BEST (+23% MRR) |

Key Achievement: Hybrid search improved MRR by 23% while maintaining BM25's recall!

Retrieval Comparison (assets/retrieval_comparison.png): comparison of the different retrieval approaches

Hybrid Search Alpha Tuning: optimal alpha value (0.4) for hybrid search

Production Configuration

# Recommended setup for yoga-assistant
BM25: all 7 fields (pose_name, sanskrit_name, category, difficulty_level, benefits, contraindications, instructions)
Vector: all-mpnet-base-v2 (768 dimensions)
Hybrid: Weighted Product with alpha=0.4
Top-K: 5 results
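
For reference, the weighted-product scoring above can be sketched in a few lines of Python. This is an illustrative sketch, not the exact yoga_assistant/retrieval.py implementation; the function names, min-max normalization, and epsilon guard are assumptions.

# Illustrative hybrid weighted-product scoring (see yoga_assistant/retrieval.py for the real code)
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

ALPHA = 0.4  # weight of the BM25 component

def minmax(scores):
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def hybrid_search(query, docs, bm25, model, doc_embeddings, top_k=5):
    bm25_scores = minmax(bm25.get_scores(query.lower().split()))   # keyword signal
    query_emb = model.encode(query, normalize_embeddings=True)
    vector_scores = minmax(doc_embeddings @ query_emb)             # semantic signal
    eps = 1e-6  # keep documents with a zero in one signal from vanishing entirely
    scores = (bm25_scores + eps) ** ALPHA * (vector_scores + eps) ** (1 - ALPHA)
    top = np.argsort(scores)[::-1][:top_k]
    return [docs[i] for i in top]

# Usage sketch:
# model = SentenceTransformer("all-mpnet-base-v2")
# doc_embeddings = model.encode(pose_texts, normalize_embeddings=True)
# bm25 = BM25Okapi([t.lower().split() for t in pose_texts])
# results = hybrid_search("poses for lower back pain", pose_texts, bm25, model, doc_embeddings)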

🚀 Quick Start

Prerequisites: Docker and Docker Compose (Option 1), or Python 3.12 with uv (Option 2). Either way you will need a Hyperbolic API key for the LLM.

Option 1: Docker (Recommended) ⭐

One command to run everything:

# 1. Set up environment
cp .env.template .env
# Edit .env and add your Hyperbolic API key

# 2. Start everything
docker-compose up --build

That's it!

Database initialization happens automatically. No manual setup needed.

See DOCKER.md for advanced configuration and troubleshooting.


Option 2: Local Development

For developers who want to modify code:

# 1. Start database only
docker-compose up -d postgres grafana

# 2. Install dependencies
uv sync

# 3. Set up environment
cp .env.template .env
# Edit .env and add your API key

# 4. Initialize database (first time only)
python setup_database.py

# 5. Run app locally
streamlit run yoga_assistant/app.py

App: http://localhost:8501

Run Experiments

# Open Jupyter notebooks
jupyter notebook notebooks/

# Key notebooks:
# - 03-retrieval-experiments.ipynb - Retrieval evaluation
# - 04-rag-experiments.ipynb - RAG pipeline experiments
# - 05-llm-evaluation.ipynb - LLM model comparison

💬 Usage Examples

Example 1: Pose Instructions

Question: "How do I do downward facing dog?"

Answer:

Downward-Facing Dog (Adho Mukha Svanasana) Instructions
Category: Standing | Difficulty: Beginner

Step-by-Step Instructions:
1. Start on all fours with hands shoulder-width apart
2. Walk your hands forward a few inches
3. Lift your hips up and back, straightening arms and legs
4. Engage your core muscles
5. Keep palms and heels grounded
6. Hold for 3-5 breaths

Tips and Modifications:
- Use blocks or strap for wrist support
- Bend knees slightly for shoulder/back issues
- Pregnant women: keep knees bent, hips lifted slightly

Retrieved Poses: Downward-Facing Dog, One-Legged Downward-Facing Dog, Upward-Facing Dog
Response Time: 2.0s | Tokens: 1,894


Example 2: Practice Guidance

Question: "How long should I hold Warrior II?"

Answer:

Holding Time for Warrior II (Virabhadrasana II)

Recommended: 3-5 breaths per side

This allows you to:
- Establish a strong foundation
- Feel the stretch in chest and shoulders
- Improve balance and focus
- Cultivate confidence and inner strength

Tips:
- Start with 2-3 breaths and gradually increase
- Focus on steady, smooth breathing
- Engage core and maintain proper alignment
- Release immediately if you feel discomfort

Retrieved Poses: Warrior Pose I, Warrior Pose II, Warrior II with a Twist
Response Time: 2.2s | Tokens: 1,701


πŸ“ Project Structure

yoga-assistant-knowledge-rag/
├── data/
│   ├── yoga_data_merged.csv      # 202 yoga poses with full details
│   └── ground_truth.csv          # 75 test queries for evaluation
│
├── notebooks/                    # Experimentation and evaluation
│   ├── 01-data-generation.ipynb
│   ├── 02-ground-truth-data-generation.ipynb
│   ├── 03-retrieval-experiments.ipynb
│   ├── 04-rag-experiments.ipynb
│   └── 05-llm-evaluation.ipynb
│
├── yoga_assistant/               # Production application code
│   ├── app.py                    # Streamlit UI
│   ├── rag.py                    # RAG pipeline with LLM
│   ├── retrieval.py              # Hybrid search (BM25 + Vector)
│   ├── ingest.py                 # Data loading and validation
│   ├── db.py                     # Database operations
│   └── db_prep.py                # Database schema initialization
│
├── grafana/                      # Monitoring setup
│   ├── dashboard.json            # Dashboard configuration (7 panels)
│   ├── init.py                   # Automated setup script
│   └── README.md                 # Setup instructions
│
├── assets/                       # Images and screenshots
│   ├── yoga_assistant_ui.png
│   ├── grafana_dashboard.png
│   └── retrieval_comparison.png
│
├── Dockerfile                    # Application container
├── docker-compose.yml            # Full stack (app, postgres, grafana)
├── docker-entrypoint.sh          # Container startup script
├── .dockerignore                 # Docker build exclusions
├── setup_database.py             # Local dev DB initialization
├── pyproject.toml                # Dependencies (uv)
├── uv.lock                       # Locked dependency versions
├── .env.template                 # Environment variables template
├── .env                          # Configuration (not in git)
├── DOCKER.md                     # Docker deployment guide
└── README.md                     # This file

🔬 Experiments Completed

1. BM25 Text Search ✅

  • Best Config: All 7 fields, top_k=5
  • Hit Rate: 76.0%
  • MRR: 53.5%
  • Strengths: Excellent keyword matching, fast
  • Weaknesses: Misses semantic queries, lower ranking quality

2. Vector Embeddings Search ✅

  • Best Model: all-mpnet-base-v2 (768 dims)
  • Hit Rate: 69.3%
  • MRR: 58.3%
  • Strengths: Semantic understanding, better ranking
  • Weaknesses: Lower recall than BM25

3. Hybrid Search ✅

Tested three combination strategies:

  • RRF (Reciprocal Rank Fusion): 69.3% hit rate, 60.8% MRR
  • Weighted Sum: 76.0% hit rate, 65.8% MRR
  • Weighted Product: 76.0% hit rate, 66.0% MRR ⭐ BEST

Winner: Weighted Product with alpha=0.4

  • Maintains BM25's recall (76.0%)
  • Improves ranking by 23% (66.0% vs 53.5% MRR)
  • Balances keyword matching with semantic understanding
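
For reference, the three fusion strategies boil down to the following per-document formulas. This is an illustrative sketch with assumed parameter names; scores are taken as min-max normalized, ranks are 1-based, and k=60 is the conventional RRF constant, not a value from the notebooks.

# Illustrative fusion formulas for a single document (not the notebook code)
def rrf(bm25_rank, vector_rank, k=60):
    # Reciprocal Rank Fusion combines ranks and ignores raw scores
    return 1 / (k + bm25_rank) + 1 / (k + vector_rank)

def weighted_sum(bm25_score, vector_score, alpha=0.4):
    # Linear blend of normalized scores
    return alpha * bm25_score + (1 - alpha) * vector_score

def weighted_product(bm25_score, vector_score, alpha=0.4):
    # Multiplicative blend: the winning configuration (alpha=0.4)
    return bm25_score ** alpha * vector_score ** (1 - alpha)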

4. RAG Pipeline Implementation ✅

  • Complete RAG Flow: Retrieval → Context Assembly → LLM Generation
  • LLM: meta-llama/Meta-Llama-3.1-70B-Instruct via Hyperbolic API
  • Performance: ~1.5s avg response time, ~1,700 tokens per query
  • Status: Working end-to-end pipeline in notebook
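
A minimal sketch of that flow, assuming an OpenAI-compatible client pointed at Hyperbolic; the base URL, environment variable name, prompt wording, and helper names are illustrative, not the production yoga_assistant/rag.py code.

# Illustrative RAG flow: retrieved poses -> context string -> LLM answer
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.hyperbolic.xyz/v1",   # assumed Hyperbolic endpoint (OpenAI-compatible)
    api_key=os.environ["HYPERBOLIC_API_KEY"],   # illustrative env var name
)

def build_context(poses):
    # Assemble retrieved pose records into a single context block
    return "\n\n".join(
        f"Pose: {p['pose_name']} ({p['sanskrit_name']})\n"
        f"Category: {p['category']} | Difficulty: {p['difficulty_level']}\n"
        f"Benefits: {p['benefits']}\n"
        f"Contraindications: {p['contraindications']}\n"
        f"Instructions: {p['instructions']}"
        for p in poses
    )

def answer(question, retrieved_poses, model="meta-llama/Meta-Llama-3.1-70B-Instruct"):
    prompt = (
        "You are a yoga assistant. Answer using only the CONTEXT below.\n\n"
        f"CONTEXT:\n{build_context(retrieved_poses)}\n\n"
        f"QUESTION: {question}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=500,
    )
    return response.choices[0].message.content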

5. Query Rewriting Evaluation ✅ ❌

Tested: Using LLM to enhance/clarify user questions before retrieval

Results (30 test queries):

| Metric | Without Rewriting | With Rewriting | Change |
|---|---|---|---|
| Hit Rate | 83.33% | 66.67% | -16.67% ❌ |
| MRR | 75.28% | 54.17% | -21.11% ❌ |
| Avg Latency | 1.52s | 2.44s | +60.4% ❌ |

Decision: ❌ DO NOT USE query rewriting in production

Why it failed:

  • LLM rewrites are too verbose and technical
  • Example: "What poses help with balance?" → "What yoga asanas are beneficial for improving balance and equilibrium, particularly those that target the ankles, calves, and core muscles?"
  • Database uses simple language; technical terms don't match
  • Adds 60% latency with significantly worse results

Lesson: Not all RAG "best practices" help every system. Simple user queries work better than LLM-enhanced ones for this use case.

6. Document Re-ranking Evaluation ✅ ❌

Tested: Vector-based re-ranking on top of hybrid search results

Approach: Retrieve top 10 with hybrid search, then re-rank to top 5 using pure vector similarity

Results (30 test queries):

| Metric | Hybrid Search | Hybrid → Vector Re-rank | Change |
|---|---|---|---|
| Hit Rate | 83.33% | 80.00% | -3.33% ❌ |
| MRR | 75.28% | 68.44% | -6.83% ❌ |
| Avg Latency | 1.67s | 1.46s | -12.5% ✓ |

Decision: ❌ DO NOT USE re-ranking in production

Why it failed:

  • Signal degradation: Re-ranking throws away the BM25 component
  • Hybrid search uses: score = BM25^0.4 × Vector^0.6 (optimized balance)
  • Re-ranking uses: score = Vector only (loses keyword matching)
  • Using the SAME vector embeddings that were already in hybrid search can't add new information
  • It only removes the carefully tuned BM25 signal, making results worse

What would work:

  • LLM-based re-ranking (too expensive/slow)
  • Cross-encoder model (adds complexity)
  • Different embedding model (marginal gains)
  • None worth the complexity for 202 documents

Lesson: Re-ranking only helps when you add a NEW or BETTER signal. Using the same signal that was already in the first stage degrades the optimized balance.

7. LLM Model and Prompt Evaluation ✅

Tested: 5 LLM models with 3 prompt templates (15 combinations) on 20 ground truth questions

Models Evaluated:

  • DeepSeek-R1 (reasoning-focused)
  • DeepSeek-V3 (general purpose)
  • Qwen2.5-72B-Instruct
  • Meta-Llama-3.1-70B-Instruct
  • Hermes-3-Llama-3.1-70B (failed - API incompatible)

Prompt Templates:

  • Concise: Brief, direct answers
  • Detailed: Comprehensive explanations with context
  • Structured: Clear, organized responses

Results (ranked by Quality Score):

| Model + Prompt | Quality Score | Relevant | Partly Relevant | Response Time | Tokens |
|---|---|---|---|---|---|
| DeepSeek-V3 + structured | 95.0% ⭐ | 90% | 10% | 9.9s | 1962 |
| Qwen2.5-72B + structured | 85.0% | 70% | 30% | 6.7s | 2001 |
| Qwen2.5-72B + detailed | 85.0% | 70% | 30% | 5.5s | 1947 |
| Llama-3.1-70B + structured | 82.5% | 65% | 35% | 2.5s | 1865 |
| DeepSeek-V3 + detailed | 82.5% | 65% | 35% | 6.7s | 1837 |
| Llama-3.1-70B + concise | 75.0% | 50% | 50% | 1.3s | 1695 |
| Hermes-3-Llama-3.1-70B + all | 0.0% | 0% | 0% | 2.9s | 0 |

Decision: ✅ DeepSeek-V3 + structured prompt for production

Why it won:

  • Highest relevance: 90% fully relevant answers (only 10% partly relevant, 0% non-relevant)
  • Best quality score: 95% weighted quality (relevant × 1.0 + partly × 0.5)
  • Meets primary criterion: >80% relevance target achieved
  • Acceptable trade-off: 9.9s response time exceeds 5s target, but accuracy is critical for yoga guidance

Alternative: Qwen2.5-72B + detailed (85% quality, 5.5s) if speed is prioritized over accuracy

Key Findings:

  • Structured prompts performed best across all models (avg 78.3% quality)
  • DeepSeek-V3 achieved highest accuracy but slower responses (6-10s)
  • Qwen2.5-72B offers best speed/quality balance (70% relevant, 5.5s)
  • Llama-3.1-70B fastest but lower accuracy (50-65% relevant, 1.3-2.5s)
  • DeepSeek-R1 moderate performance (40-60% relevant, 6-10s)
  • Hermes-3 completely failed (100% errors, API incompatibility)

Evaluation Methodology:

  • LLM-as-a-Judge using Llama-3.1-70B as evaluator
  • Categories: RELEVANT / PARTLY_RELEVANT / NON_RELEVANT
  • Measured: relevance, token usage, response time
  • Sample size: 20 questions (limited by API costs)
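
A sketch of the judging and scoring steps: the judge prompt and helper names are assumptions, `client` stands in for the same OpenAI-compatible client as in the RAG sketch above, and the actual evaluation lives in notebooks/05-llm-evaluation.ipynb.

# Illustrative LLM-as-a-Judge classification and weighted quality score
LABELS = ("RELEVANT", "PARTLY_RELEVANT", "NON_RELEVANT")

def judge(question, answer_text):
    prompt = (
        "Classify how well the ANSWER addresses the QUESTION.\n"
        f"Reply with exactly one of: {', '.join(LABELS)}.\n\n"
        f"QUESTION: {question}\nANSWER: {answer_text}"
    )
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # evaluator model used in the study
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict if verdict in LABELS else "NON_RELEVANT"

def quality_score(verdicts):
    # Weighted quality: RELEVANT counts 1.0, PARTLY_RELEVANT counts 0.5
    relevant = verdicts.count("RELEVANT") / len(verdicts)
    partly = verdicts.count("PARTLY_RELEVANT") / len(verdicts)
    return relevant * 1.0 + partly * 0.5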

Production Configuration:

LLM_MODEL=deepseek-ai/DeepSeek-V3
PROMPT_TEMPLATE=structured
TEMPERATURE=0.3
MAX_TOKENS=500

πŸ” Findings & Insights

What Works

  • Hybrid Search (Weighted Product, alpha=0.4)
    Achieved 76% hit rate and 66% MRR, preserving BM25 recall while improving ranking by 23%.
    Best balance of keyword and semantic signals.

  • Simple User Queries
    Natural phrasing outperforms LLM-enhanced rewrites.
    No preprocessing needed.

  • RAG Pipeline
    End-to-end flow is stable with ~1.5s notebook latency and strong answer quality.

  • DeepSeek-V3 + Structured Prompt
    90% relevant answers and 95% quality score.
    Best accuracy for yoga guidance across all models and templates.

What Doesn't Work

  • Query Rewriting

    • MRR: -21%
    • Hit rate: -17%
    • Latency: +60%
      Rewrites are too verbose and do not match the database's simple language; removed from production.
  • Document Re-ranking (same embeddings)

    • MRR: -6.8%
    • Hit rate: -3.3%
      Throws away BM25 signal and provides no new information.
      Only useful if powered by a new signal (LLM re-ranker, cross-encoder).

Key Lessons

  • Hybrid search (40% BM25, 60% vector) is difficult to beat for small datasets.
  • Simple queries outperform rewritten ones for retrieval stability.
  • Re-ranking only helps when adding a new or stronger signal.
  • Structured prompts outperform concise/detailed across all LLMs.
  • DeepSeek-V3 provides best accuracy; Qwen2.5-72B offers best speed/quality trade-off.
  • For knowledge systems, accuracy is worth slower responses.
  • Original 90/85% targets were unrealistic for a 202-document dataset.
  • Always measure before adopting "best practices"; most don't generalize.

📊 Monitoring Dashboard

The system includes a comprehensive Grafana dashboard for monitoring performance and user feedback:

Grafana Dashboard (assets/grafana_dashboard.png): real-time monitoring with 7 panels tracking conversations, feedback, cost, and performance

Dashboard Features

  • Recent Conversations Table - Last 10 conversations with timestamps, questions, answers, and feedback
  • User Feedback Distribution - Pie chart showing positive vs negative feedback
  • Relevance Score - Gauge showing percentage of RELEVANT responses
  • Model Usage Distribution - Bar chart showing which LLM models are being used
  • LLM Cost Over Time - Time series tracking spending
  • Token Usage Over Time - Time series monitoring token consumption
  • Response Time Over Time - Time series tracking latency

Setup Monitoring

# Start Grafana
docker-compose up -d grafana

# Initialize dashboard
python grafana/init.py

# Access dashboard
open http://localhost:3000/d/yoga-rag-dashboard
# Login: admin/admin

See grafana/README.md for detailed setup instructions.

πŸ› οΈ Technology Stack

  • Language: Python 3.12
  • Package Manager: uv
  • Retrieval: rank-bm25, sentence-transformers (all-mpnet-base-v2)
  • LLM: Hyperbolic API (OpenAI-compatible) with DeepSeek-V3
  • Frontend: Streamlit
  • Database: PostgreSQL 15
  • Monitoring: Grafana
  • Deployment: Docker + Docker Compose

📚 Data

  • Yoga Poses: 202 poses with details (name, benefits, contraindications, instructions)
  • Ground Truth: 75 test queries with known relevant poses
  • Evaluation Metrics: Hit Rate, Mean Reciprocal Rank (MRR), Latency, Token Usage
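
For reference, Hit Rate and MRR are computed over the ground-truth queries roughly as follows. This is a minimal sketch: it assumes one known-relevant pose per query and a search function returning ranked pose names; the actual evaluation lives in notebooks/03-retrieval-experiments.ipynb.

# Illustrative Hit Rate and MRR calculation over the ground-truth set
def evaluate(ground_truth, search_fn, top_k=5):
    hits = 0
    reciprocal_ranks = []
    for query, relevant_pose in ground_truth:
        results = search_fn(query)[:top_k]            # ranked pose names, best first
        if relevant_pose in results:
            hits += 1
            reciprocal_ranks.append(1 / (results.index(relevant_pose) + 1))
        else:
            reciprocal_ranks.append(0.0)
    hit_rate = hits / len(ground_truth)               # fraction of queries with the relevant pose in the top k
    mrr = sum(reciprocal_ranks) / len(ground_truth)   # mean reciprocal rank of the relevant pose
    return hit_rate, mrr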

📖 Documentation

  • notebooks/03-retrieval-experiments.ipynb - Retrieval approach evaluation
  • notebooks/04-rag-experiments.ipynb - RAG pipeline, query rewriting, and re-ranking experiments
  • notebooks/05-llm-evaluation.ipynb - LLM model and prompt evaluation
  • notebooks/TASK_5.3_FINDINGS.md - Query rewriting evaluation findings
  • Task tracking in .kiro/specs/yoga-rag-system/

📄 License

MIT License - Open source project by Ramsi Kalia
