An end-to-end Retrieval-Augmented Generation (RAG) system and conversational agent for yoga knowledge. Provides context-aware answers and guidance on yoga poses, breathing techniques (pranayama), and sequencing.
This project demonstrates full-stack RAG system design with rigorous retrieval evaluation, LLM model benchmarking, monitoring, and production deployment.
- Project Status
- Final Retrieval Results
- Quick Start
- Usage Examples
- Project Structure
- Experiments Completed
- Monitoring Dashboard
- Technology Stack
- Data
- Documentation
- Experiment Findings Summary
- License
Status: ✅ Complete
- Hybrid search (BM25 + vector, alpha=0.4)
- LLM-powered RAG pipeline
- Streamlit conversational UI
- PostgreSQL logging + conversation history
- Feedback system (thumbs up/down)
- Grafana dashboard (7 visualizations)
- Docker Compose deployment
- Embedding cache for fast startup
- 202-pose yoga dataset with full metadata
- Hybrid search integrated in production
- RAG pipeline implemented end-to-end
- LLM model evaluation + prompt benchmarking
- Query rewriting evaluation (rejected)
- Document re-ranking evaluation (rejected)
- Full documentation, notebooks, experiments
- Monitoring setup with Grafana
- Docker stack complete
| Approach | Hit Rate | MRR | Status |
|---|---|---|---|
| BM25 Text Search | 76.0% | 53.5% | ✅ Baseline |
| Vector Search | 69.3% | 58.3% | ✅ Baseline |
| Hybrid Weighted Product | 76.0% | 66.0% | ✅ BEST (+23% MRR) |
Key Achievement: Hybrid search improved MRR by 23% while maintaining BM25's recall!
Comparison of different retrieval approaches
Optimal alpha value (0.4) for hybrid search
# Recommended setup for yoga-assistant
BM25: all 7 fields (pose_name, sanskrit_name, category, difficulty_level, benefits, contraindications, instructions)
Vector: all-mpnet-base-v2 (768 dimensions)
Hybrid: Weighted Product with alpha=0.4
Top-K: 5 results

Prerequisites:
- Docker and Docker Compose
- Hyperbolic API key (get from https://app.hyperbolic.xyz/)
One command to run everything:
# 1. Set up environment
cp .env.template .env
# Edit .env and add your Hyperbolic API key
# 2. Start everything
docker-compose up --build

That's it!
- App: http://localhost:8501
- Grafana: http://localhost:3000 (admin/admin)
Database initialization happens automatically. No manual setup needed.
See DOCKER.md for advanced configuration and troubleshooting.
For developers who want to modify code:
# 1. Start database only
docker-compose up -d postgres grafana
# 2. Install dependencies
uv sync
# 3. Set up environment
cp .env.template .env
# Edit .env and add your API key
# 4. Initialize database (first time only)
python setup_database.py
# 5. Run app locally
streamlit run yoga_assistant/app.py

# Open Jupyter notebooks
jupyter notebook notebooks/
# Key notebooks:
# - 03-retrieval-experiments.ipynb - Retrieval evaluation
# - 04-rag-experiments.ipynb - RAG pipeline experiments
# - 05-llm-evaluation.ipynb - LLM model comparison

Question: "How do I do downward facing dog?"
Answer:
Downward-Facing Dog (Adho Mukha Svanasana) Instructions
Category: Standing | Difficulty: Beginner
Step-by-Step Instructions:
1. Start on all fours with hands shoulder-width apart
2. Walk your hands forward a few inches
3. Lift your hips up and back, straightening arms and legs
4. Engage your core muscles
5. Keep palms and heels grounded
6. Hold for 3-5 breaths
Tips and Modifications:
- Use blocks or strap for wrist support
- Bend knees slightly for shoulder/back issues
- Pregnant women: keep knees bent, hips lifted slightly
Retrieved Poses: Downward-Facing Dog, One-Legged Downward-Facing Dog, Upward-Facing Dog
Response Time: 2.0s | Tokens: 1,894
Question: "How long should I hold Warrior II?"
Answer:
Holding Time for Warrior II (Virabhadrasana II)
Recommended: 3-5 breaths per side
This allows you to:
- Establish a strong foundation
- Feel the stretch in chest and shoulders
- Improve balance and focus
- Cultivate confidence and inner strength
Tips:
- Start with 2-3 breaths and gradually increase
- Focus on steady, smooth breathing
- Engage core and maintain proper alignment
- Release immediately if you feel discomfort
Retrieved Poses: Warrior Pose I, Warrior Pose II, Warrior II with a Twist
Response Time: 2.2s | Tokens: 1,701
yoga-assistant-knowledge-rag/
├── data/
│   ├── yoga_data_merged.csv      # 202 yoga poses with full details
│   └── ground_truth.csv          # 75 test queries for evaluation
│
├── notebooks/                    # Experimentation and evaluation
│   ├── 01-data-generation.ipynb
│   ├── 02-ground-truth-data-generation.ipynb
│   ├── 03-retrieval-experiments.ipynb
│   ├── 04-rag-experiments.ipynb
│   └── 05-llm-evaluation.ipynb
│
├── yoga_assistant/               # Production application code
│   ├── app.py                    # Streamlit UI
│   ├── rag.py                    # RAG pipeline with LLM
│   ├── retrieval.py              # Hybrid search (BM25 + Vector)
│   ├── ingest.py                 # Data loading and validation
│   ├── db.py                     # Database operations
│   └── db_prep.py                # Database schema initialization
│
├── grafana/                      # Monitoring setup
│   ├── dashboard.json            # Dashboard configuration (7 panels)
│   ├── init.py                   # Automated setup script
│   └── README.md                 # Setup instructions
│
├── assets/                       # Images and screenshots
│   ├── yoga_assistant_ui.png
│   ├── grafana_dashboard.png
│   └── retrieval_comparison.png
│
├── Dockerfile                    # Application container
├── docker-compose.yml            # Full stack (app, postgres, grafana)
├── docker-entrypoint.sh          # Container startup script
├── .dockerignore                 # Docker build exclusions
├── setup_database.py             # Local dev DB initialization
├── pyproject.toml                # Dependencies (uv)
├── uv.lock                       # Locked dependency versions
├── .env.template                 # Environment variables template
├── .env                          # Configuration (not in git)
├── DOCKER.md                     # Docker deployment guide
└── README.md                     # This file
BM25 Text Search:
- Best Config: All 7 fields, top_k=5
- Hit Rate: 76.0%
- MRR: 53.5%
- Strengths: Excellent keyword matching, fast
- Weaknesses: Misses semantic queries, lower ranking quality
Vector Search:
- Best Model: all-mpnet-base-v2 (768 dims)
- Hit Rate: 69.3%
- MRR: 58.3%
- Strengths: Semantic understanding, better ranking
- Weaknesses: Lower recall than BM25
Hybrid Search - tested three combination strategies:
- RRF (Reciprocal Rank Fusion): 69.3% hit rate, 60.8% MRR
- Weighted Sum: 76.0% hit rate, 65.8% MRR
- Weighted Product: 76.0% hit rate, 66.0% MRR ✅ BEST
Winner: Weighted Product with alpha=0.4
- Maintains BM25's recall (76.0%)
- Improves ranking by 23% (66.0% vs 53.5% MRR)
- Balances keyword matching with semantic understanding
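A minimal sketch of the weighted-product fusion (the helper names and min-max normalization here are assumptions for illustration; the production logic lives in yoga_assistant/retrieval.py):

```python
# Illustrative weighted-product hybrid scoring; names are assumptions,
# not the exact code in retrieval.py.
import numpy as np

def min_max(scores: np.ndarray) -> np.ndarray:
    """Rescale raw scores to [0, 1] so BM25 and cosine scores are comparable."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def weighted_product(bm25_scores: np.ndarray, vector_scores: np.ndarray,
                     alpha: float = 0.4, top_k: int = 5) -> np.ndarray:
    """Fuse the two signals as BM25^alpha * Vector^(1 - alpha) and return top-k indices."""
    combined = min_max(bm25_scores) ** alpha * min_max(vector_scores) ** (1 - alpha)
    return np.argsort(combined)[::-1][:top_k]
```

With alpha=0.4 the BM25 exponent is 0.4 and the vector exponent is 0.6, matching the score formula discussed in the re-ranking experiment below.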
- Complete RAG Flow: Retrieval → Context Assembly → LLM Generation
- LLM: meta-llama/Meta-Llama-3.1-70B-Instruct via Hyperbolic API
- Performance: ~1.5s avg response time, ~1,700 tokens per query
- Status: Working end-to-end pipeline in notebook
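A minimal sketch of that flow, assuming a Hyperbolic base URL, a HYPERBOLIC_API_KEY environment variable, and illustrative prompt wording and field names (see yoga_assistant/rag.py for the real pipeline):

```python
# Sketch of the RAG flow: retrieval -> context assembly -> LLM generation.
# Base URL, env var name, and prompt wording are assumptions for illustration.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.hyperbolic.xyz/v1",   # Hyperbolic exposes an OpenAI-compatible API
    api_key=os.environ["HYPERBOLIC_API_KEY"],
)

def answer(question: str, retrieved_poses: list[dict]) -> str:
    # Context assembly: concatenate the retrieved pose records
    context = "\n\n".join(
        f"Pose: {p['pose_name']}\nInstructions: {p['instructions']}\nBenefits: {p['benefits']}"
        for p in retrieved_poses
    )
    # Generation: ask the model to answer strictly from the retrieved context
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        messages=[
            {"role": "system", "content": "Answer yoga questions using only the provided CONTEXT."},
            {"role": "user", "content": f"CONTEXT:\n{context}\n\nQUESTION: {question}"},
        ],
        temperature=0.3,
        max_tokens=500,
    )
    return response.choices[0].message.content
```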
Tested: Using LLM to enhance/clarify user questions before retrieval
Results (30 test queries):
| Metric | Without Rewriting | With Rewriting | Change |
|---|---|---|---|
| Hit Rate | 83.33% | 66.67% | -16.67% ❌ |
| MRR | 75.28% | 54.17% | -21.11% ❌ |
| Avg Latency | 1.52s | 2.44s | +60.4% ❌ |
Decision: ❌ DO NOT USE query rewriting in production
Why it failed:
- LLM rewrites are too verbose and technical
- Example: "What poses help with balance?" → "What yoga asanas are beneficial for improving balance and equilibrium, particularly those that target the ankles, calves, and core muscles?"
- Database uses simple language; technical terms don't match
- Adds 60% latency with significantly worse results
Lesson: Not all RAG "best practices" help every system. Simple user queries work better than LLM-enhanced ones for this use case.
Tested: Vector-based re-ranking on top of hybrid search results
Approach: Retrieve top 10 with hybrid search, then re-rank to top 5 using pure vector similarity
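A sketch of the re-ranking step that was evaluated (the candidate structure and the hybrid_search helper are assumptions for illustration):

```python
# Second-stage re-ranking by pure vector similarity (the approach that was rejected).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

def rerank_by_vector(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    """Re-order hybrid-search candidates by cosine similarity alone."""
    query_emb = model.encode(query, convert_to_tensor=True)
    doc_embs = model.encode([c["text"] for c in candidates], convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_embs)[0]
    order = scores.argsort(descending=True)[:top_k]
    return [candidates[int(i)] for i in order]

# candidates = hybrid_search(query, top_k=10)    # first stage (assumed helper)
# top5 = rerank_by_vector(query, candidates)     # second stage drops the BM25 signal
```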
Results (30 test queries):
| Metric | Hybrid Search | Hybrid → Vector Re-rank | Change |
|---|---|---|---|
| Hit Rate | 83.33% | 80.00% | -3.33% ❌ |
| MRR | 75.28% | 68.44% | -6.83% ❌ |
| Avg Latency | 1.67s | 1.46s | -12.5% ✅ |
Decision: ❌ DO NOT USE re-ranking in production
Why it failed:
- Signal degradation: Re-ranking throws away the BM25 component
- Hybrid search uses: score = BM25^0.4 × Vector^0.6 (optimized balance)
- Re-ranking uses: score = Vector only (loses keyword matching)
- Using the SAME vector embeddings that were already in hybrid search can't add new information
- It only removes the carefully tuned BM25 signal, making results worse
What would work:
- LLM-based re-ranking (too expensive/slow)
- Cross-encoder model (adds complexity)
- Different embedding model (marginal gains)
- None worth the complexity for 202 documents
Lesson: Re-ranking only helps when you add a NEW or BETTER signal. Using the same signal that was already in the first stage degrades the optimized balance.
Tested: 5 LLM models with 3 prompt templates (15 combinations) on 20 ground truth questions
Models Evaluated:
- DeepSeek-R1 (reasoning-focused)
- DeepSeek-V3 (general purpose)
- Qwen2.5-72B-Instruct
- Meta-Llama-3.1-70B-Instruct
- Hermes-3-Llama-3.1-70B (failed - API incompatible)
Prompt Templates:
- Concise: Brief, direct answers
- Detailed: Comprehensive explanations with context
- Structured: Clear, organized responses
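The exact prompt wording lives in the notebook; the sketch below only illustrates the shape of the three styles (all text here is assumed, not the project's actual prompts):

```python
# Illustrative prompt shapes only; the real templates live in notebook 05 / rag.py.
PROMPT_TEMPLATES = {
    "concise": (
        "Answer the question in 2-3 sentences using only the context.\n\n"
        "CONTEXT:\n{context}\n\nQUESTION: {question}"
    ),
    "detailed": (
        "Give a thorough answer with background and safety notes, using only the context.\n\n"
        "CONTEXT:\n{context}\n\nQUESTION: {question}"
    ),
    "structured": (
        "Answer using only the context. Organize the response into clear sections "
        "(instructions, benefits, contraindications) with numbered steps where relevant.\n\n"
        "CONTEXT:\n{context}\n\nQUESTION: {question}"
    ),
}

# prompt = PROMPT_TEMPLATES["structured"].format(context=context, question=question)
```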
Results (Top 3 by Quality Score):
| Model + Prompt | Quality Score | Relevant | Partly Relevant | Response Time | Tokens |
|---|---|---|---|---|---|
| DeepSeek-V3 + structured | 95.0% ✅ | 90% | 10% | 9.9s | 1962 |
| Qwen2.5-72B + structured | 85.0% | 70% | 30% | 6.7s | 2001 |
| Qwen2.5-72B + detailed | 85.0% | 70% | 30% | 5.5s | 1947 |
| Llama-3.1-70B + structured | 82.5% | 65% | 35% | 2.5s | 1865 |
| DeepSeek-V3 + detailed | 82.5% | 65% | 35% | 6.7s | 1837 |
| Llama-3.1-70B + concise | 75.0% | 50% | 50% | 1.3s | 1695 |
| Hermes-3-Llama-3.1-70B + all | 0.0% | 0% | 0% | 2.9s | 0 |
Decision: ✅ DeepSeek-V3 + structured prompt for production
Why it won:
- Highest relevance: 90% fully relevant answers (only 10% partly relevant, 0% non-relevant)
- Best quality score: 95% weighted quality (relevant × 1.0 + partly × 0.5)
- Meets primary criterion: >80% relevance target achieved
- Acceptable trade-off: 9.9s response time exceeds 5s target, but accuracy is critical for yoga guidance
Alternative: Qwen2.5-72B + detailed (85% quality, 5.5s) if speed is prioritized over accuracy
Key Findings:
- Structured prompts performed best across all models (avg 78.3% quality)
- DeepSeek-V3 achieved highest accuracy but slower responses (6-10s)
- Qwen2.5-72B offers best speed/quality balance (70% relevant, 5.5s)
- Llama-3.1-70B fastest but lower accuracy (50-65% relevant, 1.3-2.5s)
- DeepSeek-R1 moderate performance (40-60% relevant, 6-10s)
- Hermes-3 completely failed (100% errors, API incompatibility)
Evaluation Methodology:
- LLM-as-a-Judge using Llama-3.1-70B as evaluator
- Categories: RELEVANT / PARTLY_RELEVANT / NON_RELEVANT
- Measured: relevance, token usage, response time
- Sample size: 20 questions (limited by API costs)
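A sketch of the judging step and the weighted quality score (the judge prompt wording is an assumption; the evaluator model, labels, and relevant × 1.0 + partly × 0.5 weighting come from the evaluation above; `client` is an OpenAI-compatible client):

```python
# LLM-as-a-judge sketch: prompt wording is illustrative.
def judge_relevance(client, question: str, answer: str) -> str:
    prompt = (
        "Evaluate the answer to the yoga question below. Reply with exactly one label: "
        "RELEVANT, PARTLY_RELEVANT, or NON_RELEVANT.\n\n"
        f"QUESTION: {question}\n\nANSWER: {answer}"
    )
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",   # evaluator model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=10,
    )
    return response.choices[0].message.content.strip()

def quality_score(labels: list[str]) -> float:
    """Weighted quality used in the table: relevant * 1.0 + partly_relevant * 0.5."""
    n = len(labels)
    return labels.count("RELEVANT") / n + 0.5 * labels.count("PARTLY_RELEVANT") / n
```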
Production Configuration:
LLM_MODEL=deepseek-ai/DeepSeek-V3
PROMPT_TEMPLATE=structured
TEMPERATURE=0.3
MAX_TOKENS=500

What worked:
- Hybrid Search (Weighted Product, alpha=0.4): 76% hit rate and 66% MRR, preserving BM25 recall while improving ranking by 23%. Best balance of keyword and semantic signals.
- Simple User Queries: Natural phrasing outperforms LLM-enhanced rewrites. No preprocessing needed.
- RAG Pipeline: End-to-end flow is stable with ~1.5s notebook latency and strong answer quality.
- DeepSeek-V3 + Structured Prompt: 90% relevant answers and a 95% quality score. Best accuracy for yoga guidance across all models and templates.

What didn't work:
- Query Rewriting: MRR -21%, hit rate -17%, latency +60%. Rewrites are too verbose and mismatch the database's language. Removed from production.
- Document Re-ranking (same embeddings): MRR -6.8%, hit rate -3.3%. Throws away the BM25 signal and provides no new information. Only useful if powered by a new signal (LLM re-ranker, cross-encoder).
Lessons learned:
- Hybrid search (40% BM25, 60% vector) is difficult to beat for small datasets.
- Simple queries outperform rewritten ones for retrieval stability.
- Re-ranking only helps when adding a new or stronger signal.
- Structured prompts outperform concise/detailed across all LLMs.
- DeepSeek-V3 provides best accuracy; Qwen2.5-72B offers best speed/quality trade-off.
- For knowledge systems, accuracy is worth slower responses.
- Original 90/85% targets were unrealistic for a 202-document dataset.
- Always measure before adopting "best practices"; most don't generalize.
The system includes a comprehensive Grafana dashboard for monitoring performance and user feedback:
Real-time monitoring with 7 panels tracking conversations, feedback, cost, and performance
- Recent Conversations Table - Last 10 conversations with timestamps, questions, answers, and feedback
- User Feedback Distribution - Pie chart showing positive vs negative feedback
- Relevance Score - Gauge showing percentage of RELEVANT responses
- Model Usage Distribution - Bar chart showing which LLM models are being used
- LLM Cost Over Time - Time series tracking spending
- Token Usage Over Time - Time series monitoring token consumption
- Response Time Over Time - Time series tracking latency
# Start Grafana
docker-compose up -d grafana
# Initialize dashboard
python grafana/init.py
# Access dashboard
open http://localhost:3000/d/yoga-rag-dashboard
# Login: admin/admin

See grafana/README.md for detailed setup instructions.
- Language: Python 3.12
- Package Manager: uv
- Retrieval: rank-bm25, sentence-transformers (all-mpnet-base-v2)
- LLM: Hyperbolic API (OpenAI-compatible) with DeepSeek-V3
- Frontend: Streamlit
- Database: PostgreSQL 15
- Monitoring: Grafana
- Deployment: Docker + Docker Compose
- Yoga Poses: 202 poses with details (name, benefits, contraindications, instructions)
- Ground Truth: 75 test queries with known relevant poses
- Evaluation Metrics: Hit Rate, Mean Reciprocal Rank (MRR), Latency, Token Usage
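For reference, a minimal sketch of how the two retrieval metrics are computed (the (ranked_ids, relevant_id) pairing is illustrative):

```python
# Hit Rate: did the known-relevant pose appear in the top-k results?
# MRR: on average, how high did it rank (1/rank, 0 if missed)?
def hit_rate(results: list[tuple[list[str], str]], k: int = 5) -> float:
    return sum(rel in ranked[:k] for ranked, rel in results) / len(results)

def mrr(results: list[tuple[list[str], str]], k: int = 5) -> float:
    total = 0.0
    for ranked, rel in results:
        if rel in ranked[:k]:
            total += 1.0 / (ranked[:k].index(rel) + 1)
    return total / len(results)
```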
- notebooks/03-retrieval-experiments.ipynb - Retrieval approach evaluation
- notebooks/04-rag-experiments.ipynb - RAG pipeline, query rewriting, and re-ranking experiments
- notebooks/05-llm-evaluation.ipynb - LLM model and prompt evaluation
- notebooks/TASK_5.3_FINDINGS.md - Query rewriting evaluation findings
- Task tracking in .kiro/specs/yoga-rag-system/
MIT License - Open source project by Ramsi Kalia
