A demo for my talk at PyData Eindhoven 2025: a Retrieval-Augmented Generation (RAG) system that ingests EPUB files from S3, creates vector embeddings, and deploys a question-answering service using Ray Serve on Anyscale.
Based on the Ray e2e-rag example.
- Overview
- Architecture
- Quick Start
- Deployment Guide
- Configuration
- Performance Tuning
- Troubleshooting
- Development
- Project Structure
- License
This project processes EPUB books (e.g., Wheel of Time series) and enables semantic question-answering using:
- Ray Data for distributed data processing
- ChromaDB for vector storage
- Sentence Transformers for embeddings
- Anyscale LLM Service for text generation (deployed separately)
- Ray Serve for scalable microservices architecture
┌─────────────────────────────┐
│ Anyscale LLM Service │
│ (Mistral-7B-v0.3 + vLLM) │
│ 32K context window │
│ Persistent, External │
└────────────┬────────────────┘
│ HTTP/OpenAI API
┌────────────▼────────────────┐
User ──────────────►│ QAGateway (8000) │
│ (FastAPI) │
└────────────┬────────────────┘
│
┌────────────▼────────────────┐
│ QA Service │
│ (orchestrates RAG) │
└──────┬──────────────┬────────┘
│ │
┌──────▼──────────┐ │
│ Retriever │ │
│ (two-stage) │ │
└──────┬──────────┘ │
│ │
┌───────────────┼──────────────┼─────┐
│ │ │ │
┌──────▼─────┐ ┌─────▼─────┐ ┌────▼────┐│
│QueryEncoder│ │ Vector │ │Reranker ││
│ (GPU) │ │ Store │ │ (GPU) ││
│ │ │ (ChromaDB)│ │Cross-Enc││
└────────────┘ └───────────┘ └─────────┘│
│
┌──────────▼──────┐
│ LLMClient │
│(calls external) │
└─────────────────┘
Two-Stage Retrieval Flow:
1. QueryEncoder: Embed query → vector
2. VectorStore: Retrieve 45 candidates (high recall)
3. Reranker: Rerank to top 15 (high precision)
4. LLMClient: Generate answer from reranked context

Data ingestion pipeline: S3 EPUBs → Parse → Chunk → Embed (GPU) → ChromaDB

Key components:
- download_books(): Download from S3
- parse_epub_content(): Extract text from EPUBs
- Chunker: Split text into chunks
- Embedder: Generate embeddings (GPU-accelerated)
- ChromaWriter: Store in ChromaDB
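For orientation, here is a minimal, non-distributed sketch of the same stages (the real pipeline runs them with Ray Data across the cluster; the file path and collection name below are placeholders):

```python
# Illustrative, non-distributed sketch of the ingestion stages.
# The real pipeline runs these steps with Ray Data; the EPUB path and
# collection name are placeholders.
import re

import chromadb
import ebooklib
from ebooklib import epub
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer


def parse_epub(path: str) -> str:
    """Extract plain text from every document item in an EPUB."""
    book = epub.read_epub(path)
    parts = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        html = item.get_content().decode("utf-8", errors="ignore")
        parts.append(re.sub(r"<[^>]+>", " ", html))  # crude HTML strip
    return "\n".join(parts)


splitter = RecursiveCharacterTextSplitter(chunk_size=4096, chunk_overlap=1536)
embedder = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
client = chromadb.PersistentClient(path="/mnt/cluster_storage/vector_store")
collection = client.get_or_create_collection("books")  # placeholder name

text = parse_epub("books/the_eye_of_the_world.epub")   # placeholder path
chunks = splitter.split_text(text)
embeddings = embedder.encode(chunks, batch_size=64, show_progress_bar=True)
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings.tolist(),
)
```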
The system consists of two independent Anyscale services:
- LLM Service (services/llm_service.yaml)
  - Runs the Mistral-7B-Instruct-v0.3 model (upgraded from v0.1)
  - 32K token context window (4x larger than v0.1's 8K)
  - OpenAI-compatible API at /v1/chat/completions
  - Auto-scales based on load
- RAG QA Service (services/rag_service.yaml)
  - QueryEncoder: Encode user queries into embeddings (0.1 GPU per replica)
  - VectorStore: Query ChromaDB for relevant document chunks
  - Reranker: Rerank candidates using a cross-encoder for precision (0.1 GPU per replica) [NEW]
  - Retriever: Orchestrate two-stage retrieval (vector search → reranking)
  - LLMClient: Call the external Anyscale LLM service (or OpenAI)
  - QA: Combine retrieval + generation workflow
  - QAGateway (port 8000): FastAPI HTTP interface for user queries
Two-Stage Retrieval: By default, retrieves 45 candidates with vector search, then reranks to select the top 15 most relevant chunks. This gives better precision than single-stage retrieval while keeping context manageable.
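As a rough illustration of what the Retriever orchestrates (model names are the ones used elsewhere in this README; the collection name and exact call structure are assumptions):

```python
# Illustrative two-stage retrieval: wide vector search, then cross-encoder
# reranking. In the deployed service these steps live in separate Ray Serve
# deployments (QueryEncoder, VectorStore, Reranker).
import chromadb
from sentence_transformers import CrossEncoder, SentenceTransformer

encoder = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
collection = chromadb.PersistentClient(
    path="/mnt/cluster_storage/vector_store"
).get_collection("books")  # collection name is a placeholder


def retrieve(query: str, top_k: int = 15, retrieval_k: int = 45) -> list[str]:
    # Stage 1: high-recall vector search for retrieval_k candidates.
    query_embedding = encoder.encode(query).tolist()
    candidates = collection.query(
        query_embeddings=[query_embedding], n_results=retrieval_k
    )["documents"][0]
    # Stage 2: high-precision reranking with a cross-encoder.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```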
- Anyscale CLI installed: pip install anyscale
- An Anyscale workspace with:
  - GPU quota (for embeddings and LLM)
  - Cluster storage access (for ChromaDB)
- uv package manager: curl -LsSf https://astral.sh/uv/install.sh | sh
You can deploy this system in three ways:
- Anyscale Job + Services (Recommended) - Job for ingestion, persistent services
- Workspace + Services - Manual ingestion in workspace, persistent services
- Local Development - Everything runs in your workspace session
This is the recommended approach for production deployments. You don't need an active workspace!
Submit the ingestion job (no workspace needed):
anyscale job submit -f services/ingestion_job.yaml

This job will:
- Download EPUBs from S3
- Generate embeddings (GPU-accelerated)
- Create ChromaDB vector store
- Automatically upload to $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz
The job uses Anyscale's built-in ANYSCALE_ARTIFACT_STORAGE environment variable (automatically set).
Monitor the job:
anyscale job logs <job-id> --follow

When complete, the job will print the S3 path where the vector store was uploaded.
Once ingestion completes, deploy the services from anywhere (no workspace needed):
# 1. Deploy LLM service
anyscale service deploy -f services/llm_service.yaml
# 2. Get LLM service URL and update services/rag_service.yaml:
# env_vars:
# LLM_SERVICE_BASE_URL: "https://your-llm-url.com/v1"
# LLM_API_KEY: "your-token"
# VECTOR_STORE_S3_PATH: "s3://your-bucket/artifacts/vector_store.tar.gz"
# 3. Deploy RAG service (downloads vector store from S3 at startup)
anyscale service deploy -f services/rag_service.yaml

Key Benefits:
- ✅ No workspace needed - run from anywhere
- ✅ Automatic S3 upload/download
- ✅ Isolated ingestion job (auto-terminates when done)
- ✅ Services download vector store from S3
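What the startup download roughly looks like (a hedged sketch; the env var name mirrors services/rag_service.yaml, and the actual logic in the service may differ):

```python
# Hedged sketch of the startup step that fetches the vector store from S3.
# The env var name mirrors services/rag_service.yaml; the exact logic in the
# deployed service may differ.
import os
import subprocess
import tarfile


def fetch_vector_store(local_dir: str = "/mnt/cluster_storage") -> str:
    s3_path = os.environ["VECTOR_STORE_S3_PATH"]  # e.g. $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz
    tarball = os.path.join(local_dir, "vector_store.tar.gz")
    subprocess.run(["aws", "s3", "cp", s3_path, tarball], check=True)
    with tarfile.open(tarball) as tar:
        tar.extractall(path=local_dir)
    return os.path.join(local_dir, "vector_store")
```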
Run ingestion in a workspace, then deploy as services.
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install project dependencies
uv sync

From your Anyscale workspace:
# Run ingestion (automatically uploads to S3)
uv run main.py ingest

This creates the ChromaDB vector store and uploads it to $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz.
Note: ANYSCALE_ARTIFACT_STORAGE is automatically set in Anyscale workspaces.
Use the automated deployment script (handles S3 upload automatically):
# Deploy both LLM and RAG services
./scripts/deploy_anyscale.sh both

The script will automatically:
- Deploy the LLM service
- Upload the vector store to S3 ($ANYSCALE_ARTIFACT_STORAGE)
- Deploy the RAG service (which downloads from S3 at startup)
Or deploy individually:
# Deploy LLM service only
./scripts/deploy_anyscale.sh llm
# Get LLM service credentials from Anyscale console and update services/rag_service.yaml
# Deploy RAG service (uploads vector store to S3 first)
./scripts/deploy_anyscale.sh rag

Get your service credentials from the Anyscale console, then:
# Option A: Pass credentials as command-line arguments
uv run main.py query "What is the main theme?" \
--service-url https://YOUR_RAG_SERVICE_URL/rag \
--auth-token YOUR_SERVICE_TOKEN \
--auth-version YOUR_SERVICE_VERSION
# Option B: Set environment variables (recommended)
export RAG_SERVICE_URL="https://YOUR_RAG_SERVICE_URL/rag"
export ANYSCALE_SERVICE_TOKEN="YOUR_SERVICE_TOKEN"
export ANYSCALE_SERVICE_VERSION="YOUR_SERVICE_VERSION"
# Option C: Use a .env file (even better!)
cp .env.example .env
# Edit .env with your credentials, then:
uv run main.py query "What is the main theme?"
# Then query without repeating credentials
uv run main.py query "What is the main theme?"
# Interactive mode
uv run main.py query --interactive
# With more context
uv run main.py query "What is the main theme?" --top-k 20

Environment Variables:
- RAG_SERVICE_URL: Your deployed RAG service URL (default: http://localhost:8000/rag)
- ANYSCALE_SERVICE_TOKEN: Bearer token for authentication
- ANYSCALE_SERVICE_VERSION: Service version header
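If you want to call the service directly instead of going through main.py, here is a minimal standard-library sketch of the same request (the /answer route and query parameters mirror the curl examples further down; the JSON response shape is an assumption):

```python
# Minimal stdlib sketch of the HTTP call main.py makes. The /answer route and
# query parameters mirror the curl examples later in this README.
import json
import os
import urllib.parse
import urllib.request

base_url = os.environ["RAG_SERVICE_URL"]  # e.g. https://.../rag
params = urllib.parse.urlencode(
    {"query": "What is the main theme?", "top_k": 15, "retrieval_k": 45, "use_reranking": "true"}
)
request = urllib.request.Request(
    f"{base_url}/answer?{params}",
    headers={
        "Authorization": f"Bearer {os.environ['ANYSCALE_SERVICE_TOKEN']}",
        "X-ANYSCALE-VERSION": os.environ["ANYSCALE_SERVICE_VERSION"],
    },
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))  # response shape assumed to be JSON
```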
Note: The first query after a fresh deployment triggers the RAG service to retrieve the embeddings. This takes a few minutes, so the query will likely time out on the client side. Subsequent queries should work without issue.
Tip: Copy .env.example to .env and configure your credentials there. The script will automatically load them.
For development and testing, run services locally in your workspace:
anyscale service deploy -f services/llm_service.yaml

Get the service URL and add it to .env:
echo "LLM_SERVICE_BASE_URL=https://your-llm-service-url.com/v1" > .env
echo "LLM_API_KEY=your-token" >> .env# Skip S3 upload for local development
uv run main.py ingest --no-upload

uv run main.py deploy

# Single query
uv run main.py query "What is the main theme of the books?"
# Interactive mode
uv run main.py query --interactive

This uses the local RAG service at http://localhost:8000/rag.
Use the deployment script:
# Deploy both services
./scripts/deploy_anyscale.sh both
# Or deploy individually
./scripts/deploy_anyscale.sh llm # Deploy LLM only
./scripts/deploy_anyscale.sh rag   # Deploy RAG only

anyscale service deploy -f services/llm_service.yaml

Wait 5-10 minutes for the service to:
- Download Mistral-7B model (~14GB)
- Load model into GPU memory
- Initialize vLLM engine
Get service credentials:
anyscale service list

Test the LLM service:
curl -H "Authorization: Bearer YOUR_TOKEN" \
-H "X-ANYSCALE-VERSION: YOUR_VERSION" \
-H "Content-Type: application/json" \
https://YOUR_SERVICE_URL/v1/chat/completions \
-d '{"messages": [{"role": "user", "content": "Hello!"}], "model": "mistral-7b-instruct", "max_tokens": 50}'Before deploying RAG service, ingest your data:
# From your workspace
uv run main.py ingest
# Or with custom configuration
uv run main.py ingest --s3-path s3://your-bucket/path --chunk-size 4096

This creates the ChromaDB vector store at /mnt/cluster_storage/vector_store.
Verify ingestion:
ls -lh /mnt/cluster_storage/vector_store

Upload the vector store to S3:
The vector store needs to be uploaded to S3 so the Anyscale service can access it:
# Create tarball
cd /mnt/cluster_storage
tar -czf vector_store.tar.gz vector_store/
# Upload to Anyscale artifact storage
aws s3 cp vector_store.tar.gz $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz

The RAG service will automatically download and extract this at startup.
Update services/rag_service.yaml with your LLM service credentials:
env_vars:
LLM_SERVICE_BASE_URL: "https://your-llm-service-url.com/v1"
LLM_API_KEY: "your-token"
LLM_MODEL: "mistral-7b-instruct"Alternatively, set these in .env file:
LLM_SERVICE_BASE_URL=https://your-llm-service-url.com/v1
LLM_API_KEY=your-token
LLM_MODEL=mistral-7b-instruct

anyscale service deploy -f services/rag_service.yaml

This takes 2-3 minutes to:
- Download embedding model
- Initialize ChromaDB connection
- Connect to LLM service
- Start autoscaling replicas
Get service URL:
anyscale service list

Health check:
curl -H "Authorization: Bearer YOUR_RAG_TOKEN" \
-H "X-ANYSCALE-VERSION: YOUR_RAG_VERSION" \
https://YOUR_RAG_SERVICE_URL/rag/

Query the RAG service:
uv run main.py query "What is the main theme?" --service-url https://YOUR_RAG_SERVICE_URL/rag --top-k 15

anyscale service list
anyscale service status rag-qa-service
anyscale service status rag-llm-service

# RAG service logs
anyscale service logs rag-qa-service
# LLM service logs
anyscale service logs rag-llm-service
# Follow logs in real-time
anyscale service logs rag-qa-service --follow

After making code changes:
# Update RAG service
anyscale service deploy -f services/rag_service.yaml
# Or force rollout
anyscale service rollout rag-qa-service

Adjust the autoscaling configuration:
# In deploy_rag.py, update deployment decorators:
@serve.deployment(
autoscaling_config={"min_replicas": 2, "max_replicas": 10},
)

Then redeploy.
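For orientation when editing deploy_rag.py, here is a rough sketch of how deployments can be declared and composed with Ray Serve (class and method names follow the architecture diagram and may not match the actual code):

```python
# Rough sketch of Ray Serve composition; class and method names follow the
# architecture diagram and may not match deploy_rag.py exactly.
from fastapi import FastAPI
from ray import serve

api = FastAPI()


@serve.deployment(
    ray_actor_options={"num_gpus": 0.1},
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
)
class QueryEncoder:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

    def encode(self, query: str) -> list[float]:
        return self.model.encode(query).tolist()


@serve.deployment
@serve.ingress(api)
class QAGateway:
    def __init__(self, encoder):
        self.encoder = encoder  # DeploymentHandle to QueryEncoder

    @api.get("/answer")
    async def answer(self, query: str, top_k: int = 15) -> dict:
        embedding = await self.encoder.encode.remote(query)
        # ...vector search, reranking, and the LLM call would follow here...
        return {"query": query, "top_k": top_k, "embedding_dim": len(embedding)}


app = QAGateway.bind(QueryEncoder.bind())
```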
anyscale service terminate rag-qa-service
anyscale service terminate rag-llm-service

CHUNK_SIZE = 4096       # 2x original size (was 2048)
CHUNK_OVERLAP = 1536    # 37% overlap (was 10%)

Impact:
- Each chunk: ~4,000 characters (2-3 pages of text)
- Overlap: ~1,500 characters shared between adjacent chunks
- Result: Maximum context continuity, minimal information loss at boundaries
# Two-stage retrieval (default)
top_k = 15 # Final chunks after reranking
retrieval_k = 45 # Candidates before reranking (3x top_k)
use_reranking = True    # Enable cross-encoder reranking

Impact with Two-Stage Retrieval:
- Stage 1 (Vector Search): Retrieves 45 candidates (high recall)
- Stage 2 (Reranking): Selects 15 most relevant (high precision)
- Final context: ~60,000 characters (15 chunks × 4096 chars)
- Equivalent to: ~30 pages of book text per query
- Better quality than single-stage retrieval with same context size
- Users can increase to top_k=20-30 with the larger 32K context window
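Rough token budgeting behind these numbers, assuming the common ~4 characters per token heuristic:

```python
# Back-of-the-envelope context budgeting (~4 characters per token heuristic).
CHUNK_SIZE = 4096
CONTEXT_WINDOW = 32_768  # Mistral-7B-Instruct-v0.3

for top_k in (15, 20, 30):
    chars = top_k * CHUNK_SIZE
    tokens = chars // 4
    headroom = CONTEXT_WINDOW - tokens
    print(f"top_k={top_k}: ~{chars:,} chars ≈ {tokens:,} tokens, ~{headroom:,} tokens left for prompt + answer")

# top_k=15 → ~61,440 chars ≈ 15,360 tokens (comfortable)
# top_k=30 → ~122,880 chars ≈ 30,720 tokens (near the 32K limit)
```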
| Metric | Original | Before Reranking | Current (with Reranking) |
|---|---|---|---|
| Chunk size | 2,048 | 4,096 | 4,096 |
| Overlap % | 10% | 37% | 37% |
| Retrieval method | Single-stage | Single-stage | Two-stage (retrieve + rerank) |
| Candidates retrieved | 3 | 15 | 45 |
| Final top_k | 3 | 15 | 15 |
| Reranking | No | No | Yes (cross-encoder) |
| Default context | ~6K chars | ~60K chars | ~60K chars |
| LLM context window | 8K tokens | 8K tokens | 32K tokens (Mistral-v0.3) |
| Max usable top_k | ~3-5 | ~10-15 | 20-30+ |
| Quality | Baseline | Better | Best |
The RAG service requires an external LLM service. You have two options:
Option 1: Anyscale LLM Service (Recommended)
Deploy the included Mistral-7B-Instruct-v0.3 service with 32K context window:
anyscale service deploy -f services/llm_service.yaml

This upgraded model (v0.3) provides:
- 4x larger context than v0.1 (32K vs 8K tokens)
- Support for higher top_k values (20-30+ chunks)
- Better handling of long contexts
Then configure the service URL in .env:
echo "LLM_SERVICE_BASE_URL=https://your-service-url.anyscale.com" > .envOption 2: OpenAI
Use OpenAI's API instead:
echo "OPENAI_API_KEY=sk-your-key" > .env
echo "LLM_MODEL=gpt-3.5-turbo" >> .env # 16K context
# Or use gpt-4-turbo for 128K context

Edit data_pipeline.py constants or use command-line arguments:
S3_BUCKET_PATH = "s3://your-bucket"
EMBEDDER_MODEL = "intfloat/multilingual-e5-large-instruct"
CHUNK_SIZE = 4096
CHUNK_OVERLAP = 1536

uv run python main.py ingest --chunk-size 4096

uv run python main.py ingest --embedding-model BAAI/bge-large-en-v1.5

# Default: retrieve 45 candidates, rerank to top 15
uv run python main.py query "What happened?" --top-k 15# Retrieve 60 candidates, rerank to top 20 (uses more context)
uv run python main.py query "Who is Rand?" --top-k 20 --retrieval-k 60
# Disable reranking for faster (but lower quality) responses
uv run python main.py query "Quick test" --no-reranking --top-k 10# Two-stage retrieval via HTTP
curl "http://localhost:8000/rag/answer?query=What%20is%20magic?&top_k=15&retrieval_k=45&use_reranking=true"
# Single-stage retrieval (faster, lower quality)
curl "http://localhost:8000/rag/answer?query=Quick%20test&top_k=10&use_reranking=false"The RAG system has been optimized from the original baseline configuration:
Before:
- Chunk size: 2048 characters
- Overlap: 200 characters (~10%)
- Total chunks: 54,964
After:
- Chunk size: 4096 characters (+100% larger)
- Overlap: 1536 characters (~37%)
- Total chunks: ~9,000-11,000 (fewer but richer chunks)
Benefits:
- Each chunk contains significantly more context
- 37% overlap ensures important information isn't split
- Adjacent chunks share substantial context for continuity
- Better chance of capturing complete thoughts/descriptions
Before:
- Default top_k: 3 chunks
- Average context: ~6,000 characters
After:
- Default top_k: 15 chunks
- Average context: ~60,000 characters
Benefits:
- LLM receives 10x more context
- Higher chance of finding descriptive passages
- Can synthesize information across multiple mentions
- Better handles complex/multi-faceted questions
Before (Single-Stage):
- Vector search retrieves top_k=15 chunks directly
- Uses bi-encoder similarity (cosine distance)
- Fast but may miss relevant chunks ranked 16-45
After (Two-Stage):
- Stage 1: Vector search retrieves 45 candidates (3x more)
- Stage 2: Cross-encoder reranks to select top 15 most relevant
- Uses ms-marco-MiniLM-L-6-v2 cross-encoder
Benefits:
- Higher recall: Casts wider net in stage 1 (45 vs 15)
- Higher precision: Cross-encoder is more accurate than bi-encoder
- Better relevance: Cross-encoder encodes query+document together
- Same context size: Still uses 15 chunks, but they're the BEST 15
- Marginal overhead: Reranking 45 chunks takes ~50-100ms
Why cross-encoders are better:
- Bi-encoders (vector search): Encode query and document separately, then compare
- Cross-encoders (reranker): Encode query+document together, can capture interactions
- Cross-encoders are slower but much more accurate for ranking
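A small sketch of the difference, using the models already named in this README (the example document string is illustrative):

```python
# Bi-encoder: embed query and document independently, then compare with cosine.
# Cross-encoder: score the (query, document) pair jointly in one forward pass.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "Who is Tylin?"
doc = "Tylin is the Queen of Altara, ruling from the Tarasin Palace in Ebou Dar."

bi_encoder = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
cosine_score = util.cos_sim(bi_encoder.encode(query), bi_encoder.encode(doc))

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
joint_score = cross_encoder.predict([(query, doc)])[0]
```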
Before:
- Model: Mistral-7B-Instruct-v0.1
- Context window: 8,192 tokens (~32K characters)
- Max usable top_k: ~10-15 chunks
After:
- Model: Mistral-7B-Instruct-v0.3
- Context window: 32,768 tokens (~128K characters)
- Max usable top_k: 20-30+ chunks
Benefits:
- 4x larger context window allows more chunks without overflow
- Can use higher top_k values for complex queries
- Better handling of long contexts
- No quality degradation with increased context
Before:
Given the following context from books:
{composed_context}
Answer the following question:
{query}
If you cannot provide an answer based on the context, please say "I don't know."
Do not use the term "context" in your response.

After:
You are answering questions about The Wheel of Time book series by Robert Jordan.
Context from the books:
{composed_context}
Question: {query}
Instructions:
- Answer based on the provided passages
- Synthesize information across multiple passages if needed
- If the passages mention but don't fully explain something, provide what information is available
- Include relevant details like names, places, relationships when mentioned
- Only say "I don't know" if the passages contain absolutely no relevant information
- Be concise but informative

Benefits:
- Sets context (Wheel of Time series)
- Encourages synthesis across passages
- Allows partial answers when full info unavailable
- Requests relevant details (names, places, relationships)
- Less likely to say "I don't know" inappropriately
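A hedged sketch of how the improved prompt might be assembled and sent to the OpenAI-compatible LLM endpoint with the openai client (env var names mirror services/rag_service.yaml; the exact code in deploy_rag.py may differ):

```python
# Hedged sketch: build the prompt from reranked chunks and call the external
# OpenAI-compatible LLM service. Env var names mirror services/rag_service.yaml.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["LLM_SERVICE_BASE_URL"],  # e.g. https://.../v1
    api_key=os.environ["LLM_API_KEY"],
)


def answer(query: str, chunks: list[str]) -> str:
    composed_context = "\n\n".join(chunks)
    prompt = (
        "You are answering questions about The Wheel of Time book series by Robert Jordan.\n\n"
        f"Context from the books:\n{composed_context}\n\n"
        f"Question: {query}\n\n"
        "Instructions:\n"
        "- Answer based on the provided passages\n"
        "- Synthesize information across multiple passages if needed\n"
        "- Only say \"I don't know\" if the passages contain absolutely no relevant information\n"
        "- Be concise but informative"
    )
    response = client.chat.completions.create(
        model=os.environ.get("LLM_MODEL", "mistral-7b-instruct"),
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return response.choices[0].message.content
```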
# Default: retrieve 45 candidates, rerank to top 15
uv run main.py query "Who is Rand?" --service-url YOUR_SERVICE_URL
# Or with explicit parameters
uv run main.py query "Who is Rand?" --service-url YOUR_SERVICE_URL --top-k 15 --retrieval-k 45# Use 20-25 chunks for comprehensive coverage
# Retrieves 60-75 candidates, reranks to top 20-25
uv run main.py query "Describe Rand's character arc" --service-url YOUR_SERVICE_URL --top-k 25 --retrieval-k 75# Maximum context with 32K context window
# Retrieves 90 candidates, reranks to top 30
uv run main.py query "Compare Rand and Mat's leadership styles" --service-url YOUR_SERVICE_URL --top-k 30 --retrieval-k 90# Disable reranking for faster responses (lower quality)
uv run main.py query "Quick answer" --service-url YOUR_SERVICE_URL --no-reranking --top-k 10Query: "Who is Tylin?"
Retrieved chunks: 3 small chunks (2048 chars each) mentioning Tylin
LLM response: "I don't know."
Why it failed:
- Chunks mentioned Tylin but didn't describe her
- Too few chunks to synthesize information
- Prompt demanded certainty
- Small context window limited retrieval
Query: "Who is Tylin?"
Retrieval process:
- Vector search retrieves 45 candidates (larger search space)
- Cross-encoder reranks to select 15 most relevant chunks
- Each chunk is 4096 chars with 37% overlap
LLM response: Should provide detailed information about Tylin from multiple contexts
Why it works better:
- Two-stage retrieval: Higher chance of finding relevant chunks (45 candidates vs 3)
- Reranking: Cross-encoder selects BEST 15, not just closest 15
- Larger chunks: More complete context per chunk (4096 vs 2048)
- More overlap: Better narrative continuity (37% vs 10%)
- Larger LLM context: Can handle 30+ chunks if needed (32K vs 8K)
- Better prompt: Synthesizes partial information
| Metric | Baseline | After Chunking | After Reranking | With Larger Model |
|---|---|---|---|---|
| Chunks retrieved | 3 | 15 | 45 → 15 | 45 → 30+ |
| Chunk size | 2048 | 4096 | 4096 | 4096 |
| Retrieval method | Single-stage | Single-stage | Two-stage | Two-stage |
| Reranking | No | No | Yes | Yes |
| LLM context | 8K tokens | 8K tokens | 32K tokens | 32K tokens |
| Answer quality | Poor | Good | Better | Best |
| Latency | Fast | Fast | +50-100ms | +50-100ms |
- Download 15 EPUB books from S3
- Parse ~8,000 pages of text
- Chunk into ~9,000-11,000 chunks (fewer but larger)
- Generate 1024-dim embeddings with multilingual-e5-large-instruct
- Store in ChromaDB with metadata (book name, page, source)
- Processing time: ~15-20 minutes
- GPU usage: ~10-15 minutes for embedding generation
- ChromaDB size: ~500-600 MB
- Total chunks: ~9,000-11,000
Potential improvements for even better performance:
- Hybrid search: Combine vector search + keyword search (BM25 + dense retrieval)
- Query expansion: Automatically expand queries for better retrieval
- Metadata filtering: Filter by specific books, characters, or topics
- Iterative retrieval: Retrieve, then retrieve again based on initial results
- Larger LLM: Use more capable model (Llama-3.1-70B, GPT-4, Claude)
- Fine-tuned embeddings: Train embeddings specifically for Wheel of Time
- Graph RAG: Build knowledge graph of character relationships and plot events
- Advanced reranking: Use larger cross-encoders (e.g., ms-marco-electra-base) or domain-specific rerankers
Already Implemented:
- ✅ Two-stage retrieval with cross-encoder reranking
- ✅ Larger context window (Mistral-7B-v0.3 with 32K tokens)
- Use fractional GPUs: ray_actor_options={"num_gpus": 0.1}
- Reduce autoscaling: autoscaling_config={"min_replicas": 1, "max_replicas": 2}
- Terminate services when not in use: anyscale service terminate rag-qa-service
- Use smaller models:
  - Consider Mistral-7B instead of Llama-70B
  - Use quantized models (GPTQ, AWQ)
anyscale service metrics rag-qa-service
anyscale service metrics rag-llm-service

Run data ingestion first:
uv run python main.py ingest

The RAG service requires an external LLM service to be configured:
- Deploy the Anyscale LLM service (recommended):
  anyscale service deploy -f services/llm_service.yaml
  Then add the service URL to .env:
  echo "LLM_SERVICE_BASE_URL=https://your-service-url.anyscale.com" > .env
- Or use OpenAI:
  echo "OPENAI_API_KEY=sk-your-key" > .env

Make sure the LLM service is deployed and configured before starting the RAG service.
Validate environment and test the service:
# Validate environment
uv run main.py validate
# Test a query
uv run main.py query "test" --service-url http://localhost:8000/ragService not responding:
- Check logs: anyscale service logs rag-llm-service
- Look for OOM (Out of Memory) errors
- Verify GPU quota and node availability
Model download timeout:
- Service will retry automatically
- Check network connectivity to HuggingFace
- May need to increase timeout in service config
Deployment errors:
- Verify you're using the correct image: anyscale/ray-llm:2.52.1-py311-cu128
- Ensure serve_llm.py uses the correct LLMConfig API format (see the sketch after this list) with:
  - model_loading_config dict containing model_id and model_source
  - accelerator_type (not accelerator_type_or_device)
  - deployment_config with nested autoscaling_config
  - engine_kwargs (not engine_config)
  - build_openai_app({"llm_configs": [llm_config]}) format
- Check that serve_llm.py uses dict() not {} for config parameters
- Ensure the import path is serve_llm:app (not serve_llm:llm_app)
- Check Anyscale console logs for specific errors
"LLM service not configured" error:
- Verify LLM_SERVICE_BASE_URL is set in services/rag_service.yaml
- Test the LLM service independently first
- Check authentication credentials
"Collection not found" error:
- Verify data ingestion completed: ls /mnt/cluster_storage/vector_store
- Verify the vector store was uploaded to S3: check that $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz exists
- Check that VECTOR_STORE_S3_PATH in services/rag_service.yaml points to the correct S3 location
- Re-run ingestion and upload if needed:
  uv run main.py ingest
  cd /mnt/cluster_storage && tar -czf vector_store.tar.gz vector_store/
  aws s3 cp vector_store.tar.gz $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz
Slow query responses:
- Check if LLM service is under load
- Increase max_replicas for autoscaling
- Monitor with: anyscale service metrics rag-qa-service
Encoding errors:
- Verify embedding model matches data ingestion
- Check GPU availability for QueryEncoder deployment
- Review logs for OOM errors
"Prefix / is being used" error:
- Services have conflicting route prefixes
- Use /rag for the RAG service, / for the LLM service
- Update route_prefix in the YAML files
Resource quota exceeded:
- Check workspace quota: anyscale workspace quota
- Reduce max_replicas in the autoscaling config
- Use fractional GPUs (0.1 GPU) where possible
GPU memory errors during ingestion:
- Reduce the batch size in data_pipeline.py:
  EMBED_BATCH_SIZE = 400  # Reduce from 800
Import errors:
- Ensure dependencies are installed: uv sync
"ModuleNotFoundError" in Ray workers:
- If you see errors like ModuleNotFoundError: No module named 'chromadb' during data pipeline execution, ensure dependencies are installed: uv sync
- For Ray clusters, you may need to specify a runtime environment (see the sketch below)
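A minimal sketch of specifying such a runtime environment (the package pins here are illustrative):

```python
# Hedged sketch: make the missing packages available on Ray workers via a
# runtime environment. The package list is illustrative.
import ray

ray.init(
    runtime_env={
        "pip": ["chromadb>=1.3.5", "sentence-transformers>=5.1.2", "ebooklib>=0.20"],
    }
)
```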
"AttributeError: module 'pyarrow' has no attribute 'Table'":
- This indicates a corrupted PyArrow installation. Fix it by reinstalling:
uv sync --reinstall-package pyarrow
- Verify the fix:
uv run python -c "import pyarrow as pa; print(pa.Table)"
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv sync

RAG Service (must match your cluster versions):
- ray[serve]==2.52.1 (must match Anyscale cluster)
- pyarrow==19.0.1 (required for Anyscale)
- chromadb>=1.3.5
- sentence-transformers>=5.1.2
- fastapi>=0.115.0
- openai>=1.58.0
- ebooklib>=0.20
- langchain-text-splitters>=1.0.0
LLM Service (uses pre-built Anyscale image):
- Uses the anyscale/ray-llm:2.52.1-py311-cu128 image
- Ray Serve LLM and vLLM are pre-installed with compatible versions
- No manual dependency installation required
See pyproject.toml for complete RAG service dependencies.
uv run python main.py test

.
├── main.py # Main orchestrator CLI
├── README.md # This file
├── pyproject.toml # Project dependencies
├── .env.example # Environment variable template
│
├── src/ # Main application code
│ ├── data_pipeline.py # Data ingestion (EPUB → ChromaDB)
│ └── deploy_rag.py # RAG service deployment
│
├── services/ # Service deployment configurations
│ ├── serve_llm.py # LLM service definition
│ ├── llm_service.yaml # Anyscale LLM service config
│ └── rag_service.yaml # Anyscale RAG service config
│
├── scripts/ # Utility scripts
│ ├── deploy_anyscale.sh # Automated deployment script
│ ├── inspect_chromadb.py # ChromaDB inspection tool
│ ├── test_search.py # Search testing tool
│ └── debug_llm_response.py # LLM response debugging
│
├── tests/ # Test files
│ └── test_deployment.py # Service testing
│
├── utils/ # Utility modules
│ └── init_logger.py # Logging configuration
│
└── examples/ # Example code
└── actors.py # Ray actors example

Before going to production:
- Load test your services
- Set appropriate autoscaling limits
- Configure monitoring and alerting
- Set up log aggregation
- Document service URLs and credentials
- Test failover scenarios
- Implement rate limiting
- Add authentication/authorization
- Configure CORS if needed for web access
- Set up CI/CD for automated deployments
- Back up your ChromaDB data
- Document disaster recovery procedures
MIT
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request