Ray RAG service for Wheel of Time

A demo for my talk at PyData Eindhoven 2025: a Retrieval-Augmented Generation (RAG) system that ingests EPUB files from S3, creates vector embeddings, and deploys a question-answering service using Ray Serve on Anyscale.

Based on the Ray e2e-rag example.

Overview

This project processes EPUB books (e.g., Wheel of Time series) and enables semantic question-answering using:

  • Ray Data for distributed data processing
  • ChromaDB for vector storage
  • Sentence Transformers for embeddings
  • Anyscale LLM Service for text generation (deployed separately)
  • Ray Serve for scalable microservices architecture

Architecture

System Overview

                    ┌─────────────────────────────┐
                    │  Anyscale LLM Service       │
                    │  (Mistral-7B-v0.3 + vLLM)  │
                    │  32K context window         │
                    │  Persistent, External       │
                    └────────────┬────────────────┘
                                 │ HTTP/OpenAI API
                    ┌────────────▼────────────────┐
User ──────────────►│     QAGateway (8000)        │
                    │    (FastAPI)                │
                    └────────────┬────────────────┘
                                 │
                    ┌────────────▼────────────────┐
                    │       QA Service             │
                    │   (orchestrates RAG)         │
                    └──────┬──────────────┬────────┘
                           │              │
                    ┌──────▼──────────┐   │
                    │    Retriever    │   │
                    │ (two-stage)     │   │
                    └──────┬──────────┘   │
                           │              │
           ┌───────────────┼──────────────┼─────┐
           │               │              │     │
    ┌──────▼─────┐  ┌─────▼─────┐  ┌────▼────┐│
    │QueryEncoder│  │  Vector   │  │Reranker ││
    │   (GPU)    │  │  Store    │  │ (GPU)   ││
    │            │  │ (ChromaDB)│  │Cross-Enc││
    └────────────┘  └───────────┘  └─────────┘│
                                               │
                                    ┌──────────▼──────┐
                                    │   LLMClient     │
                                    │(calls external) │
                                    └─────────────────┘

Two-Stage Retrieval Flow:
1. QueryEncoder: Embed query → vector
2. VectorStore: Retrieve 45 candidates (high recall)
3. Reranker: Rerank to top 15 (high precision)
4. LLMClient: Generate answer from reranked context
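
In plain Python, the same flow looks roughly like the sketch below (model names are taken from this README; the collection name and the exact calls in src/deploy_rag.py are assumptions):

# Minimal two-stage retrieval sketch, outside Ray Serve.
import chromadb
from sentence_transformers import SentenceTransformer, CrossEncoder

encoder = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
client = chromadb.PersistentClient("/mnt/cluster_storage/vector_store")
collection = client.get_collection("books")  # collection name is an assumption

def retrieve(query: str, top_k: int = 15, retrieval_k: int = 45) -> list[str]:
    # Stage 1: high-recall vector search over the whole store.
    query_embedding = encoder.encode(query).tolist()
    candidates = collection.query(query_embeddings=[query_embedding], n_results=retrieval_k)["documents"][0]
    # Stage 2: high-precision reranking with a cross-encoder.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]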

Data Pipeline

S3 EPUBs → Parse → Chunk → Embed (GPU) → ChromaDB

Key components:

  • download_books(): Download from S3
  • parse_epub_content(): Extract text from EPUBs
  • Chunker: Split text into chunks
  • Embedder: Generate embeddings (GPU-accelerated)
  • ChromaWriter: Store in ChromaDB
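
The distributed part of this pipeline is the GPU embedding step. A minimal Ray Data sketch of that step (the real code in src/data_pipeline.py chains the download/parse/chunk stages before it and writes to ChromaDB instead of Parquet):

# Sketch of the GPU embedding stage with Ray Data; upstream stages
# (download_books, parse_epub_content, Chunker) feed rows like these.
import ray
from sentence_transformers import SentenceTransformer

class Embedder:
    """Stateful callable so the model loads once per replica, not per batch."""

    def __init__(self) -> None:
        self.model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

    def __call__(self, batch: dict) -> dict:
        batch["embedding"] = self.model.encode(list(batch["text"]), batch_size=64)
        return batch

# Stand-in for the chunked output of the parse/chunk stages.
chunks = [{"text": f"chunk {i}", "book": "The Eye of the World"} for i in range(1_000)]

ds = ray.data.from_items(chunks).map_batches(
    Embedder,
    batch_size=256,
    concurrency=2,  # two Embedder replicas
    num_gpus=1,     # one GPU per replica; drop this on CPU-only machines
)
ds.write_parquet("/tmp/embedded_chunks")  # the real pipeline hands these to ChromaWriter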

RAG Service Microservices

The system consists of two independent Anyscale services:

  1. LLM Service (services/llm_service.yaml)

    • Runs Mistral-7B-Instruct-v0.3 model (upgraded from v0.1)
    • 32K token context window (4x larger than v0.1's 8K)
    • OpenAI-compatible API at /v1/chat/completions
    • Auto-scales based on load
  2. RAG QA Service (services/rag_service.yaml)

    • QueryEncoder: Encode user queries into embeddings (0.1 GPU per replica)
    • VectorStore: Query ChromaDB for relevant document chunks
    • Reranker: Rerank candidates using cross-encoder for precision (0.1 GPU per replica) [NEW]
    • Retriever: Orchestrate two-stage retrieval (vector search → reranking)
    • LLMClient: Call external Anyscale LLM service (or OpenAI)
    • QA: Combine retrieval + generation workflow
    • QAGateway (port 8000): FastAPI HTTP interface for user queries

    Two-Stage Retrieval: By default, retrieves 45 candidates with vector search, then reranks to select the top 15 most relevant chunks. This gives better precision than single-stage retrieval while keeping context manageable.
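
    These deployments are composed with Ray Serve behind a FastAPI ingress. A simplified sketch of that composition (the real definitions live in src/deploy_rag.py and include the QueryEncoder, VectorStore, Reranker, and Retriever tiers):

# Simplified Ray Serve composition; a sketch only, not the full deploy_rag.py.
from fastapi import FastAPI
from ray import serve
from ray.serve.handle import DeploymentHandle

app = FastAPI()

@serve.deployment(autoscaling_config={"min_replicas": 1, "max_replicas": 4})
class QA:
    async def answer(self, query: str, top_k: int) -> str:
        # Orchestrate retrieval (encode -> vector search -> rerank) and generation here.
        return f"(answer for {query!r} using top {top_k} chunks)"

@serve.deployment
@serve.ingress(app)
class QAGateway:
    def __init__(self, qa: DeploymentHandle) -> None:
        self.qa = qa

    @app.get("/answer")
    async def answer(self, query: str, top_k: int = 15) -> dict:
        return {"answer": await self.qa.answer.remote(query, top_k)}

# /rag matches the route prefix used by the endpoints elsewhere in this README.
serve.run(QAGateway.bind(QA.bind()), route_prefix="/rag")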

Quick Start

Prerequisites

  • Anyscale CLI installed: pip install anyscale
  • Anyscale workspace with:
    • GPU quota (for embeddings and LLM)
    • Cluster storage access (for ChromaDB)
  • uv package manager: curl -LsSf https://astral.sh/uv/install.sh | sh

Deployment Options

You can deploy this system in three ways:

  1. Anyscale Job + Services (Recommended) - Job for ingestion, persistent services
  2. Workspace + Services - Manual ingestion in workspace, persistent services
  3. Local Development - Everything runs in your workspace session

Option 1: Anyscale Job + Services (Recommended)

This is the recommended approach for production deployments. You don't need an active workspace!

1. Run Ingestion Job

Submit the ingestion job (no workspace needed):

anyscale job submit -f services/ingestion_job.yaml

This job will:

  • Download EPUBs from S3
  • Generate embeddings (GPU-accelerated)
  • Create ChromaDB vector store
  • Automatically upload to $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz

The job uses Anyscale's built-in ANYSCALE_ARTIFACT_STORAGE environment variable (automatically set).

Monitor the job:

anyscale job logs <job-id> --follow

When complete, the job will print the S3 path where the vector store was uploaded.

2. Deploy Services

Once ingestion completes, deploy the services from anywhere (no workspace needed):

# 1. Deploy LLM service
anyscale service deploy -f services/llm_service.yaml

# 2. Get LLM service URL and update services/rag_service.yaml:
#    env_vars:
#      LLM_SERVICE_BASE_URL: "https://your-llm-url.com/v1"
#      LLM_API_KEY: "your-token"
#      VECTOR_STORE_S3_PATH: "s3://your-bucket/artifacts/vector_store.tar.gz"

# 3. Deploy RAG service (downloads vector store from S3 at startup)
anyscale service deploy -f services/rag_service.yaml

Key Benefits:

  • ✅ No workspace needed - run from anywhere
  • ✅ Automatic S3 upload/download
  • ✅ Isolated ingestion job (auto-terminates when done)
  • ✅ Services download vector store from S3

Option 2: Workspace + Services (Manual)

Run ingestion in a workspace, then deploy as services.

0. Install Dependencies

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install project dependencies
uv sync

1. Ingest Data

From your Anyscale workspace:

# Run ingestion (automatically uploads to S3)
uv run main.py ingest

This creates the ChromaDB vector store and uploads it to $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz.

Note: ANYSCALE_ARTIFACT_STORAGE is automatically set in Anyscale workspaces.

2. Deploy Services

Use the automated deployment script (handles S3 upload automatically):

# Deploy both LLM and RAG services
./scripts/deploy_anyscale.sh both

The script will automatically:

  1. Deploy the LLM service
  2. Upload the vector store to S3 ($ANYSCALE_ARTIFACT_STORAGE)
  3. Deploy the RAG service (which downloads from S3 at startup)

Or deploy individually:

# Deploy LLM service only
./scripts/deploy_anyscale.sh llm

# Get LLM service credentials from Anyscale console and update services/rag_service.yaml

# Deploy RAG service (uploads vector store to S3 first)
./scripts/deploy_anyscale.sh rag

3. Query the Service

Get your service credentials from Anyscale console, then:

# Option A: Pass credentials as command-line arguments
uv run main.py query "What is the main theme?" \
  --service-url https://YOUR_RAG_SERVICE_URL/rag \
  --auth-token YOUR_SERVICE_TOKEN \
  --auth-version YOUR_SERVICE_VERSION

# Option B: Set environment variables (recommended)
export RAG_SERVICE_URL="https://YOUR_RAG_SERVICE_URL/rag"
export ANYSCALE_SERVICE_TOKEN="YOUR_SERVICE_TOKEN"
export ANYSCALE_SERVICE_VERSION="YOUR_SERVICE_VERSION"

# Option C: Use a .env file (even better!)
cp .env.example .env
# Edit .env with your credentials

# With Option B or C, query without repeating credentials
uv run main.py query "What is the main theme?"

# Interactive mode
uv run main.py query --interactive

# With more context
uv run main.py query "What is the main theme?" --top-k 20

Environment Variables:

  • RAG_SERVICE_URL: Your deployed RAG service URL (default: http://localhost:8000/rag)
  • ANYSCALE_SERVICE_TOKEN: Bearer token for authentication
  • ANYSCALE_SERVICE_VERSION: Service version header

Note: The first query after a fresh deployment triggers the RAG service to load the embeddings. This takes a few minutes, so that query is likely to time out on the client side. Subsequent queries should work without issue.

Tip: Copy .env.example to .env and configure your credentials there. The script will automatically load them.
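
You can also hit the HTTP endpoint directly instead of going through main.py; a minimal sketch with requests, using the /answer path and query parameters shown in the curl examples later in this README:

# Query the deployed RAG service over HTTP, reusing the env vars above.
import os
import requests

base_url = os.environ["RAG_SERVICE_URL"]  # e.g. https://YOUR_RAG_SERVICE_URL/rag
headers = {
    "Authorization": f"Bearer {os.environ['ANYSCALE_SERVICE_TOKEN']}",
    "X-ANYSCALE-VERSION": os.environ.get("ANYSCALE_SERVICE_VERSION", ""),
}
params = {"query": "What is the main theme?", "top_k": 15, "retrieval_k": 45, "use_reranking": "true"}

response = requests.get(f"{base_url}/answer", headers=headers, params=params, timeout=300)
response.raise_for_status()
print(response.json())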

Option 3: Local Development

For development and testing, run services locally in your workspace:

1. Deploy LLM Service (Anyscale)

anyscale service deploy -f services/llm_service.yaml

Get service URL and add to .env:

echo "LLM_SERVICE_BASE_URL=https://your-llm-service-url.com/v1" > .env
echo "LLM_API_KEY=your-token" >> .env

2. Ingest Data (Local Only - No S3 Upload)

# Skip S3 upload for local development
uv run main.py ingest --no-upload

3. Deploy RAG Locally

uv run main.py deploy

4. Query (Local)

# Single query
uv run main.py query "What is the main theme of the books?"

# Interactive mode
uv run main.py query --interactive

This uses the local RAG service at http://localhost:8000/rag.

Deployment Guide

Automated Deployment (Recommended)

Use the deployment script:

# Deploy both services
./scripts/deploy_anyscale.sh both

# Or deploy individually
./scripts/deploy_anyscale.sh llm   # Deploy LLM only
./scripts/deploy_anyscale.sh rag   # Deploy RAG only

Manual Deployment

Step 1: Deploy LLM Service

anyscale service deploy -f services/llm_service.yaml

Wait 5-10 minutes for the service to:

  • Download Mistral-7B model (~14GB)
  • Load model into GPU memory
  • Initialize vLLM engine

Get service credentials:

anyscale service list

Test the LLM service:

curl -H "Authorization: Bearer YOUR_TOKEN" \
     -H "X-ANYSCALE-VERSION: YOUR_VERSION" \
     -H "Content-Type: application/json" \
     https://YOUR_SERVICE_URL/v1/chat/completions \
     -d '{"messages": [{"role": "user", "content": "Hello!"}], "model": "mistral-7b-instruct", "max_tokens": 50}'

Step 2: Run Data Ingestion

Before deploying RAG service, ingest your data:

# From your workspace
uv run main.py ingest

# Or with custom configuration
uv run main.py ingest --s3-path s3://your-bucket/path --chunk-size 4096

This creates the ChromaDB vector store at /mnt/cluster_storage/vector_store.

Verify ingestion:

ls -lh /mnt/cluster_storage/vector_store

Upload vector store to S3:

The vector store needs to be uploaded to S3 so the Anyscale service can access it:

# Create tarball
cd /mnt/cluster_storage
tar -czf vector_store.tar.gz vector_store/

# Upload to Anyscale artifact storage
aws s3 cp vector_store.tar.gz $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz

The RAG service will automatically download and extract this at startup.
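
For reference, the download-and-extract step the service performs at startup is roughly equivalent to this sketch (the actual startup code may differ):

# Roughly what the RAG service does at startup: fetch the tarball from S3
# and unpack it where ChromaDB expects it.
import os
import tarfile
from urllib.parse import urlparse

import boto3

s3_path = os.environ["VECTOR_STORE_S3_PATH"]  # s3://your-bucket/artifacts/vector_store.tar.gz
parsed = urlparse(s3_path)

local_tarball = "/tmp/vector_store.tar.gz"
boto3.client("s3").download_file(parsed.netloc, parsed.path.lstrip("/"), local_tarball)

with tarfile.open(local_tarball) as tar:
    tar.extractall("/mnt/cluster_storage")  # yields /mnt/cluster_storage/vector_store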

Step 3: Configure RAG Service

Update services/rag_service.yaml with your LLM service credentials:

env_vars:
  LLM_SERVICE_BASE_URL: "https://your-llm-service-url.com/v1"
  LLM_API_KEY: "your-token"
  LLM_MODEL: "mistral-7b-instruct"

Alternatively, set these in .env file:

LLM_SERVICE_BASE_URL=https://your-llm-service-url.com/v1
LLM_API_KEY=your-token
LLM_MODEL=mistral-7b-instruct

Step 4: Deploy RAG Service

anyscale service deploy -f services/rag_service.yaml

This takes 2-3 minutes to:

  • Download embedding model
  • Initialize ChromaDB connection
  • Connect to LLM service
  • Start autoscaling replicas

Get service URL:

anyscale service list

Step 5: Test RAG Service

Health check:

curl -H "Authorization: Bearer YOUR_RAG_TOKEN" \
     -H "X-ANYSCALE-VERSION: YOUR_RAG_VERSION" \
     https://YOUR_RAG_SERVICE_URL/rag/

Query the RAG:

uv run main.py query "What is the main theme?" --service-url https://YOUR_RAG_SERVICE_URL/rag --top-k 15

Service Management

View Service Status

anyscale service list
anyscale service status rag-qa-service
anyscale service status rag-llm-service

View Logs

# RAG service logs
anyscale service logs rag-qa-service

# LLM service logs
anyscale service logs rag-llm-service

# Follow logs in real-time
anyscale service logs rag-qa-service --follow

Update Service

After making code changes:

# Update RAG service
anyscale service deploy -f services/rag_service.yaml

# Or force rollout
anyscale service rollout rag-qa-service

Scale Service

Edit the autoscaling configuration in the Serve deployment decorators:

# In deploy_rag.py, update deployment decorators:
@serve.deployment(
    autoscaling_config={"min_replicas": 2, "max_replicas": 10},
)

Then redeploy.

Delete Service

anyscale service terminate rag-qa-service
anyscale service terminate rag-llm-service

Configuration

Current Configuration (Optimized for Two-Stage Retrieval)

Chunking Parameters (data_pipeline.py)

CHUNK_SIZE = 4096        # 2x original size (was 2048)
CHUNK_OVERLAP = 1536     # 37% overlap (was 10%)

Impact:

  • Each chunk: ~4,000 characters (2-3 pages of text)
  • Overlap: ~1,500 characters shared between adjacent chunks
  • Result: Maximum context continuity, minimal information loss at boundaries
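
These constants map directly onto the text splitter settings; a minimal sketch, assuming the RecursiveCharacterTextSplitter from langchain-text-splitters (already in the dependency list; the exact splitter class used in data_pipeline.py may differ):

# How the chunking constants translate into splitter settings.
from langchain_text_splitters import RecursiveCharacterTextSplitter

CHUNK_SIZE = 4096
CHUNK_OVERLAP = 1536

splitter = RecursiveCharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

book_text = " ".join(f"word{i}" for i in range(20_000))  # stand-in for one parsed EPUB section
chunks = splitter.split_text(book_text)
print(len(chunks), len(chunks[0]))  # each chunk is at most 4096 characters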

Retrieval Parameters (deploy_rag.py)

# Two-stage retrieval (default)
top_k = 15           # Final chunks after reranking
retrieval_k = 45     # Candidates before reranking (3x top_k)
use_reranking = True # Enable cross-encoder reranking

Impact with Two-Stage Retrieval:

  • Stage 1 (Vector Search): Retrieves 45 candidates (high recall)
  • Stage 2 (Reranking): Selects 15 most relevant (high precision)
  • Final context: ~60,000 characters (15 chunks × 4096 chars)
  • Equivalent to: ~30 pages of book text per query
  • Better quality than single-stage retrieval with same context size
  • Users can increase to top_k=20-30 with the larger 32K context window

Configuration Comparison

Metric                 Original       Before Reranking   Current (with Reranking)
Chunk size             2,048          4,096              4,096
Overlap %              10%            37%                37%
Retrieval method       Single-stage   Single-stage       Two-stage (retrieve + rerank)
Candidates retrieved   3              15                 45
Final top_k            3              15                 15
Reranking              No             No                 Yes (cross-encoder)
Default context        ~6K chars      ~60K chars         ~60K chars
LLM context window     8K tokens      8K tokens          32K tokens (Mistral-v0.3)
Max usable top_k       ~3-5           ~10-15             20-30+
Quality                Baseline       Better             Best

LLM Service Configuration

The RAG service requires an external LLM service. You have two options:

Option 1: Anyscale LLM Service (Recommended)

Deploy the included Mistral-7B-Instruct-v0.3 service with 32K context window:

anyscale service deploy -f services/llm_service.yaml

This upgraded model (v0.3) provides:

  • 4x larger context than v0.1 (32K vs 8K tokens)
  • Support for higher top_k values (20-30+ chunks)
  • Better handling of long contexts

Then configure the service URL in .env:

echo "LLM_SERVICE_BASE_URL=https://your-service-url.anyscale.com" > .env

Option 2: OpenAI

Use OpenAI's API instead:

echo "OPENAI_API_KEY=sk-your-key" > .env
echo "LLM_MODEL=gpt-3.5-turbo" >> .env  # 16K context
# Or use gpt-4-turbo for 128K context
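
Either way, the LLMClient deployment talks to the backend through the OpenAI-compatible chat completions API; conceptually something like this sketch (environment variable names follow the ones used in this README):

# Both backends expose the OpenAI chat completions API.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("LLM_SERVICE_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.environ.get("LLM_API_KEY") or os.environ.get("OPENAI_API_KEY", ""),
)

completion = client.chat.completions.create(
    model=os.environ.get("LLM_MODEL", "mistral-7b-instruct"),
    messages=[{"role": "user", "content": "Who is Rand al'Thor?"}],
    max_tokens=200,
)
print(completion.choices[0].message.content)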

Data Pipeline Configuration

Edit data_pipeline.py constants or use command-line arguments:

S3_BUCKET_PATH = "s3://your-bucket"
EMBEDDER_MODEL = "intfloat/multilingual-e5-large-instruct"
CHUNK_SIZE = 4096
CHUNK_OVERLAP = 1536

Custom Configuration Examples

Custom Chunk Size

uv run python main.py ingest --chunk-size 4096

Different Embedding Model

uv run python main.py ingest --embedding-model BAAI/bge-large-en-v1.5

Query with Two-Stage Retrieval (Default)

# Default: retrieve 45 candidates, rerank to top 15
uv run python main.py query "What happened?" --top-k 15

Query with Custom Retrieval Settings

# Retrieve 60 candidates, rerank to top 20 (uses more context)
uv run python main.py query "Who is Rand?" --top-k 20 --retrieval-k 60

# Disable reranking for faster (but lower quality) responses
uv run python main.py query "Quick test" --no-reranking --top-k 10

Direct API Usage with Reranking

# Two-stage retrieval via HTTP
curl "http://localhost:8000/rag/answer?query=What%20is%20magic?&top_k=15&retrieval_k=45&use_reranking=true"

# Single-stage retrieval (faster, lower quality)
curl "http://localhost:8000/rag/answer?query=Quick%20test&top_k=10&use_reranking=false"

Performance Tuning

Improvements Implemented

The RAG system has been optimized from the original baseline configuration:

1. Enhanced Chunking Strategy

Before:

  • Chunk size: 2048 characters
  • Overlap: 200 characters (~10%)
  • Total chunks: 54,964

After:

  • Chunk size: 4096 characters (+100% larger)
  • Overlap: 1536 characters (~37%)
  • Total chunks: ~9,000-11,000 (fewer but richer chunks)

Benefits:

  • Each chunk contains significantly more context
  • 37% overlap ensures important information isn't split
  • Adjacent chunks share substantial context for continuity
  • Better chance of capturing complete thoughts/descriptions

2. Increased Retrieval Context

Before:

  • Default top_k: 3 chunks
  • Average context: ~6,000 characters

After:

  • Default top_k: 15 chunks
  • Average context: ~60,000 characters

Benefits:

  • LLM receives 10x more context
  • Higher chance of finding descriptive passages
  • Can synthesize information across multiple mentions
  • Better handles complex/multi-faceted questions

3. Two-Stage Retrieval with Reranking

Before (Single-Stage):

  • Vector search retrieves top_k=15 chunks directly
  • Uses bi-encoder similarity (cosine distance)
  • Fast but may miss relevant chunks ranked 16-45

After (Two-Stage):

  • Stage 1: Vector search retrieves 45 candidates (3x more)
  • Stage 2: Cross-encoder reranks to select top 15 most relevant
  • Uses ms-marco-MiniLM-L-6-v2 cross-encoder

Benefits:

  • Higher recall: Casts wider net in stage 1 (45 vs 15)
  • Higher precision: Cross-encoder is more accurate than bi-encoder
  • Better relevance: Cross-encoder encodes query+document together
  • Same context size: Still uses 15 chunks, but they're the BEST 15
  • Marginal overhead: Reranking 45 chunks takes ~50-100ms

Why cross-encoders are better:

  • Bi-encoders (vector search): Encode query and document separately, then compare
  • Cross-encoders (reranker): Encode query+document together, can capture interactions
  • Cross-encoders are slower but much more accurate for ranking
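
The difference is easy to see in code: the bi-encoder scores query and document independently, while the cross-encoder reads them as a single pair (a sketch using the model names mentioned in this README):

# Bi-encoder vs cross-encoder scoring for the same (query, document) pair.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "Who is Tylin?"
document = "Tylin Quintara was the Queen of Altara, holding court in the Tarasin Palace in Ebou Dar."

# Bi-encoder: two independent embeddings, compared by cosine similarity.
bi_encoder = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
bi_score = util.cos_sim(bi_encoder.encode(query), bi_encoder.encode(document))

# Cross-encoder: one forward pass over the pair, capturing query-document interactions.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_score = cross_encoder.predict([(query, document)])[0]

print(float(bi_score), float(cross_score))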

4. Upgraded LLM Model (Mistral-7B-v0.3)

Before:

  • Model: Mistral-7B-Instruct-v0.1
  • Context window: 8,192 tokens (~32K characters)
  • Max usable top_k: ~10-15 chunks

After:

  • Model: Mistral-7B-Instruct-v0.3
  • Context window: 32,768 tokens (~128K characters)
  • Max usable top_k: 20-30+ chunks

Benefits:

  • 4x larger context window allows more chunks without overflow
  • Can use higher top_k values for complex queries
  • Better handling of long contexts
  • No quality degradation with increased context

5. Improved Prompt Engineering

Before:

Given the following context from books:
{composed_context}

Answer the following question:
{query}

If you cannot provide an answer based on the context, please say "I don't know."
Do not use the term "context" in your response.

After:

You are answering questions about The Wheel of Time book series by Robert Jordan.

Context from the books:
{composed_context}

Question: {query}

Instructions:
- Answer based on the provided passages
- Synthesize information across multiple passages if needed
- If the passages mention but don't fully explain something, provide what information is available
- Include relevant details like names, places, relationships when mentioned
- Only say "I don't know" if the passages contain absolutely no relevant information
- Be concise but informative

Benefits:

  • Sets context (Wheel of Time series)
  • Encourages synthesis across passages
  • Allows partial answers when full info unavailable
  • Requests relevant details (names, places, relationships)
  • Less likely to say "I don't know" inappropriately
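
In code, the composed context is just the reranked chunks joined together and substituted into the template above; a minimal sketch (the chunk separator is an assumption):

# Building the final prompt from the reranked chunks, using the template shown above.
PROMPT_TEMPLATE = """You are answering questions about The Wheel of Time book series by Robert Jordan.

Context from the books:
{composed_context}

Question: {query}

Instructions:
- Answer based on the provided passages
...
"""  # remaining instruction lines omitted here; see the full template above

def build_prompt(query: str, chunks: list[str]) -> str:
    composed_context = "\n\n---\n\n".join(chunks)  # separator between chunks is an assumption
    return PROMPT_TEMPLATE.format(composed_context=composed_context, query=query)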

Optimal Usage Patterns

For Simple Factual Questions

# Default: retrieve 45 candidates, rerank to top 15
uv run main.py query "Who is Rand?" --service-url YOUR_SERVICE_URL

# Or with explicit parameters
uv run main.py query "Who is Rand?" --service-url YOUR_SERVICE_URL --top-k 15 --retrieval-k 45

For Complex Character Analysis

# Use 20-25 chunks for comprehensive coverage
# Retrieves 60-75 candidates, reranks to top 20-25
uv run main.py query "Describe Rand's character arc" --service-url YOUR_SERVICE_URL --top-k 25 --retrieval-k 75

For Cross-Book Comparisons

# Maximum context with 32K context window
# Retrieves 90 candidates, reranks to top 30
uv run main.py query "Compare Rand and Mat's leadership styles" --service-url YOUR_SERVICE_URL --top-k 30 --retrieval-k 90

For Speed-Critical Applications

# Disable reranking for faster responses (lower quality)
uv run main.py query "Quick answer" --service-url YOUR_SERVICE_URL --no-reranking --top-k 10

Expected Performance Improvements

Before All Improvements (Baseline)

Query: "Who is Tylin?"

Retrieved chunks: 3 small chunks (2048 chars each) mentioning Tylin
LLM response: "I don't know."

Why it failed:

  • Chunks mentioned Tylin but didn't describe her
  • Too few chunks to synthesize information
  • Prompt demanded certainty
  • Small context window limited retrieval

After All Improvements (Current)

Query: "Who is Tylin?"

Retrieval process:

  1. Vector search retrieves 45 candidates (larger search space)
  2. Cross-encoder reranks to select 15 most relevant chunks
  3. Each chunk is 4096 chars with 37% overlap

LLM response: Should provide detailed information about Tylin from multiple contexts

Why it works better:

  • Two-stage retrieval: Higher chance of finding relevant chunks (45 candidates vs 3)
  • Reranking: Cross-encoder selects BEST 15, not just closest 15
  • Larger chunks: More complete context per chunk (4096 vs 2048)
  • More overlap: Better narrative continuity (37% vs 10%)
  • Larger LLM context: Can handle 30+ chunks if needed (32K vs 8K)
  • Better prompt: Synthesizes partial information

Performance Comparison

Metric             Baseline       After Chunking   After Reranking   With Larger Model
Chunks retrieved   3              15               45 → 15           45 → 30+
Chunk size         2048           4096             4096              4096
Retrieval method   Single-stage   Single-stage     Two-stage         Two-stage
Reranking          No             No               Yes               Yes
LLM context        8K tokens      8K tokens        32K tokens        32K tokens
Answer quality     Poor           Good             Better            Best
Latency            Fast           Fast             +50-100ms         +50-100ms

Data Ingestion Stats

Processing Pipeline

  1. Download 15 EPUB books from S3
  2. Parse ~8,000 pages of text
  3. Chunk into ~9,000-11,000 chunks (fewer but larger)
  4. Generate 1024-dim embeddings with multilingual-e5-large-instruct
  5. Store in ChromaDB with metadata (book name, page, source)

Estimated Resources

  • Processing time: ~15-20 minutes
  • GPU usage: ~10-15 minutes for embedding generation
  • ChromaDB size: ~500-600 MB
  • Total chunks: ~9,000-11,000

Future Enhancements

Potential improvements for even better performance:

  1. Hybrid search: Combine vector search + keyword search (BM25 + dense retrieval)
  2. Query expansion: Automatically expand queries for better retrieval
  3. Metadata filtering: Filter by specific books, characters, or topics
  4. Iterative retrieval: Retrieve, then retrieve again based on initial results
  5. Larger LLM: Use more capable model (Llama-3.1-70B, GPT-4, Claude)
  6. Fine-tuned embeddings: Train embeddings specifically for Wheel of Time
  7. Graph RAG: Build knowledge graph of character relationships and plot events
  8. Advanced reranking: Use larger cross-encoders (e.g., ms-marco-electra-base) or domain-specific rerankers

Already Implemented:

  • Two-stage retrieval with cross-encoder reranking
  • Larger context window (Mistral-7B-v0.3 with 32K tokens)

Cost Optimization

Minimize Costs

  1. Use fractional GPUs:

    ray_actor_options={"num_gpus": 0.1}
  2. Reduce autoscaling:

    autoscaling_config={"min_replicas": 1, "max_replicas": 2}
  3. Terminate when not in use:

    anyscale service terminate rag-qa-service
  4. Use smaller models:

    • Consider Mistral-7B instead of Llama-70B
    • Use quantized models (GPTQ, AWQ)

Monitor Costs

anyscale service metrics rag-qa-service
anyscale service metrics rag-llm-service

Troubleshooting

"ChromaDB collection not found"

Run data ingestion first:

uv run python main.py ingest

"LLM service not configured"

The RAG service requires an external LLM service to be configured:

  1. Deploy the Anyscale LLM service (recommended):

    anyscale service deploy -f services/llm_service.yaml

    Then add the service URL to .env:

    echo "LLM_SERVICE_BASE_URL=https://your-service-url.anyscale.com" > .env
  2. Or use OpenAI:

    echo "OPENAI_API_KEY=sk-your-key" > .env

Make sure the LLM service is deployed and configured before starting the RAG service.

Service not responding

Validate environment and test the service:

# Validate environment
uv run main.py validate

# Test a query
uv run main.py query "test" --service-url http://localhost:8000/rag

LLM Service Issues

Service not responding:

  • Check logs: anyscale service logs rag-llm-service
  • Look for OOM (Out of Memory) errors
  • Verify GPU quota and node availability

Model download timeout:

  • Service will retry automatically
  • Check network connectivity to HuggingFace
  • May need to increase timeout in service config

Deployment errors:

  • Verify you're using the correct image: anyscale/ray-llm:2.52.1-py311-cu128
  • Ensure serve_llm.py uses the correct LLMConfig API format with:
    • model_loading_config dict containing model_id and model_source
    • accelerator_type (not accelerator_type_or_device)
    • deployment_config with nested autoscaling_config
    • engine_kwargs (not engine_config)
    • build_openai_app({"llm_configs": [llm_config]}) format
  • Check that serve_llm.py uses dict() not {} for config parameters
  • Ensure the import path is serve_llm:app (not serve_llm:llm_app)
  • Check Anyscale console logs for specific errors

RAG Service Issues

"LLM service not configured" error:

  • Verify LLM_SERVICE_BASE_URL is set in services/rag_service.yaml
  • Test LLM service independently first
  • Check authentication credentials

"Collection not found" error:

  • Verify data ingestion completed: ls /mnt/cluster_storage/vector_store
  • Verify vector store uploaded to S3: Check $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz exists
  • Check VECTOR_STORE_S3_PATH in services/rag_service.yaml points to the correct S3 location
  • Re-run ingestion and upload if needed:
    uv run main.py ingest
    cd /mnt/cluster_storage && tar -czf vector_store.tar.gz vector_store/
    aws s3 cp vector_store.tar.gz $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz

Slow query responses:

  • Check if LLM service is under load
  • Increase max_replicas for autoscaling
  • Monitor with: anyscale service metrics rag-qa-service

Encoding errors:

  • Verify embedding model matches data ingestion
  • Check GPU availability for QueryEncoder deployment
  • Review logs for OOM errors

General Issues

"Prefix / is being used" error:

  • Services have conflicting route prefixes
  • Use /rag for RAG service, / for LLM service
  • Update route_prefix in YAML files

Resource quota exceeded:

  • Check workspace quota: anyscale workspace quota
  • Reduce max_replicas in autoscaling config
  • Use fractional GPUs (0.1 GPU) where possible

GPU memory errors during ingestion:

  • Reduce batch size in data_pipeline.py:
    EMBED_BATCH_SIZE = 400  # Reduce from 800

Import errors:

  • Ensure dependencies are installed:
    uv sync

"ModuleNotFoundError" in Ray workers:

  • If you see errors like ModuleNotFoundError: No module named 'chromadb' during data pipeline execution, ensure dependencies are installed:
    uv sync
  • For Ray clusters, you may need to specify a runtime environment

"AttributeError: module 'pyarrow' has no attribute 'Table'":

  • This indicates a corrupted PyArrow installation. Fix it by reinstalling:
    uv sync --reinstall-package pyarrow
  • Verify the fix:
    uv run python -c "import pyarrow as pa; print(pa.Table)"

Development

Project Setup

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync

Dependencies

RAG Service (must match your cluster versions):

  • ray[serve]==2.52.1 (must match Anyscale cluster)
  • pyarrow==19.0.1 (required for Anyscale)
  • chromadb>=1.3.5
  • sentence-transformers>=5.1.2
  • fastapi>=0.115.0
  • openai>=1.58.0
  • ebooklib>=0.20
  • langchain-text-splitters>=1.0.0

LLM Service (uses pre-built Anyscale image):

  • Uses anyscale/ray-llm:2.52.1-py311-cu128 image
  • Ray Serve LLM and vLLM are pre-installed with compatible versions
  • No manual dependency installation required

See pyproject.toml for complete RAG service dependencies.

Run Tests

uv run python main.py test

Project Structure

.
├── main.py                 # Main orchestrator CLI
├── README.md               # This file
├── pyproject.toml          # Project dependencies
├── .env.example            # Environment variable template
│
├── src/                    # Main application code
│   ├── data_pipeline.py    # Data ingestion (EPUB → ChromaDB)
│   └── deploy_rag.py       # RAG service deployment
│
├── services/               # Service deployment configurations
│   ├── serve_llm.py        # LLM service definition
│   ├── llm_service.yaml    # Anyscale LLM service config
│   └── rag_service.yaml    # Anyscale RAG service config
│
├── scripts/                # Utility scripts
│   ├── deploy_anyscale.sh  # Automated deployment script
│   ├── inspect_chromadb.py # ChromaDB inspection tool
│   ├── test_search.py      # Search testing tool
│   └── debug_llm_response.py # LLM response debugging
│
├── tests/                  # Test files
│   └── test_deployment.py  # Service testing
│
├── utils/                  # Utility modules
│   └── init_logger.py      # Logging configuration
│
└── examples/               # Example code
    └── actors.py           # Ray actors example

Production Checklist

Before going to production:

  • Load test your services
  • Set appropriate autoscaling limits
  • Configure monitoring and alerting
  • Set up log aggregation
  • Document service URLs and credentials
  • Test failover scenarios
  • Implement rate limiting
  • Add authentication/authorization
  • Configure CORS if needed for web access
  • Set up CI/CD for automated deployments
  • Back up your ChromaDB data
  • Document disaster recovery procedures

License

MIT

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request
