Ray RAG service for Wheel of Time

A demo for my talk at PyData Eindhoven 2025: a Retrieval-Augmented Generation (RAG) system that ingests EPUB files from S3, creates vector embeddings, and deploys a question-answering service using Ray Serve on Anyscale.

Based on the Ray e2e-rag example.

Overview

This project processes EPUB books (e.g., Wheel of Time series) and enables semantic question-answering using:

  • Ray Data for distributed data processing
  • ChromaDB for vector storage
  • Sentence Transformers for embeddings
  • Anyscale LLM Service for text generation (deployed separately)
  • Ray Serve for scalable microservices architecture

Architecture

System Overview

                    ┌─────────────────────────────┐
                    │  Anyscale LLM Service       │
                    │  (Mistral-7B-v0.3 + vLLM)  │
                    │  32K context window         │
                    │  Persistent, External       │
                    └────────────┬────────────────┘
                                 │ HTTP/OpenAI API
                    ┌────────────▼────────────────┐
User ──────────────►│     QAGateway (8000)        │
                    │    (FastAPI)                │
                    └────────────┬────────────────┘
                                 │
                    ┌────────────▼────────────────┐
                    │       QA Service             │
                    │   (orchestrates RAG)         │
                    └──────┬──────────────┬────────┘
                           │              │
                    ┌──────▼──────────┐   │
                    │    Retriever    │   │
                    │ (two-stage)     │   │
                    └──────┬──────────┘   │
                           │              │
           ┌───────────────┼──────────────┼─────┐
           │               │              │     │
    ┌──────▼─────┐  ┌─────▼─────┐  ┌────▼────┐│
    │QueryEncoder│  │  Vector   │  │Reranker ││
    │   (GPU)    │  │  Store    │  │ (GPU)   ││
    │            │  │ (ChromaDB)│  │Cross-Enc││
    └────────────┘  └───────────┘  └─────────┘│
                                               │
                                    ┌──────────▼──────┐
                                    │   LLMClient     │
                                    │(calls external) │
                                    └─────────────────┘

Two-Stage Retrieval Flow:
1. QueryEncoder: Embed query → vector
2. VectorStore: Retrieve 45 candidates (high recall)
3. Reranker: Rerank to top 15 (high precision)
4. LLMClient: Generate answer from reranked context
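
In plain Python, the same flow looks roughly like the sketch below (model names are taken from this README; the collection name and the exact calls in src/deploy_rag.py are assumptions):

# Minimal two-stage retrieval sketch, outside Ray Serve.
import chromadb
from sentence_transformers import SentenceTransformer, CrossEncoder

encoder = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
client = chromadb.PersistentClient("/mnt/cluster_storage/vector_store")
collection = client.get_collection("books")  # collection name is an assumption

def retrieve(query: str, top_k: int = 15, retrieval_k: int = 45) -> list[str]:
    # Stage 1: high-recall vector search over the whole store.
    query_embedding = encoder.encode(query).tolist()
    candidates = collection.query(query_embeddings=[query_embedding], n_results=retrieval_k)["documents"][0]
    # Stage 2: high-precision reranking with a cross-encoder.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]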

Data Pipeline

S3 EPUBs → Parse → Chunk → Embed (GPU) → ChromaDB

Key components:

  • download_books(): Download from S3
  • parse_epub_content(): Extract text from EPUBs
  • Chunker: Split text into chunks
  • Embedder: Generate embeddings (GPU-accelerated)
  • ChromaWriter: Store in ChromaDB
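
The distributed part of this pipeline is the GPU embedding step. A minimal Ray Data sketch of that step (the real code in src/data_pipeline.py chains the download/parse/chunk stages before it and writes to ChromaDB instead of Parquet):

# Sketch of the GPU embedding stage with Ray Data; upstream stages
# (download_books, parse_epub_content, Chunker) feed rows like these.
import ray
from sentence_transformers import SentenceTransformer

class Embedder:
    """Stateful callable so the model loads once per replica, not per batch."""

    def __init__(self) -> None:
        self.model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

    def __call__(self, batch: dict) -> dict:
        batch["embedding"] = self.model.encode(list(batch["text"]), batch_size=64)
        return batch

# Stand-in for the chunked output of the parse/chunk stages.
chunks = [{"text": f"chunk {i}", "book": "The Eye of the World"} for i in range(1_000)]

ds = ray.data.from_items(chunks).map_batches(
    Embedder,
    batch_size=256,
    concurrency=2,  # two Embedder replicas
    num_gpus=1,     # one GPU per replica; drop this on CPU-only machines
)
ds.write_parquet("/tmp/embedded_chunks")  # the real pipeline hands these to ChromaWriter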

RAG Service Microservices

The system consists of two independent Anyscale services:

  1. LLM Service (services/llm_service.yaml)

    • Runs Mistral-7B-Instruct-v0.3 model (upgraded from v0.1)
    • 32K token context window (4x larger than v0.1's 8K)
    • OpenAI-compatible API at /v1/chat/completions
    • Auto-scales based on load
  2. RAG QA Service (services/rag_service.yaml)

    • QueryEncoder: Encode user queries into embeddings (0.1 GPU per replica)
    • VectorStore: Query ChromaDB for relevant document chunks
    • Reranker: Rerank candidates using cross-encoder for precision (0.1 GPU per replica) [NEW]
    • Retriever: Orchestrate two-stage retrieval (vector search → reranking)
    • LLMClient: Call external Anyscale LLM service (or OpenAI)
    • QA: Combine retrieval + generation workflow
    • QAGateway (port 8000): FastAPI HTTP interface for user queries

    Two-Stage Retrieval: By default, retrieves 45 candidates with vector search, then reranks to select the top 15 most relevant chunks. This gives better precision than single-stage retrieval while keeping context manageable.
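
    These deployments are composed with Ray Serve behind a FastAPI ingress. A simplified sketch of that composition (the real definitions live in src/deploy_rag.py and include the QueryEncoder, VectorStore, Reranker, and Retriever tiers):

# Simplified Ray Serve composition; a sketch only, not the full deploy_rag.py.
from fastapi import FastAPI
from ray import serve
from ray.serve.handle import DeploymentHandle

app = FastAPI()

@serve.deployment(autoscaling_config={"min_replicas": 1, "max_replicas": 4})
class QA:
    async def answer(self, query: str, top_k: int) -> str:
        # Orchestrate retrieval (encode -> vector search -> rerank) and generation here.
        return f"(answer for {query!r} using top {top_k} chunks)"

@serve.deployment
@serve.ingress(app)
class QAGateway:
    def __init__(self, qa: DeploymentHandle) -> None:
        self.qa = qa

    @app.get("/answer")
    async def answer(self, query: str, top_k: int = 15) -> dict:
        return {"answer": await self.qa.answer.remote(query, top_k)}

# /rag matches the route prefix used by the endpoints elsewhere in this README.
serve.run(QAGateway.bind(QA.bind()), route_prefix="/rag")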

Quick Start

Prerequisites

  • Anyscale CLI installed: pip install anyscale
  • Anyscale workspace with:
    • GPU quota (for embeddings and LLM)
    • Cluster storage access (for ChromaDB)
  • uv package manager: curl -LsSf https://astral.sh/uv/install.sh | sh

Deployment Options

You can deploy this system in three ways:

  1. Anyscale Job + Services (Recommended) - Job for ingestion, persistent services
  2. Workspace + Services - Manual ingestion in workspace, persistent services
  3. Local Development - Everything runs in your workspace session

Option 1: Anyscale Job + Services (Recommended)

This is the recommended approach for production deployments. You don't need an active workspace!

1. Run Ingestion Job

Submit the ingestion job (no workspace needed):

anyscale job submit -f services/ingestion_job.yaml

This job will:

  • Download EPUBs from S3
  • Generate embeddings (GPU-accelerated)
  • Create ChromaDB vector store
  • Automatically upload to $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz

The job uses Anyscale's built-in ANYSCALE_ARTIFACT_STORAGE environment variable (automatically set).

Monitor the job:

anyscale job logs <job-id> --follow

When complete, the job will print the S3 path where the vector store was uploaded.

2. Deploy Services

Once ingestion completes, deploy the services from anywhere (no workspace needed):

# 1. Deploy LLM service
anyscale service deploy -f services/llm_service.yaml

# 2. Get LLM service URL and update services/rag_service.yaml:
#    env_vars:
#      LLM_SERVICE_BASE_URL: "https://your-llm-url.com/v1"
#      LLM_API_KEY: "your-token"
#      VECTOR_STORE_S3_PATH: "s3://your-bucket/artifacts/vector_store.tar.gz"

# 3. Deploy RAG service (downloads vector store from S3 at startup)
anyscale service deploy -f services/rag_service.yaml

Key Benefits:

  • ✅ No workspace needed - run from anywhere
  • ✅ Automatic S3 upload/download
  • ✅ Isolated ingestion job (auto-terminates when done)
  • ✅ Services download vector store from S3

Option 2: Workspace + Services (Manual)

Run ingestion in a workspace, then deploy as services.

0. Install Dependencies

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install project dependencies
uv sync

1. Ingest Data

From your Anyscale workspace:

# Run ingestion (automatically uploads to S3)
uv run main.py ingest

This creates the ChromaDB vector store and uploads it to $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz.

Note: ANYSCALE_ARTIFACT_STORAGE is automatically set in Anyscale workspaces.

2. Deploy Services

Use the automated deployment script (handles S3 upload automatically):

# Deploy both LLM and RAG services
./scripts/deploy_anyscale.sh both

The script will automatically:

  1. Deploy the LLM service
  2. Upload the vector store to S3 ($ANYSCALE_ARTIFACT_STORAGE)
  3. Deploy the RAG service (which downloads from S3 at startup)

Or deploy individually:

# Deploy LLM service only
./scripts/deploy_anyscale.sh llm

# Get LLM service credentials from Anyscale console and update services/rag_service.yaml

# Deploy RAG service (uploads vector store to S3 first)
./scripts/deploy_anyscale.sh rag

3. Query the Service

Get your service credentials from Anyscale console, then:

# Option A: Pass credentials as command-line arguments
uv run main.py query "What is the main theme?" \
  --service-url https://YOUR_RAG_SERVICE_URL/rag \
  --auth-token YOUR_SERVICE_TOKEN \
  --auth-version YOUR_SERVICE_VERSION

# Option B: Set environment variables (recommended)
export RAG_SERVICE_URL="https://YOUR_RAG_SERVICE_URL/rag"
export ANYSCALE_SERVICE_TOKEN="YOUR_SERVICE_TOKEN"
export ANYSCALE_SERVICE_VERSION="YOUR_SERVICE_VERSION"

# Option C: Use a .env file (even better!)
cp .env.example .env
# Edit .env with your credentials

# With Option B or C, query without repeating credentials
uv run main.py query "What is the main theme?"

# Interactive mode
uv run main.py query --interactive

# With more context
uv run main.py query "What is the main theme?" --top-k 20

Environment Variables:

  • RAG_SERVICE_URL: Your deployed RAG service URL (default: http://localhost:8000/rag)
  • ANYSCALE_SERVICE_TOKEN: Bearer token for authentication
  • ANYSCALE_SERVICE_VERSION: Service version header

Note: The first query after a fresh deployment triggers the RAG service to load the embeddings. This takes a few minutes, so that query is likely to time out on the client side. Subsequent queries should work without issue.

Tip: Copy .env.example to .env and configure your credentials there. The script will automatically load them.
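
You can also hit the HTTP endpoint directly instead of going through main.py; a minimal sketch with requests, using the /answer path and query parameters shown in the curl examples later in this README:

# Query the deployed RAG service over HTTP, reusing the env vars above.
import os
import requests

base_url = os.environ["RAG_SERVICE_URL"]  # e.g. https://YOUR_RAG_SERVICE_URL/rag
headers = {
    "Authorization": f"Bearer {os.environ['ANYSCALE_SERVICE_TOKEN']}",
    "X-ANYSCALE-VERSION": os.environ.get("ANYSCALE_SERVICE_VERSION", ""),
}
params = {"query": "What is the main theme?", "top_k": 15, "retrieval_k": 45, "use_reranking": "true"}

response = requests.get(f"{base_url}/answer", headers=headers, params=params, timeout=300)
response.raise_for_status()
print(response.json())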

Option 3: Local Development

For development and testing, run services locally in your workspace:

1. Deploy LLM Service (Anyscale)

anyscale service deploy -f services/llm_service.yaml

Get service URL and add to .env:

echo "LLM_SERVICE_BASE_URL=https://your-llm-service-url.com/v1" > .env
echo "LLM_API_KEY=your-token" >> .env

2. Ingest Data (Local Only - No S3 Upload)

# Skip S3 upload for local development
uv run main.py ingest --no-upload

3. Deploy RAG Locally

uv run main.py deploy

4. Query (Local)

# Single query
uv run main.py query "What is the main theme of the books?"

# Interactive mode
uv run main.py query --interactive

This uses the local RAG service at http://localhost:8000/rag.

Deployment Guide

Automated Deployment (Recommended)

Use the deployment script:

# Deploy both services
./scripts/deploy_anyscale.sh both

# Or deploy individually
./scripts/deploy_anyscale.sh llm   # Deploy LLM only
./scripts/deploy_anyscale.sh rag   # Deploy RAG only

Manual Deployment

Step 1: Deploy LLM Service

anyscale service deploy -f services/llm_service.yaml

Wait 5-10 minutes for the service to:

  • Download Mistral-7B model (~14GB)
  • Load model into GPU memory
  • Initialize vLLM engine

Get service credentials:

anyscale service list

Test the LLM service:

curl -H "Authorization: Bearer YOUR_TOKEN" \
     -H "X-ANYSCALE-VERSION: YOUR_VERSION" \
     -H "Content-Type: application/json" \
     https://YOUR_SERVICE_URL/v1/chat/completions \
     -d '{"messages": [{"role": "user", "content": "Hello!"}], "model": "mistral-7b-instruct", "max_tokens": 50}'

Step 2: Run Data Ingestion

Before deploying RAG service, ingest your data:

# From your workspace
uv run main.py ingest

# Or with custom configuration
uv run main.py ingest --s3-path s3://your-bucket/path --chunk-size 4096

This creates the ChromaDB vector store at /mnt/cluster_storage/vector_store.

Verify ingestion:

ls -lh /mnt/cluster_storage/vector_store

Upload vector store to S3:

The vector store needs to be uploaded to S3 so the Anyscale service can access it:

# Create tarball
cd /mnt/cluster_storage
tar -czf vector_store.tar.gz vector_store/

# Upload to Anyscale artifact storage
aws s3 cp vector_store.tar.gz $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz

The RAG service will automatically download and extract this at startup.
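
For reference, the download-and-extract step the service performs at startup is roughly equivalent to this sketch (the actual startup code may differ):

# Roughly what the RAG service does at startup: fetch the tarball from S3
# and unpack it where ChromaDB expects it.
import os
import tarfile
from urllib.parse import urlparse

import boto3

s3_path = os.environ["VECTOR_STORE_S3_PATH"]  # s3://your-bucket/artifacts/vector_store.tar.gz
parsed = urlparse(s3_path)

local_tarball = "/tmp/vector_store.tar.gz"
boto3.client("s3").download_file(parsed.netloc, parsed.path.lstrip("/"), local_tarball)

with tarfile.open(local_tarball) as tar:
    tar.extractall("/mnt/cluster_storage")  # yields /mnt/cluster_storage/vector_store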

Step 3: Configure RAG Service

Update services/rag_service.yaml with your LLM service credentials:

env_vars:
  LLM_SERVICE_BASE_URL: "https://your-llm-service-url.com/v1"
  LLM_API_KEY: "your-token"
  LLM_MODEL: "mistral-7b-instruct"

Alternatively, set these in .env file:

LLM_SERVICE_BASE_URL=https://your-llm-service-url.com/v1
LLM_API_KEY=your-token
LLM_MODEL=mistral-7b-instruct

Step 4: Deploy RAG Service

anyscale service deploy -f services/rag_service.yaml

This takes 2-3 minutes to:

  • Download embedding model
  • Initialize ChromaDB connection
  • Connect to LLM service
  • Start autoscaling replicas

Get service URL:

anyscale service list

Step 5: Test RAG Service

Health check:

curl -H "Authorization: Bearer YOUR_RAG_TOKEN" \
     -H "X-ANYSCALE-VERSION: YOUR_RAG_VERSION" \
     https://YOUR_RAG_SERVICE_URL/rag/

Query the RAG:

uv run main.py query "What is the main theme?" --service-url https://YOUR_RAG_SERVICE_URL/rag --top-k 15

Service Management

View Service Status

anyscale service list
anyscale service status rag-qa-service
anyscale service status rag-llm-service

View Logs

# RAG service logs
anyscale service logs rag-qa-service

# LLM service logs
anyscale service logs rag-llm-service

# Follow logs in real-time
anyscale service logs rag-qa-service --follow

Update Service

After making code changes:

# Update RAG service
anyscale service deploy -f services/rag_service.yaml

# Or force rollout
anyscale service rollout rag-qa-service

Scale Service

Edit the autoscaling configuration in the Serve deployment decorators:

# In deploy_rag.py, update deployment decorators:
@serve.deployment(
    autoscaling_config={"min_replicas": 2, "max_replicas": 10},
)

Then redeploy.

Delete Service

anyscale service terminate rag-qa-service
anyscale service terminate rag-llm-service

Configuration

Current Configuration (Optimized for Two-Stage Retrieval)

Chunking Parameters (data_pipeline.py)

CHUNK_SIZE = 4096        # 2x original size (was 2048)
CHUNK_OVERLAP = 1536     # 37% overlap (was 10%)

Impact:

  • Each chunk: ~4,000 characters (2-3 pages of text)
  • Overlap: ~1,500 characters shared between adjacent chunks
  • Result: Maximum context continuity, minimal information loss at boundaries
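
These constants map directly onto the text splitter settings; a minimal sketch, assuming the RecursiveCharacterTextSplitter from langchain-text-splitters (already in the dependency list; the exact splitter class used in data_pipeline.py may differ):

# How the chunking constants translate into splitter settings.
from langchain_text_splitters import RecursiveCharacterTextSplitter

CHUNK_SIZE = 4096
CHUNK_OVERLAP = 1536

splitter = RecursiveCharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

book_text = " ".join(f"word{i}" for i in range(20_000))  # stand-in for one parsed EPUB section
chunks = splitter.split_text(book_text)
print(len(chunks), len(chunks[0]))  # each chunk is at most 4096 characters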

Retrieval Parameters (deploy_rag.py)

# Two-stage retrieval (default)
top_k = 15           # Final chunks after reranking
retrieval_k = 45     # Candidates before reranking (3x top_k)
use_reranking = True # Enable cross-encoder reranking

Impact with Two-Stage Retrieval:

  • Stage 1 (Vector Search): Retrieves 45 candidates (high recall)
  • Stage 2 (Reranking): Selects 15 most relevant (high precision)
  • Final context: ~60,000 characters (15 chunks × 4096 chars)
  • Equivalent to: ~30 pages of book text per query
  • Better quality than single-stage retrieval with same context size
  • Users can increase to top_k=20-30 with the larger 32K context window

Configuration Comparison

Metric                 Original       Before Reranking   Current (with Reranking)
Chunk size             2,048          4,096              4,096
Overlap %              10%            37%                37%
Retrieval method       Single-stage   Single-stage       Two-stage (retrieve + rerank)
Candidates retrieved   3              15                 45
Final top_k            3              15                 15
Reranking              No             No                 Yes (cross-encoder)
Default context        ~6K chars      ~60K chars         ~60K chars
LLM context window     8K tokens      8K tokens          32K tokens (Mistral-v0.3)
Max usable top_k       ~3-5           ~10-15             20-30+
Quality                Baseline       Better             Best

LLM Service Configuration

The RAG service requires an external LLM service. You have two options:

Option 1: Anyscale LLM Service (Recommended)

Deploy the included Mistral-7B-Instruct-v0.3 service with 32K context window:

anyscale service deploy -f services/llm_service.yaml

This upgraded model (v0.3) provides:

  • 4x larger context than v0.1 (32K vs 8K tokens)
  • Support for higher top_k values (20-30+ chunks)
  • Better handling of long contexts

Then configure the service URL in .env:

echo "LLM_SERVICE_BASE_URL=https://your-service-url.anyscale.com" > .env

Option 2: OpenAI

Use OpenAI's API instead:

echo "OPENAI_API_KEY=sk-your-key" > .env
echo "LLM_MODEL=gpt-3.5-turbo" >> .env  # 16K context
# Or use gpt-4-turbo for 128K context
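
Either way, the LLMClient deployment talks to the backend through the OpenAI-compatible chat completions API; conceptually something like this sketch (environment variable names follow the ones used in this README):

# Both backends expose the OpenAI chat completions API.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("LLM_SERVICE_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.environ.get("LLM_API_KEY") or os.environ.get("OPENAI_API_KEY", ""),
)

completion = client.chat.completions.create(
    model=os.environ.get("LLM_MODEL", "mistral-7b-instruct"),
    messages=[{"role": "user", "content": "Who is Rand al'Thor?"}],
    max_tokens=200,
)
print(completion.choices[0].message.content)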

Data Pipeline Configuration

Edit data_pipeline.py constants or use command-line arguments:

S3_BUCKET_PATH = "s3://your-bucket"
EMBEDDER_MODEL = "intfloat/multilingual-e5-large-instruct"
CHUNK_SIZE = 4096
CHUNK_OVERLAP = 1536

Custom Configuration Examples

Custom Chunk Size

uv run python main.py ingest --chunk-size 4096

Different Embedding Model

uv run python main.py ingest --embedding-model BAAI/bge-large-en-v1.5

Query with Two-Stage Retrieval (Default)

# Default: retrieve 45 candidates, rerank to top 15
uv run python main.py query "What happened?" --top-k 15

Query with Custom Retrieval Settings

# Retrieve 60 candidates, rerank to top 20 (uses more context)
uv run python main.py query "Who is Rand?" --top-k 20 --retrieval-k 60

# Disable reranking for faster (but lower quality) responses
uv run python main.py query "Quick test" --no-reranking --top-k 10

Direct API Usage with Reranking

# Two-stage retrieval via HTTP
curl "http://localhost:8000/rag/answer?query=What%20is%20magic?&top_k=15&retrieval_k=45&use_reranking=true"

# Single-stage retrieval (faster, lower quality)
curl "http://localhost:8000/rag/answer?query=Quick%20test&top_k=10&use_reranking=false"

Performance Tuning

Improvements Implemented

The RAG system has been optimized from the original baseline configuration:

1. Enhanced Chunking Strategy

Before:

  • Chunk size: 2048 characters
  • Overlap: 200 characters (~10%)
  • Total chunks: 54,964

After:

  • Chunk size: 4096 characters (+100% larger)
  • Overlap: 1536 characters (~37%)
  • Total chunks: ~9,000-11,000 (fewer but richer chunks)

Benefits:

  • Each chunk contains significantly more context
  • 37% overlap ensures important information isn't split
  • Adjacent chunks share substantial context for continuity
  • Better chance of capturing complete thoughts/descriptions

2. Increased Retrieval Context

Before:

  • Default top_k: 3 chunks
  • Average context: ~6,000 characters

After:

  • Default top_k: 15 chunks
  • Average context: ~60,000 characters

Benefits:

  • LLM receives 10x more context
  • Higher chance of finding descriptive passages
  • Can synthesize information across multiple mentions
  • Better handles complex/multi-faceted questions

3. Two-Stage Retrieval with Reranking

Before (Single-Stage):

  • Vector search retrieves top_k=15 chunks directly
  • Uses bi-encoder similarity (cosine distance)
  • Fast but may miss relevant chunks ranked 16-45

After (Two-Stage):

  • Stage 1: Vector search retrieves 45 candidates (3x more)
  • Stage 2: Cross-encoder reranks to select top 15 most relevant
  • Uses ms-marco-MiniLM-L-6-v2 cross-encoder

Benefits:

  • Higher recall: Casts wider net in stage 1 (45 vs 15)
  • Higher precision: Cross-encoder is more accurate than bi-encoder
  • Better relevance: Cross-encoder encodes query+document together
  • Same context size: Still uses 15 chunks, but they're the BEST 15
  • Marginal overhead: Reranking 45 chunks takes ~50-100ms

Why cross-encoders are better:

  • Bi-encoders (vector search): Encode query and document separately, then compare
  • Cross-encoders (reranker): Encode query+document together, can capture interactions
  • Cross-encoders are slower but much more accurate for ranking
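
The difference is easy to see in code: the bi-encoder scores query and document independently, while the cross-encoder reads them as a single pair (a sketch using the model names mentioned in this README):

# Bi-encoder vs cross-encoder scoring for the same (query, document) pair.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "Who is Tylin?"
document = "Tylin Quintara was the Queen of Altara, holding court in the Tarasin Palace in Ebou Dar."

# Bi-encoder: two independent embeddings, compared by cosine similarity.
bi_encoder = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
bi_score = util.cos_sim(bi_encoder.encode(query), bi_encoder.encode(document))

# Cross-encoder: one forward pass over the pair, capturing query-document interactions.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_score = cross_encoder.predict([(query, document)])[0]

print(float(bi_score), float(cross_score))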

4. Upgraded LLM Model (Mistral-7B-v0.3)

Before:

  • Model: Mistral-7B-Instruct-v0.1
  • Context window: 8,192 tokens (~32K characters)
  • Max usable top_k: ~10-15 chunks

After:

  • Model: Mistral-7B-Instruct-v0.3
  • Context window: 32,768 tokens (~128K characters)
  • Max usable top_k: 20-30+ chunks

Benefits:

  • 4x larger context window allows more chunks without overflow
  • Can use higher top_k values for complex queries
  • Better handling of long contexts
  • No quality degradation with increased context

5. Improved Prompt Engineering

Before:

Given the following context from books:
{composed_context}

Answer the following question:
{query}

If you cannot provide an answer based on the context, please say "I don't know."
Do not use the term "context" in your response.

After:

You are answering questions about The Wheel of Time book series by Robert Jordan.

Context from the books:
{composed_context}

Question: {query}

Instructions:
- Answer based on the provided passages
- Synthesize information across multiple passages if needed
- If the passages mention but don't fully explain something, provide what information is available
- Include relevant details like names, places, relationships when mentioned
- Only say "I don't know" if the passages contain absolutely no relevant information
- Be concise but informative

Benefits:

  • Sets context (Wheel of Time series)
  • Encourages synthesis across passages
  • Allows partial answers when full info unavailable
  • Requests relevant details (names, places, relationships)
  • Less likely to say "I don't know" inappropriately
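
In code, the composed context is just the reranked chunks joined together and substituted into the template above; a minimal sketch (the chunk separator is an assumption):

# Building the final prompt from the reranked chunks, using the template shown above.
PROMPT_TEMPLATE = """You are answering questions about The Wheel of Time book series by Robert Jordan.

Context from the books:
{composed_context}

Question: {query}

Instructions:
- Answer based on the provided passages
...
"""  # remaining instruction lines omitted here; see the full template above

def build_prompt(query: str, chunks: list[str]) -> str:
    composed_context = "\n\n---\n\n".join(chunks)  # separator between chunks is an assumption
    return PROMPT_TEMPLATE.format(composed_context=composed_context, query=query)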

Optimal Usage Patterns

For Simple Factual Questions

# Default: retrieve 45 candidates, rerank to top 15
uv run main.py query "Who is Rand?" --service-url YOUR_SERVICE_URL

# Or with explicit parameters
uv run main.py query "Who is Rand?" --service-url YOUR_SERVICE_URL --top-k 15 --retrieval-k 45

For Complex Character Analysis

# Use 20-25 chunks for comprehensive coverage
# Retrieves 60-75 candidates, reranks to top 20-25
uv run main.py query "Describe Rand's character arc" --service-url YOUR_SERVICE_URL --top-k 25 --retrieval-k 75

For Cross-Book Comparisons

# Maximum context with 32K context window
# Retrieves 90 candidates, reranks to top 30
uv run main.py query "Compare Rand and Mat's leadership styles" --service-url YOUR_SERVICE_URL --top-k 30 --retrieval-k 90

For Speed-Critical Applications

# Disable reranking for faster responses (lower quality)
uv run main.py query "Quick answer" --service-url YOUR_SERVICE_URL --no-reranking --top-k 10

Expected Performance Improvements

Before All Improvements (Baseline)

Query: "Who is Tylin?"

Retrieved chunks: 3 small chunks (2048 chars each) mentioning Tylin
LLM response: "I don't know."

Why it failed:

  • Chunks mentioned Tylin but didn't describe her
  • Too few chunks to synthesize information
  • Prompt demanded certainty
  • Small context window limited retrieval

After All Improvements (Current)

Query: "Who is Tylin?"

Retrieval process:

  1. Vector search retrieves 45 candidates (larger search space)
  2. Cross-encoder reranks to select 15 most relevant chunks
  3. Each chunk is 4096 chars with 37% overlap

LLM response: Should provide detailed information about Tylin from multiple contexts

Why it works better:

  • Two-stage retrieval: Higher chance of finding relevant chunks (45 candidates vs 3)
  • Reranking: Cross-encoder selects BEST 15, not just closest 15
  • Larger chunks: More complete context per chunk (4096 vs 2048)
  • More overlap: Better narrative continuity (37% vs 10%)
  • Larger LLM context: Can handle 30+ chunks if needed (32K vs 8K)
  • Better prompt: Synthesizes partial information

Performance Comparison

Metric             Baseline       After Chunking   After Reranking   With Larger Model
Chunks retrieved   3              15               45 → 15           45 → 30+
Chunk size         2048           4096             4096              4096
Retrieval method   Single-stage   Single-stage     Two-stage         Two-stage
Reranking          No             No               Yes               Yes
LLM context        8K tokens      8K tokens        32K tokens        32K tokens
Answer quality     Poor           Good             Better            Best
Latency            Fast           Fast             +50-100ms         +50-100ms

Data Ingestion Stats

Processing Pipeline

  1. Download 15 EPUB books from S3
  2. Parse ~8,000 pages of text
  3. Chunk into ~9,000-11,000 chunks (fewer but larger)
  4. Generate 1024-dim embeddings with multilingual-e5-large-instruct
  5. Store in ChromaDB with metadata (book name, page, source)

Estimated Resources

  • Processing time: ~15-20 minutes
  • GPU usage: ~10-15 minutes for embedding generation
  • ChromaDB size: ~500-600 MB
  • Total chunks: ~9,000-11,000

Future Enhancements

Potential improvements for even better performance:

  1. Hybrid search: Combine vector search + keyword search (BM25 + dense retrieval)
  2. Query expansion: Automatically expand queries for better retrieval
  3. Metadata filtering: Filter by specific books, characters, or topics
  4. Iterative retrieval: Retrieve, then retrieve again based on initial results
  5. Larger LLM: Use more capable model (Llama-3.1-70B, GPT-4, Claude)
  6. Fine-tuned embeddings: Train embeddings specifically for Wheel of Time
  7. Graph RAG: Build knowledge graph of character relationships and plot events
  8. Advanced reranking: Use larger cross-encoders (e.g., ms-marco-electra-base) or domain-specific rerankers

Already Implemented:

  • Two-stage retrieval with cross-encoder reranking
  • Larger context window (Mistral-7B-v0.3 with 32K tokens)

Cost Optimization

Minimize Costs

  1. Use fractional GPUs:

    ray_actor_options={"num_gpus": 0.1}
  2. Reduce autoscaling:

    autoscaling_config={"min_replicas": 1, "max_replicas": 2}
  3. Terminate when not in use:

    anyscale service terminate rag-qa-service
  4. Use smaller models:

    • Consider Mistral-7B instead of Llama-70B
    • Use quantized models (GPTQ, AWQ)

Monitor Costs

anyscale service metrics rag-qa-service
anyscale service metrics rag-llm-service

Troubleshooting

"ChromaDB collection not found"

Run data ingestion first:

uv run python main.py ingest

"LLM service not configured"

The RAG service requires an external LLM service to be configured:

  1. Deploy the Anyscale LLM service (recommended):

    anyscale service deploy -f services/llm_service.yaml

    Then add the service URL to .env:

    echo "LLM_SERVICE_BASE_URL=https://your-service-url.anyscale.com" > .env
  2. Or use OpenAI:

    echo "OPENAI_API_KEY=sk-your-key" > .env

Make sure the LLM service is deployed and configured before starting the RAG service.

Service not responding

Validate environment and test the service:

# Validate environment
uv run main.py validate

# Test a query
uv run main.py query "test" --service-url http://localhost:8000/rag

LLM Service Issues

Service not responding:

  • Check logs: anyscale service logs rag-llm-service
  • Look for OOM (Out of Memory) errors
  • Verify GPU quota and node availability

Model download timeout:

  • Service will retry automatically
  • Check network connectivity to HuggingFace
  • May need to increase timeout in service config

Deployment errors:

  • Verify you're using the correct image: anyscale/ray-llm:2.52.1-py311-cu128
  • Ensure serve_llm.py uses the correct LLMConfig API format with:
    • model_loading_config dict containing model_id and model_source
    • accelerator_type (not accelerator_type_or_device)
    • deployment_config with nested autoscaling_config
    • engine_kwargs (not engine_config)
    • build_openai_app({"llm_configs": [llm_config]}) format
  • Check that serve_llm.py uses dict() not {} for config parameters
  • Ensure the import path is serve_llm:app (not serve_llm:llm_app)
  • Check Anyscale console logs for specific errors

RAG Service Issues

"LLM service not configured" error:

  • Verify LLM_SERVICE_BASE_URL is set in services/rag_service.yaml
  • Test LLM service independently first
  • Check authentication credentials

"Collection not found" error:

  • Verify data ingestion completed: ls /mnt/cluster_storage/vector_store
  • Verify vector store uploaded to S3: Check $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz exists
  • Check VECTOR_STORE_S3_PATH in services/rag_service.yaml points to the correct S3 location
  • Re-run ingestion and upload if needed:
    uv run main.py ingest
    cd /mnt/cluster_storage && tar -czf vector_store.tar.gz vector_store/
    aws s3 cp vector_store.tar.gz $ANYSCALE_ARTIFACT_STORAGE/vector_store.tar.gz

Slow query responses:

  • Check if LLM service is under load
  • Increase max_replicas for autoscaling
  • Monitor with: anyscale service metrics rag-qa-service

Encoding errors:

  • Verify embedding model matches data ingestion
  • Check GPU availability for QueryEncoder deployment
  • Review logs for OOM errors

General Issues

"Prefix / is being used" error:

  • Services have conflicting route prefixes
  • Use /rag for RAG service, / for LLM service
  • Update route_prefix in YAML files

Resource quota exceeded:

  • Check workspace quota: anyscale workspace quota
  • Reduce max_replicas in autoscaling config
  • Use fractional GPUs (0.1 GPU) where possible

GPU memory errors during ingestion:

  • Reduce batch size in data_pipeline.py:
    EMBED_BATCH_SIZE = 400  # Reduce from 800

Import errors:

  • Ensure dependencies are installed:
    uv sync

"ModuleNotFoundError" in Ray workers:

  • If you see errors like ModuleNotFoundError: No module named 'chromadb' during data pipeline execution, ensure dependencies are installed:
    uv sync
  • For Ray clusters, you may need to specify a runtime environment

"AttributeError: module 'pyarrow' has no attribute 'Table'":

  • This indicates a corrupted PyArrow installation. Fix it by reinstalling:
    uv sync --reinstall-package pyarrow
  • Verify the fix:
    uv run python -c "import pyarrow as pa; print(pa.Table)"

Development

Project Setup

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync

Dependencies

RAG Service (must match your cluster versions):

  • ray[serve]==2.52.1 (must match Anyscale cluster)
  • pyarrow==19.0.1 (required for Anyscale)
  • chromadb>=1.3.5
  • sentence-transformers>=5.1.2
  • fastapi>=0.115.0
  • openai>=1.58.0
  • ebooklib>=0.20
  • langchain-text-splitters>=1.0.0

LLM Service (uses pre-built Anyscale image):

  • Uses anyscale/ray-llm:2.52.1-py311-cu128 image
  • Ray Serve LLM and vLLM are pre-installed with compatible versions
  • No manual dependency installation required

See pyproject.toml for complete RAG service dependencies.

Run Tests

uv run python main.py test

Project Structure

.
├── main.py                 # Main orchestrator CLI
├── README.md               # This file
├── pyproject.toml          # Project dependencies
├── .env.example            # Environment variable template
│
├── src/                    # Main application code
│   ├── data_pipeline.py    # Data ingestion (EPUB → ChromaDB)
│   └── deploy_rag.py       # RAG service deployment
│
├── services/               # Service deployment configurations
│   ├── serve_llm.py        # LLM service definition
│   ├── llm_service.yaml    # Anyscale LLM service config
│   └── rag_service.yaml    # Anyscale RAG service config
│
├── scripts/                # Utility scripts
│   ├── deploy_anyscale.sh  # Automated deployment script
│   ├── inspect_chromadb.py # ChromaDB inspection tool
│   ├── test_search.py      # Search testing tool
│   └── debug_llm_response.py # LLM response debugging
│
├── tests/                  # Test files
│   └── test_deployment.py  # Service testing
│
├── utils/                  # Utility modules
│   └── init_logger.py      # Logging configuration
│
└── examples/               # Example code
    └── actors.py           # Ray actors example

Production Checklist

Before going to production:

  • Load test your services
  • Set appropriate autoscaling limits
  • Configure monitoring and alerting
  • Set up log aggregation
  • Document service URLs and credentials
  • Test failover scenarios
  • Implement rate limiting
  • Add authentication/authorization
  • Configure CORS if needed for web access
  • Set up CI/CD for automated deployments
  • Back up your ChromaDB data
  • Document disaster recovery procedures

License

MIT

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request
