
Marmara CSE RAG System (Python Version - Iteration 2)

Course: CSE3063 - Object-Oriented Analysis and Design
Term Project: Iteration 2 (Extensibility & Evaluation)
Language: Python 3.11+
Tests: 88 passing unit tests

1. Overview

This is the Python implementation of a modular Retrieval-Augmented Generation (RAG) chatbot designed to answer questions about the Marmara University Computer Engineering department (staff, courses, policies).

Iteration 2 Highlights:

  • AI-Powered Answer Generation: Gemini API integration for natural language responses
  • Comprehensive Test Suite: 88 unit tests covering all major components
  • Evaluation Framework: Systematic testing with EvalHarness
  • Batch Processing: CLI mode for evaluating multiple queries with performance metrics
  • Multiple Reranking Strategies: Jaccard similarity, cosine similarity, and simple proximity-based scoring
  • Production-Ready Logging: JSONL trace logs for debugging and analysis

Core Architecture Components:

Component           Implementation        Description
Controller          ChatBot               GRASP Controller; orchestrates the full pipeline
Intent Detection    RuleIntentDetector    Rule-based keyword matching
Query Writing       HeuristicQueryWriter  Stopword filtering & intent boosting
Retrieval           KeywordRetriever      TF-based keyword retrieval
Reranking           Multiple strategies   Jaccard, Cosine, or Simple (proximity-based)
Answer Generation   GeminiAnswerAgent     AI-powered contextual responses

2. Directory Structure

The project is designed to run self-contained from the root directory.

.
├── main.py                           # Entry Point (Single & Batch Modes)
├── config.yaml                       # Main Configuration File
├── chunks.json                       # Document Data Store
├── index.json                        # Search Index
├── requirements.txt                  # Python Dependencies
├── env.env                           # Environment Variables (API Key) - DO NOT COMMIT
├── env.env.template                  # Template for env.env
├── eval_queries.json                 # Sample Evaluation Queries
├── CSE3063F25_Grp15_Iter2_7_CLI_output.txt  # Persistent Output Log
├── README.md                         # This file
├── ENV_SETUP_GUIDE.md                # Environment Setup Guide
│
├── evaluation/                       # Evaluation Results
│   ├── eval_results_*.json          # Individual query results
│   └── eval_report_*.json           # Aggregate metrics
│
├── logs/                             # Execution trace logs (.jsonl)
│   └── run-*.jsonl                  # Timestamped trace logs
│
├── config/                           # Configuration Loading & Data Structures
│   ├── __init__.py
│   ├── app_config.py                 # Application Configuration Class
│   └── config_loader.py              # YAML Config Parser
│
├── entities/                         # Domain Objects
│   ├── __init__.py
│   ├── answer.py                     # Answer Entity
│   ├── chunk.py                      # Document Chunk
│   ├── context.py                    # Pipeline Context
│   ├── hit.py                        # Retrieval Hit
│   ├── intent.py                     # Intent Enum
│   ├── eval_query.py                 # Evaluation Query Entity
│   └── eval_result.py                # Evaluation Result Entity
│
├── helpers/                          # Controllers & Utilities
│   ├── __init__.py
│   ├── output_writer.py              # File Output Handler
│   ├── chat_bot.py                   # Main Pipeline Controller
│   ├── eval_harness.py               # Evaluation Framework
│   ├── batch_eval_runner.py          # Batch Evaluation Runner
│   ├── reranker_factory.py           # Reranker Factory Pattern
│   └── answer_agent_factory.py       # Answer Agent Factory Pattern
│
├── service_interfaces/               # Interfaces for Pipeline Stages (Strategy Pattern)
│   ├── __init__.py
│   ├── i_answer_agent.py             # Answer Generation Interface
│   ├── i_intent_detector.py          # Intent Detection Interface
│   ├── i_query_writer.py             # Query Writing Interface
│   ├── i_reranker.py                 # Reranking Interface
│   └── i_retriever.py                # Retrieval Interface
│
├── services/                         # Concrete Implementations of Strategies
│   ├── __init__.py
│   ├── heuristic_query_writer.py     # Stopword Filtering & Intent Boosting
│   ├── keyword_retriever.py          # TF-based Keyword Retrieval
│   ├── rule_intent_detector.py       # Rule-based Intent Detection
│   ├── simple_reranker.py            # Proximity-based Reranking
│   ├── jaccard_reranker.py           # Jaccard Similarity Reranker
│   ├── template_answer_agent.py      # Template-based Answer Generation
│   └── gemini_answer_agent.py        # Gemini AI Answer Agent
│
├── tests/                            # Unit Tests (88 tests)
│   ├── __init__.py
│   ├── test_answer_agent_factory.py  # Factory pattern tests (5 tests)
│   ├── test_gemini_answer_agent.py   # AI agent tests (23 tests)
│   ├── test_heuristic_query_writer.py # Query writer tests (6 tests)
│   ├── test_jaccard_reranker.py      # Jaccard reranker tests (13 tests)
│   ├── test_keyword_retriever.py     # Retriever tests (23 tests)
│   ├── test_reranker_factory.py      # Reranker factory tests (12 tests)
│   └── test_rule_intent_detector.py  # Intent detection tests (6 tests)
│
└── trace/                            # Observer Pattern for Logging
    ├── __init__.py
    ├── jsonl_trace_sink.py           # JSONL File Logger
    ├── trace_bus.py                  # Event Publisher
    ├── trace_event.py                # Trace Event Model
    └── trace_observer.py             # Observer Interface

3. Configuration Schema (config.yaml)

The application's logic is driven entirely by config.yaml, fulfilling the requirement for "config-driven strategy selection." A minimal loading sketch follows the schema below.

config.yaml
├── strategies/             # Strategy Selection (Class Mapping)
│   ├── intentDetector      # "RuleBased"
│   ├── queryWriter         # "Heuristic"
│   ├── retriever           # "Keyword"
│   ├── reranker            # "jaccard" (simple, jaccard, cosine)
│   └── answerAgent         # "gemini" (gemini, template)

├── parameters/             # Algorithm Tuning & Logic
│   ├── retrieverK          # (int) Number of docs to fetch (default: 6)
│   ├── proximityBonus      # (int) Score bonus for close terms (default: 5)
│   ├── titleBoost          # (int) Score multiplier for titles (default: 3)
│   ├── proximityWindow     # (int) Max distance for proximity check (default: 15)
│   └── intentPriority/     # (List) Tie-breaking order
│       ├── StaffLookup
│       ├── Registration
│       ├── PolicyFAQ
│       └── Course

├── stopwords/              # (List) Common words to ignore
│   ├── "a", "about", "am", "an", "and"...
│   └── ... (70+ words)

└── intentRules/            # (Map) Knowledge Base for Detection & Boosting
    ├── StaffLookup/        # Keywords for staff queries
    │   ├── "professor", "staff", "instructor"
    │   ├── "office", "email", "contact"
    │   └── ...
    ├── Registration/       # Keywords for enrollment/admin
    │   ├── "enroll", "register", "deadline"
    │   └── ...
    ├── PolicyFAQ/          # Keywords for rules/exams
    │   ├── "regulation", "grade", "exam"
    │   └── ...
    └── Course/             # Keywords for curriculum
        ├── "credit", "syllabus", "prerequisite"
        └── ...
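
As an illustration of how this schema drives the pipeline, the sketch below loads the file with PyYAML and pulls out a few values. It is a simplified stand-in for the project's config_loader.py; the variable names are illustrative.

import yaml

with open("config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Strategy names select which concrete classes the factories instantiate.
reranker_name = config["strategies"]["reranker"]   # e.g. "jaccard"
retriever_k = config["parameters"]["retrieverK"]   # e.g. 6
stopwords = set(config["stopwords"])
staff_keywords = config["intentRules"]["StaffLookup"]

print(f"reranker={reranker_name}, k={retriever_k}, stopwords={len(stopwords)}")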

4. Commands (How to Run)

Prerequisites

  • Python 3.11 or higher installed (python --version)
  • Required libraries installed (pip install -r requirements.txt)
  • Google API Key set in env.env file (for Gemini AI)
  • The files main.py, config.yaml, chunks.json, and index.json must be in the same folder.

Quick Setup

1. Install Dependencies

pip install -r requirements.txt

2. Configure API Key

Create env.env file in project root:

# Copy template
cp env.env.template env.env

# Edit env.env and add your Google API key:
GOOGLE_API_KEY=your-actual-api-key-here

Get your API key from: https://aistudio.google.com/app/apikey

For detailed setup instructions, see ENV_SETUP_GUIDE.md

Single Query Mode

Run the application with a single question.

Syntax

python main.py --config config.yaml --q "<Your Question Here>"

Example

python main.py --config config.yaml --q "Who is Professor Ganiz?"

Batch Evaluation Mode

Run batch evaluation on multiple test queries.

Syntax

python main.py --config config.yaml --batch <query_file.json> --k <coverage_k>

Example

python main.py --config config.yaml --batch eval_queries.json --k 5

Output: Results saved to evaluation/ folder

Usage Examples

Scenario 1: Staff Lookup (Single Query)

Querying for a specific professor's details.

python main.py --config config.yaml --q "Who is Professor Ganiz?"

Expected Output:

Intent.StaffLookup
===============================
Professor Murat Can Ganiz is a faculty member in the Computer Engineering department.

Office: M2-123
Email: mganiz@marmara.edu.tr
Research Areas: Machine Learning, Natural Language Processing

SOURCES:
[1] staff.txt:section1:100-250
===============================

Scenario 2: Course Information (Single Query)

Querying for specific course prerequisites or credits.

python main.py --config config.yaml --q "How many credits does CSE3063 have?"

Expected Output:

Intent.Course
===============================
CSE3063 (Object-Oriented Analysis and Design) is a 4-credit course.

Prerequisites: CSE2034
Description: This course covers object-oriented programming principles...

SOURCES:
[1] courses.txt:section2:500-750
===============================

Scenario 3: Batch Evaluation

Running systematic evaluation on multiple test queries.

python main.py --config config.yaml --batch eval_queries.json --k 5

Expected Output:

Running batch evaluation from: eval_queries.json
K value for coverage@k: 5
Loaded 5 evaluation queries
Evaluated: Who is Professor Ganiz?... (Intent: True, Coverage@5: 1.00, Latency: 1245ms)
Evaluated: What is the office of Murat Can Ganiz?... (Intent: True, Coverage@5: 1.00, Latency: 1189ms)
...

================================================================================
EVALUATION REPORT
================================================================================

Total Queries Evaluated: 5
K Value (for coverage@k): 5

--------------------------------------------------------------------------------
INTENT ACCURACY
--------------------------------------------------------------------------------
  Accuracy: 100.00% (5/5)

--------------------------------------------------------------------------------
COVERAGE@5
--------------------------------------------------------------------------------
  Average:  80.00%
  Median:   100.00%
  Min:      0.00%
  Max:      100.00%

--------------------------------------------------------------------------------
LATENCY (milliseconds)
--------------------------------------------------------------------------------
  Average:  1234 ms
  Median:   1210 ms
  Min:      987 ms
  Max:      1456 ms

Results saved to: evaluation/eval_results_20251218-120000.json
Report saved to: evaluation/eval_report_20251218-120000.json
================================================================================

5. Evaluation Query Format

The batch evaluation mode expects a JSON file with the following structure:

[
  {
    "question": "Who is Professor Ganiz?",
    "expected_intent": "StaffLookup",
    "expected_docs": ["staff"],
    "expected_answer": null
  },
  {
    "question": "How many credits does CSE3063 have?",
    "expected_intent": "Course",
    "expected_docs": ["course_catalog"],
    "expected_answer": null
  }
]

Field Descriptions:

  • question: The test question to evaluate
  • expected_intent: Expected intent classification (StaffLookup, Course, Registration, PolicyFAQ, Unknown)
  • expected_docs: List of expected relevant document IDs
  • expected_answer: (Optional) Expected answer text for accuracy evaluation
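
A file in this format can be loaded and sanity-checked with a few lines of Python. This is a hedged sketch, not the project's EvalQuery loader:

import json

REQUIRED_FIELDS = {"question", "expected_intent", "expected_docs"}

with open("eval_queries.json", "r", encoding="utf-8") as f:
    queries = json.load(f)

for i, query in enumerate(queries):
    missing = REQUIRED_FIELDS - query.keys()
    if missing:
        raise ValueError(f"Query {i} is missing required fields: {missing}")

print(f"Loaded {len(queries)} evaluation queries")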

6. Evaluation Metrics

The EvalHarness calculates the following metrics:

Intent Accuracy

Percentage of queries where the detected intent matches the expected intent.

Intent Accuracy = (Correct Intent Classifications) / (Total Queries)

Coverage@k

Measures how many of the expected relevant documents appear in the top-k retrieved results.

Coverage@k = (Expected Docs in Top-k) / (Total Expected Docs)
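
For example, the formula can be computed in a few lines. The following is a sketch of the metric itself, not the EvalHarness implementation:

def coverage_at_k(expected_docs, retrieved_docs, k):
    """Fraction of expected documents that appear among the top-k retrieved results."""
    if not expected_docs:
        return 0.0
    top_k = set(retrieved_docs[:k])
    hits = sum(1 for doc in expected_docs if doc in top_k)
    return hits / len(expected_docs)

# coverage_at_k(["staff"], ["staff", "courses", "policies"], k=5) -> 1.0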

Latency

Time taken (in milliseconds) to process the entire RAG pipeline for a query.

  • Average, median, min, and max latency are reported

Per-Intent Breakdown

All metrics are also computed per intent type for detailed analysis.

7. Iteration 2 New Features

7.1 Gemini AI Answer Agent

The system uses Google's Gemini API for natural language answer generation.

Benefits:

  • Natural, contextual answers
  • Better understanding of complex queries
  • Citation integration with source references

Configuration: Create env.env file in project root with your API key:

GOOGLE_API_KEY=your-api-key-here

The system automatically loads the API key from env.env at startup. See ENV_SETUP_GUIDE.md for detailed instructions.
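
For reference, loading the key and calling the API with python-dotenv and google-generativeai can look like the sketch below. The model name and prompt are illustrative, not necessarily what GeminiAnswerAgent uses.

import os
from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv("env.env")  # reads GOOGLE_API_KEY into the process environment
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model name
response = model.generate_content("Answer using only the provided context: ...")
print(response.text)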

7.2 Multiple Reranking Strategies

Choose from different reranking algorithms via config.yaml:

  • simple: Proximity-based scoring (default)
  • jaccard: Jaccard similarity coefficient
  • cosine: Cosine similarity with TF-IDF vectors

Example configuration:

strategies:
  reranker: "jaccard"  # or "simple", "cosine"

7.3 Batch Evaluation Framework

The EvalHarness provides systematic testing capabilities:

Features:

  • Load test queries from JSON
  • Run full pipeline for each query
  • Calculate performance metrics
  • Generate detailed reports
  • Export results for analysis

Output Files (saved in evaluation/ folder):

  • evaluation/eval_results_<timestamp>.json: Individual query results
  • evaluation/eval_report_<timestamp>.json: Aggregate metrics and statistics

7.4 Trace Logging

All pipeline stages are logged to JSONL files in the logs/ directory for debugging and analysis.

8. Installation

Step 1: Clone or Download the Project

Ensure all project files are in the same directory.

Step 2: Install Python Dependencies

pip install -r requirements.txt

Dependencies include:

  • PyYAML>=6.0: Configuration file parsing
  • google-generativeai>=0.3.0: Gemini API integration
  • python-dotenv>=0.19.0: Environment variable management

Step 3: Set Up Google API Key

Important: The system requires a Google API key for the Gemini AI answer agent.

Get Your API Key

  1. Visit: https://aistudio.google.com/app/apikey
  2. Sign in with your Google account
  3. Click "Create API key" or "Get API key"
  4. Copy the generated key

Configure in env.env File

  1. Copy the template file:

     cp env.env.template env.env

  2. Edit env.env and add your API key:

     GOOGLE_API_KEY=your-actual-api-key-here

  3. The system will automatically load the key from env.env at startup.

For detailed setup instructions, see: ENV_SETUP_GUIDE.md

Step 4: Verify Installation

python --version  # Should show 3.11 or higher
python -c "import yaml, google.generativeai"  # Test imports

Step 5: Run a Test Query

python main.py --config config.yaml --q "Who is Professor Ganiz?"

Step 6: (Optional) Run Batch Evaluation

python main.py --config config.yaml --batch eval_queries.json --k 5

9. Pipeline Architecture

The RAG pipeline follows a clear 5-stage flow, orchestrated by ChatBot:

User Question
     ↓
[1] Intent Detection (RuleIntentDetector)
     ↓ Intent
[2] Query Writing (HeuristicQueryWriter)
     ↓ Search Terms
[3] Retrieval (KeywordRetriever)
     ↓ Top-K Hits
[4] Reranking (Multiple Strategies Available)
     ↓ Ranked Hits
[5] Answer Generation (GeminiAnswerAgent)
     ↓
Final Answer with Citations
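
In code, the orchestration reduces to a straight pass through the five stages. The sketch below is a simplified controller with illustrative method names; the real ChatBot adds tracing and error handling.

class ChatBotSketch:
    """Simplified GRASP controller: every stage is an injected strategy object."""

    def __init__(self, intent_detector, query_writer, retriever, reranker, answer_agent):
        self.intent_detector = intent_detector
        self.query_writer = query_writer
        self.retriever = retriever
        self.reranker = reranker
        self.answer_agent = answer_agent

    def ask(self, question):
        intent = self.intent_detector.detect(question)        # [1] Intent Detection
        terms = self.query_writer.write(question, intent)     # [2] Query Writing
        hits = self.retriever.retrieve(terms)                 # [3] Retrieval
        ranked = self.reranker.rerank(terms, hits)            # [4] Reranking
        return self.answer_agent.answer(question, ranked)     # [5] Answer Generation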

Design Patterns Used

  • Strategy Pattern: All pipeline stages implement interfaces (I*) for easy swapping
  • Factory Pattern: RerankerFactory and AnswerAgentFactory create instances based on config (see the sketch after this list)
  • Observer Pattern: TraceBus publishes events to TraceObservers for logging
  • Controller Pattern: ChatBot coordinates the pipeline (GRASP)
  • Information Expert: Each entity knows its own data and operations (GRASP)
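
A config-driven factory can be as small as a name-to-class mapping. The placeholder classes below stand in for the implementations in services/, and the function name is illustrative:

class SimpleReranker: ...
class JaccardReranker: ...
class CosineReranker: ...

RERANKERS = {
    "simple": SimpleReranker,
    "jaccard": JaccardReranker,
    "cosine": CosineReranker,
}

def create_reranker(config):
    """Map the configured strategy name to a concrete class, failing loudly otherwise."""
    name = config["strategies"]["reranker"]
    try:
        return RERANKERS[name]()
    except KeyError:
        raise ValueError(f"Unknown reranker strategy: {name!r}")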

Stage Details

1. Intent Detection

  • Class: RuleIntentDetector
  • Logic: Keyword matching against intentRules in config (a matching sketch follows)
  • Output: One of StaffLookup, Course, Registration, PolicyFAQ, or Unknown when no rule matches
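
A minimal sketch of the rule-based approach, assuming intentRules maps intents to keyword lists and intentPriority breaks ties (illustrative, not RuleIntentDetector itself):

def detect_intent(question, intent_rules, intent_priority):
    """Count keyword hits per intent; break ties with the configured priority order."""
    words = question.lower().split()
    scores = {intent: sum(word in keywords for word in words)
              for intent, keywords in intent_rules.items()}
    best = max(scores.values(), default=0)
    if best == 0:
        return "Unknown"
    # First intent in the priority list that reached the top score wins the tie.
    return next((i for i in intent_priority if scores.get(i) == best), "Unknown")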

2. Query Writing

  • Class: HeuristicQueryWriter
  • Logic:
    • Remove stopwords from the user question
    • Add intent-specific booster terms
    • Validate input (raise ValueError for None parameters)
  • Output: List of optimized search terms (sketched below)
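
A sketch of the two-step heuristic, assuming stopwords and intent_rules come from config.yaml (illustrative, not HeuristicQueryWriter itself):

def write_query(question, intent, stopwords, intent_rules):
    """Drop stopwords from the question, then append a few intent-specific boosters."""
    if question is None or intent is None:
        raise ValueError("question and intent must not be None")
    terms = [w for w in question.lower().split() if w not in stopwords]
    boosters = intent_rules.get(intent, [])[:3]  # a handful of booster keywords
    return terms + [b for b in boosters if b not in terms]

# write_query("Who is Professor Ganiz?", "StaffLookup", {"who", "is"}, rules)
# -> ["professor", "ganiz?", "staff", ...]   (tokenization here is deliberately naive)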

3. Retrieval

  • Class: KeywordRetriever
  • Logic: Term Frequency (TF) lookup in inverted index
  • Parameters: retrieverK (number of documents to fetch)
  • Output: Top-K document chunks
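
A term-frequency retrieval sketch, assuming the inverted index maps each term to {chunk_id: tf} (illustrative, not KeywordRetriever itself):

def retrieve(terms, inverted_index, k=6):
    """Sum term frequencies per chunk and return the k highest-scoring chunk ids."""
    scores = {}
    for term in terms:
        for chunk_id, tf in inverted_index.get(term, {}).items():
            scores[chunk_id] = scores.get(chunk_id, 0) + tf
    return sorted(scores, key=scores.get, reverse=True)[:k]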

4. Reranking

  • Class: SimpleReranker
  • Logic:
    • Proximity scoring (terms close together score higher)
    • Title boost (terms in title get bonus)
  • Parameters: proximityWindow, proximityBonus, titleBoost
  • Output: Ranked list of hits with scores
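
The proximity idea can be sketched as follows, assuming the positions of matched terms within a chunk are known; window and bonus correspond to proximityWindow and proximityBonus above (illustrative, not SimpleReranker itself):

def proximity_bonus(term_positions, window=15, bonus=5):
    """Award a bonus for each adjacent pair of matched terms within `window` tokens."""
    positions = sorted(term_positions)
    return sum(bonus
               for a, b in zip(positions, positions[1:])
               if b - a <= window)

# proximity_bonus([3, 10, 40]) -> 5  (only the 3-10 pair falls within 15 tokens)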

5. Answer Generation

  • Class: TemplateAnswerAgent
  • Logic: Format top-ranked chunk into readable answer with citations
  • Output: Formatted answer string

10. Design Patterns & Principles

GRASP Patterns

  • Controller: ChatBot manages the entire pipeline flow
  • Information Expert: Each service knows its own domain (e.g., KeywordRetriever knows how to search)
  • Low Coupling: Services depend only on interfaces, not concrete classes
  • High Cohesion: Each class has a single, well-defined responsibility

SOLID Principles

  • Single Responsibility: Each service handles one pipeline stage
  • Open/Closed: New strategies can be added without modifying existing code
  • Liskov Substitution: Any implementation of an interface can replace another
  • Interface Segregation: Small, focused interfaces (e.g., IIntentDetector)
  • Dependency Inversion: High-level orchestrator depends on abstractions (interfaces)

Design Patterns Used

  • Strategy Pattern: Pluggable algorithms via service interfaces
  • Observer Pattern: Trace logging with TraceBus and observers
  • Factory Pattern: Configuration-driven service instantiation

11. Output & Logging

Answer Output

Results are appended to CSE3063F25_Grp15_Iter2_7_CLI_output.txt:

[2025-11-27 11:14:21] Q: Where is Alkaya?
A: Name: Ali Fuat ALKAYA
   Office: M2-249
   ...

Trace Logs

Detailed execution logs are written to logs/run-YYYYMMDD-HHMMSS.jsonl:

{"timestamp": "2025-11-27T08:14:21.992447Z", "stage": "IntentDetector", "input": "where is alkaya", "output": "StaffLookup", "durationMs": 0}
{"timestamp": "2025-11-27T08:14:21.992447Z", "stage": "QueryWriter", "input": "Intent: StaffLookup", "output": "['alkaya', 'staff', ...]", "durationMs": 0}
{"timestamp": "2025-11-27T08:14:21.993447Z", "stage": "Retriever", "input": ["alkaya", "staff", ...], "output": "Hits found: 6", "durationMs": 0}
{"timestamp": "2025-11-27T08:14:21.993447Z", "stage": "Reranker", "input": "Input Hits: 6", "output": "Top Score: 170.0", "durationMs": 0}
{"timestamp": "2025-11-27T08:14:21.993447Z", "stage": "AnswerAgent", "input": "Top Hit: staff", "output": "Name: Ali Fuat ALKAYA...", "durationMs": 0}

Each log entry contains:

  • timestamp: ISO 8601 format
  • stage: Pipeline stage name
  • input: Input to the stage
  • output: Output from the stage
  • durationMs: Execution time in milliseconds
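
Because each line is standalone JSON, the logs are easy to post-process. A short sketch (the filename is illustrative):

import json

with open("logs/run-20251127-111421.jsonl", "r", encoding="utf-8") as f:
    events = [json.loads(line) for line in f if line.strip()]

# e.g., find the slowest stage of the run
slowest = max(events, key=lambda e: e["durationMs"])
print(slowest["stage"], slowest["durationMs"], "ms")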

12. Differences from Java Version

While maintaining the same architecture and logic, the Python implementation uses language-appropriate idioms:

  • Data Structures: Python dict, list, set instead of Java collections
  • Enums: Python enum.Enum instead of Java enums
  • Interfaces: Python ABC (Abstract Base Classes) instead of Java interfaces
  • File I/O: Python's open() with context managers instead of Java's BufferedReader/Writer
  • Configuration: PyYAML library instead of Jackson
  • JSON: Python's built-in json module instead of Jackson
  • Exceptions: ValueError/TypeError for validation instead of custom exceptions
  • Type Hints: Python type annotations for better code documentation

13. Troubleshooting

Common Issues

Problem: ModuleNotFoundError: No module named 'yaml'

# Solution: Install dependencies
pip install -r requirements.txt

Problem: FileNotFoundError: config.yaml not found

# Solution: Ensure you're running from the project directory
cd python_version
python main.py --config config.yaml --q "your question"

Problem: Tests fail with import errors

# Solution: Install pytest and other test dependencies
pip install pytest pytest-cov

Problem: Coverage report shows "No data was collected"

# Solution: Run tests without the --cov=src flag
# The project structure doesn't use a 'src' directory
pytest tests/ --cov=. --cov-report=term-missing --cov-report=html

Problem: google.generativeai errors or API quota exceeded

  • Explanation: Gemini API requires a valid API key and has rate limits
  • Solution:
    1. Verify your API key is set correctly in env.env
    2. Check your API quota at https://aistudio.google.com/
    3. Consider using the template answer agent instead:
      strategies:
        answerAgent: "template"

Problem: Zero duration in logs

  • Explanation: Operations are very fast (< 1ms), so they round to 0
  • Note: This is expected behavior for the baseline implementation
