During this practical work, I explored how AI agents can automate data analysis tasks. I built and tested two different approaches to see which one works better for real-world data science problems. Think of it as having virtual data science assistants that can handle everything from data exploration to model training and report writing.
I used the famous Titanic dataset (predicting passenger survival) and a house pricing dataset to put both systems through their paces. The goal was simple: let the AI agents do the heavy lifting while I evaluate how well they perform and where they struggle.
System 1: Hidden Tools
This system works like a traditional pipeline with four specialized agents working in sequence. Each agent has specific tools at its disposal, but all the Python code runs in the background where you can't see it (a minimal sketch of how such a pipeline is wired up follows the list below).
How it works:
- Project Planner: Analyzes the business problem and decides on an approach
- Data Analyst: Explores the dataset using pandas (you get statistics, but don't see the actual code)
- Modeler: Trains machine learning models with scikit-learn
- Report Writer: Puts everything together into a nice LaTeX report
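To make the pipeline concrete, here is a minimal sketch of how four agents like these can be wired together with CrewAI's sequential process. The roles, goals, backstories, and task descriptions are illustrative placeholders, not the actual definitions from agents.py and crew_setup.py.

from crewai import Agent, Task, Crew, Process

# Illustrative agent definitions (the real ones live in agents.py)
planner = Agent(role="Project Planner",
                goal="Turn the business problem into a concrete analysis plan",
                backstory="Senior data scientist who scopes projects.")
analyst = Agent(role="Data Analyst",
                goal="Explore the dataset and summarize key statistics",
                backstory="Pandas specialist focused on exploratory analysis.")
modeler = Agent(role="Modeler",
                goal="Train and evaluate a scikit-learn model",
                backstory="ML engineer who builds solid baselines.")
writer = Agent(role="Report Writer",
               goal="Assemble the findings into a LaTeX report",
               backstory="Technical writer who produces clean reports.")

# One task per agent; in a sequential crew each task sees the previous output
tasks = [
    Task(description="Plan the analysis of data/titanic.csv",
         expected_output="A step-by-step analysis plan", agent=planner),
    Task(description="Explore the dataset and report key statistics",
         expected_output="A summary of the data", agent=analyst),
    Task(description="Train a classifier and report its metrics",
         expected_output="Model performance metrics", agent=modeler),
    Task(description="Write the final LaTeX report",
         expected_output="A complete .tex document", agent=writer),
]

crew = Crew(agents=[planner, analyst, modeler, writer],
            tasks=tasks,
            process=Process.sequential)  # agents run one after another
result = crew.kickoff()
print(result)

In the real System 1, the Data Analyst and Modeler also receive Python execution tools from tools.py, and that execution step is exactly the part that stays hidden from the user.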
The good parts:
- Super easy to use - just run it and wait for results
- Works autonomously without needing much intervention
- Produces clean, professional reports
The not-so-good parts:
- You can't see what's happening under the hood
- If something goes wrong, it's hard to debug
- Sometimes it makes mistakes (like including ID columns in training) and you won't notice until you check the results carefully
Files to run:
python main_classification.py # For Titanic survival prediction
python main_regression.py # For house price prediction
System 2: Code Interpreter
This one takes a completely different approach. Instead of hiding everything, it generates Python code that you can actually read, modify, and reuse. It's like having a coding buddy who writes the analysis for you (a sketch of the generate-execute-fix loop follows the list below).
How it works:
- Code Planner: Figures out what code needs to be written
- Code Generator: Actually writes complete Python scripts
- Code Executor: Runs the code and checks for errors
- Results Interpreter: Explains what the results mean
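Under the hood, System 2 is a generate-execute-fix loop. The sketch below is not the actual agents_code_interpreter.py implementation: generate_script() is a hypothetical stand-in for the LLM-backed Code Generator, and the loop simply shows how execution errors could be fed back so the agent can correct its own code.

import subprocess
import sys

def generate_script(task: str, previous_error: str | None = None) -> str:
    """Hypothetical stand-in for the Code Generator agent.
    In the real system this prompts the LLM, passing along any error
    from the previous attempt so it can fix its own code."""
    raise NotImplementedError

def run_with_self_correction(task: str, max_iterations: int = 5) -> str:
    error = None
    for attempt in range(1, max_iterations + 1):
        code = generate_script(task, previous_error=error)
        # Code Executor: run the generated script in a fresh interpreter
        result = subprocess.run([sys.executable, "-c", code],
                                capture_output=True, text=True, timeout=600)
        if result.returncode == 0:
            return result.stdout  # handed to the Results Interpreter
        error = result.stderr     # feed the traceback back to the generator
        print(f"Attempt {attempt} failed; retrying with the error message")
    raise RuntimeError("No working script within the iteration budget")

Each retry is a fresh LLM call, which is why the self-correction iterations cost extra tokens and time.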
The good parts:
- Total transparency - you see every line of code
- Can fix itself when it hits errors (I watched it correct 4 mistakes autonomously!)
- You can extract the code and use it for other projects
- Great for learning because you see the methodology
The not-so-good parts:
- Uses more API tokens because it generates longer responses
- Quality depends on how well you describe what you want
- Takes longer to run because of the self-correction iterations
Files to run:
python main_code_interpreter.py classification # For Titanic
python main_code_interpreter.py regression # For house prices
A mistake both systems made
Both systems initially made the same rookie mistake: they included the PassengerId column (just a number from 1 to 891) in the training features. This created fake correlations and inflated the accuracy scores. System 2 made it way easier to spot this bug because I could literally read the code line by line. With System 1, I had to dig through tool outputs to figure out what was happening.
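The fix is a one-liner: drop identifier columns before building the feature matrix. A minimal illustration (not the code either system actually generated):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/titanic.csv")

# PassengerId is just a row counter (1..891): it carries no real signal
# and lets the model latch onto spurious patterns, so it must never be
# part of the training features.
X = df.drop(columns=["PassengerId", "Survived"])
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)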
The coolest thing I observed was System 2's ability to debug itself. During one test, it hit four errors in a row:
- Syntax error with a broken f-string
- Warning about escape sequences
- Tried to extract titles from the Name column... after already deleting it
- Finally figured out it needed to extract titles BEFORE dropping columns
Each iteration consumed API tokens, but watching an AI agent reason through its mistakes and fix them was genuinely impressive.
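For reference, the ordering the agent finally converged on corresponds to something like the sketch below (my reconstruction, not the generated code; which other text columns get dropped alongside Name is an assumption):

import pandas as pd

df = pd.read_csv("data/titanic.csv")

# 1) Extract the title (Mr, Mrs, Miss, ...) while Name still exists
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# 2) Only then drop the raw text columns that won't be used as features
df = df.drop(columns=["Name", "Ticket", "Cabin"])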
I'm using OpenAI GPT-4o for this project; the paid API doesn't impose the strict rate limits that free services do. However, I did initially try Groq's free tier (100k tokens/day) and hit the limit pretty quickly - a single run with the self-correction iterations consumed about 40k tokens!
For production use or if you want to avoid API costs entirely, switching to Ollama with a local model would be the way to go. The code supports all these options through a simple config change in .env.
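Conceptually, the switch in llama_llm.py comes down to reading LLM_MODE and constructing the matching client. The sketch below is an assumption about how that dispatch could look, using CrewAI's LLM wrapper and LiteLLM-style model identifiers; the model names are illustrative, not necessarily the ones configured in the project.

import os
from crewai import LLM

def get_llm() -> LLM:
    """Build the LLM client selected by LLM_MODE in .env (illustrative)."""
    mode = os.getenv("LLM_MODE", "openai").lower()
    if mode == "openai":
        return LLM(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))
    if mode == "groq":
        return LLM(model="groq/llama-3.3-70b-versatile",
                   api_key=os.getenv("GROQ_API_KEY"))
    if mode == "huggingface":
        return LLM(model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
                   api_key=os.getenv("HUGGINGFACE_API_KEY"))
    if mode == "ollama":
        # Local model: no API key, just a running Ollama server
        return LLM(model="ollama/llama3.1", base_url="http://localhost:11434")
    raise ValueError(f"Unknown LLM_MODE: {mode}")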
You'll need Python 3.12 and an OpenAI API key (I'm using GPT-4o for this project).
# Clone and navigate to the project
cd TP_Agentic_AI
# Create virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1
# Install dependencies
pip install -r requirements.txt
# Configure your API key
# Edit .env and add your OPENAI_API_KEY
# The project is configured with LLM_MODE=openai by default
Run System 1 (Hidden Tools)
python main_classification.py
# Wait 3-5 minutes, generates outputs/titanic_report.tex
Run System 2 (Code Interpreter)
python main_code_interpreter.py classification
# Takes longer (5-10 min) but shows all code generation
# Generates outputs/titanic_code_report.tex
Compile the reports to PDF
# Using WSL with pdflatex installed
wsl pdflatex -interaction=nonstopmode outputs/titanic_report.tex
Project structure
TP_Agentic_AI/
├── agents.py # System 1 agents (4 agents with hidden tools)
├── agents_code_interpreter.py # System 2 agents (code generators)
├── crew_setup.py # System 1 task definitions
├── tools.py # Python execution tools for both systems
├── llama_llm.py # LLM configuration (OpenAI/Groq/Ollama)
├── main_classification.py # System 1 entry point (Titanic)
├── main_regression.py # System 1 entry point (House Prices)
├── main_code_interpreter.py # System 2 entry point (both datasets)
├── data/
│ ├── titanic.csv # Classification dataset (891 samples)
│ └── house_prices.csv # Regression dataset (20640 samples)
├── outputs/
│ ├── titanic_report.tex # System 1 classification report
│ └── titanic_code_report.tex # System 2 classification report
└── Analysis_Crew_Systems.pdf # Comparative analysis
For this project, I'm using OpenAI GPT-4o as the primary language model. The .env file is configured with:
LLM_MODE=openai
OPENAI_API_KEY=your_key_here
The system supports multiple LLM providers through llama_llm.py. You can switch by changing LLM_MODE in .env:
Option 1: OpenAI (Current Setup)
LLM_MODE=openai
OPENAI_API_KEY=sk-...
- Best quality and reliability
- Costs money but generous rate limits
- GPT-4o gives excellent results
Option 2: Groq (Free Alternative)
LLM_MODE=groq
GROQ_API_KEY=gsk_...
- Free tier with 100k tokens/day
- Fast inference with Llama 3.3 70B
- Hit rate limits during testing
Option 3: HuggingFace
LLM_MODE=huggingface
HUGGINGFACE_API_KEY=hf_...
- Access to Llama 3.3 70B Instruct
- Free tier available
- Good for experimentation
Option 4: Ollama (Local)
LLM_MODE=ollama
# No API key needed, runs on your machine
Tech stack
- CrewAI 1.6.1: Multi-agent orchestration framework
- OpenAI GPT-4o: Primary LLM used for testing both systems
- Python 3.12: Core programming language
- Pandas & Scikit-learn: Data analysis and machine learning
- LaTeX: Professional report generation
Note: The critical analysis (Analysis_Crew_Systems.pdf) contains a detailed comparison of both systems based on actual test results.