During this practical work, I explored how AI agents can automate data analysis tasks. I built and tested two different approaches to see which one works better for real-world data science problems. Think of it as having virtual data science assistants that can handle everything from data exploration to model training and report writing.
I used the famous Titanic dataset (predicting passenger survival) and a house pricing dataset to put both systems through their paces. The goal was simple: let the AI agents do the heavy lifting while I evaluate how well they perform and where they struggle.
System 1: Hidden Tools
This system works like a traditional pipeline with four specialized agents working in sequence. Each agent has specific tools at its disposal, but all the Python code runs in the background where you can't see it (a minimal sketch of how such a pipeline is wired up follows the list below).
How it works:
- Project Planner: Analyzes the business problem and decides on an approach
- Data Analyst: Explores the dataset using pandas (you get statistics, but don't see the actual code)
- Modeler: Trains machine learning models with scikit-learn
- Report Writer: Puts everything together into a nice LaTeX report
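To make the pipeline concrete, here is a minimal sketch of how four agents like these can be wired together with CrewAI's sequential process. The roles, goals, backstories, and task descriptions are illustrative placeholders, not the actual definitions from agents.py and crew_setup.py.

from crewai import Agent, Task, Crew, Process

# Illustrative agent definitions (the real ones live in agents.py)
planner = Agent(role="Project Planner",
                goal="Turn the business problem into a concrete analysis plan",
                backstory="Senior data scientist who scopes projects.")
analyst = Agent(role="Data Analyst",
                goal="Explore the dataset and summarize key statistics",
                backstory="Pandas specialist focused on exploratory analysis.")
modeler = Agent(role="Modeler",
                goal="Train and evaluate a scikit-learn model",
                backstory="ML engineer who builds solid baselines.")
writer = Agent(role="Report Writer",
               goal="Assemble the findings into a LaTeX report",
               backstory="Technical writer who produces clean reports.")

# One task per agent; in a sequential crew each task sees the previous output
tasks = [
    Task(description="Plan the analysis of data/titanic.csv",
         expected_output="A step-by-step analysis plan", agent=planner),
    Task(description="Explore the dataset and report key statistics",
         expected_output="A summary of the data", agent=analyst),
    Task(description="Train a classifier and report its metrics",
         expected_output="Model performance metrics", agent=modeler),
    Task(description="Write the final LaTeX report",
         expected_output="A complete .tex document", agent=writer),
]

crew = Crew(agents=[planner, analyst, modeler, writer],
            tasks=tasks,
            process=Process.sequential)  # agents run one after another
result = crew.kickoff()
print(result)

In the real System 1, the Data Analyst and Modeler also receive Python execution tools from tools.py, and that execution step is exactly the part that stays hidden from the user.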
The good parts:
- Super easy to use - just run it and wait for results
- Works autonomously without needing much intervention
- Produces clean, professional reports
The not-so-good parts:
- You can't see what's happening under the hood
- If something goes wrong, it's hard to debug
- Sometimes it makes mistakes (like including ID columns in training) and you won't notice until you check the results carefully
Files to run:
python main_classification.py # For Titanic survival prediction
python main_regression.py # For house price prediction
System 2: Code Interpreter
This one takes a completely different approach. Instead of hiding everything, it generates Python code that you can actually read, modify, and reuse. It's like having a coding buddy who writes the analysis for you (a sketch of the generate-execute-fix loop follows the list below).
How it works:
- Code Planner: Figures out what code needs to be written
- Code Generator: Actually writes complete Python scripts
- Code Executor: Runs the code and checks for errors
- Results Interpreter: Explains what the results mean
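Under the hood, System 2 is a generate-execute-fix loop. The sketch below is not the actual agents_code_interpreter.py implementation: generate_script() is a hypothetical stand-in for the LLM-backed Code Generator, and the loop simply shows how execution errors could be fed back so the agent can correct its own code.

import subprocess
import sys

def generate_script(task: str, previous_error: str | None = None) -> str:
    """Hypothetical stand-in for the Code Generator agent.
    In the real system this prompts the LLM, passing along any error
    from the previous attempt so it can fix its own code."""
    raise NotImplementedError

def run_with_self_correction(task: str, max_iterations: int = 5) -> str:
    error = None
    for attempt in range(1, max_iterations + 1):
        code = generate_script(task, previous_error=error)
        # Code Executor: run the generated script in a fresh interpreter
        result = subprocess.run([sys.executable, "-c", code],
                                capture_output=True, text=True, timeout=600)
        if result.returncode == 0:
            return result.stdout  # handed to the Results Interpreter
        error = result.stderr     # feed the traceback back to the generator
        print(f"Attempt {attempt} failed; retrying with the error message")
    raise RuntimeError("No working script within the iteration budget")

Each retry is a fresh LLM call, which is why the self-correction iterations cost extra tokens and time.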
The good parts:
- Total transparency - you see every line of code
- Can fix itself when it hits errors (I watched it correct 4 mistakes autonomously!)
- You can extract the code and use it for other projects
- Great for learning because you see the methodology
The not-so-good parts:
- Uses more API tokens because it generates longer responses
- Quality depends on how well you describe what you want
- Takes longer to run because of the self-correction iterations
Files to run:
python main_code_interpreter.py classification # For Titanic
python main_code_interpreter.py regression # For house prices
A mistake both systems made
Both systems initially made the same rookie mistake: they included the PassengerId column (just a number from 1 to 891) in the training features. This created fake correlations and inflated the accuracy scores. System 2 made it way easier to spot this bug because I could literally read the code line by line. With System 1, I had to dig through tool outputs to figure out what was happening.
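The fix is a one-liner: drop identifier columns before building the feature matrix. A minimal illustration (not the code either system actually generated):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/titanic.csv")

# PassengerId is just a row counter (1..891): it carries no real signal
# and lets the model latch onto spurious patterns, so it must never be
# part of the training features.
X = df.drop(columns=["PassengerId", "Survived"])
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)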
The coolest thing I observed was System 2's ability to debug itself. During one test, it hit four errors in a row:
- Syntax error with a broken f-string
- Warning about escape sequences
- Tried to extract titles from the Name column... after already deleting it
- Finally figured out it needed to extract titles BEFORE dropping columns
Each iteration consumed API tokens, but watching an AI agent reason through its mistakes and fix them was genuinely impressive.
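For reference, the ordering the agent finally converged on corresponds to something like the sketch below (my reconstruction, not the generated code; which other text columns get dropped alongside Name is an assumption):

import pandas as pd

df = pd.read_csv("data/titanic.csv")

# 1) Extract the title (Mr, Mrs, Miss, ...) while Name still exists
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# 2) Only then drop the raw text columns that won't be used as features
df = df.drop(columns=["Name", "Ticket", "Cabin"])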
I'm using OpenAI GPT-4o for this project; the paid API doesn't impose the strict rate limits that free services do. However, I did initially try Groq's free tier (100k tokens/day) and hit the limit pretty quickly - a single run with the self-correction iterations consumed about 40k tokens!
For production use or if you want to avoid API costs entirely, switching to Ollama with a local model would be the way to go. The code supports all these options through a simple config change in .env.
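Conceptually, the switch in llama_llm.py comes down to reading LLM_MODE and constructing the matching client. The sketch below is an assumption about how that dispatch could look, using CrewAI's LLM wrapper and LiteLLM-style model identifiers; the model names are illustrative, not necessarily the ones configured in the project.

import os
from crewai import LLM

def get_llm() -> LLM:
    """Build the LLM client selected by LLM_MODE in .env (illustrative)."""
    mode = os.getenv("LLM_MODE", "openai").lower()
    if mode == "openai":
        return LLM(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))
    if mode == "groq":
        return LLM(model="groq/llama-3.3-70b-versatile",
                   api_key=os.getenv("GROQ_API_KEY"))
    if mode == "huggingface":
        return LLM(model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
                   api_key=os.getenv("HUGGINGFACE_API_KEY"))
    if mode == "ollama":
        # Local model: no API key, just a running Ollama server
        return LLM(model="ollama/llama3.1", base_url="http://localhost:11434")
    raise ValueError(f"Unknown LLM_MODE: {mode}")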
You'll need Python 3.12 and an OpenAI API key (I'm using GPT-4o for this project).
# Clone and navigate to the project
cd TP_Agentic_AI
# Create virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1
# Install dependencies
pip install -r requirements.txt
# Configure your API key
# Edit .env and add your OPENAI_API_KEY
# The project is configured with LLM_MODE=openai by default
Run System 1 (Hidden Tools)
python main_classification.py
# Wait 3-5 minutes, generates outputs/titanic_report.tex
Run System 2 (Code Interpreter)
python main_code_interpreter.py classification
# Takes longer (5-10 min) but shows all code generation
# Generates outputs/titanic_code_report.tex
Compile the reports to PDF
# Using WSL with pdflatex installed
wsl pdflatex -interaction=nonstopmode outputs/titanic_report.tex
Project structure
TP_Agentic_AI/
├── agents.py # System 1 agents (4 agents with hidden tools)
├── agents_code_interpreter.py # System 2 agents (code generators)
├── crew_setup.py # System 1 task definitions
├── tools.py # Python execution tools for both systems
├── llama_llm.py # LLM configuration (OpenAI/Groq/Ollama)
├── main_classification.py # System 1 entry point (Titanic)
├── main_regression.py # System 1 entry point (House Prices)
├── main_code_interpreter.py # System 2 entry point (both datasets)
├── data/
│ ├── titanic.csv # Classification dataset (891 samples)
│ └── house_prices.csv # Regression dataset (20640 samples)
├── outputs/
│ ├── titanic_report.tex # System 1 classification report
│ └── titanic_code_report.tex # System 2 classification report
└── Analysis_Crew_Systems.pdf # Comparative analysis
For this project, I'm using OpenAI GPT-4o as the primary language model. The .env file is configured with:
LLM_MODE=openai
OPENAI_API_KEY=your_key_here
The system supports multiple LLM providers through llama_llm.py. You can switch by changing LLM_MODE in .env:
Option 1: OpenAI (Current Setup)
LLM_MODE=openai
OPENAI_API_KEY=sk-...
- Best quality and reliability
- Costs money but generous rate limits
- GPT-4o gives excellent results
Option 2: Groq (Free Alternative)
LLM_MODE=groq
GROQ_API_KEY=gsk_...
- Free tier with 100k tokens/day
- Fast inference with Llama 3.3 70B
- Hit rate limits during testing
Option 3: HuggingFace
LLM_MODE=huggingface
HUGGINGFACE_API_KEY=hf_...
- Access to Llama 3.3 70B Instruct
- Free tier available
- Good for experimentation
Option 4: Ollama (Local)
LLM_MODE=ollama
# No API key needed, runs on your machine
Tech stack
- CrewAI 1.6.1: Multi-agent orchestration framework
- OpenAI GPT-4o: Primary LLM used for testing both systems
- Python 3.12: Core programming language
- Pandas & Scikit-learn: Data analysis and machine learning
- LaTeX: Professional report generation
Note: The critical analysis (Analysis_Crew_Systems.pdf) contains a detailed comparison of both systems based on actual test results.