Welcome to the IEEE Fraud Detection Project! This project leverages machine learning techniques to detect fraudulent e-commerce transactions using real-world data from the IEEE-CIS Fraud Detection competition.
- Overview
- Features
- Project Files & Notebooks
- Installation
- Local Development
- API Usage
- Dockerization & Deployment
- Testing
- Future Improvements
## Overview

This project implements a full-stack fraud detection solution using Python. It includes:

- **Data Preprocessing & Feature Engineering:** Handling missing values, extracting time features, grouping email domains, processing address and distance information, and aggregating binary flags (see the sketch after this list).
- **Machine Learning Models:** Multiple models were developed and compared (XGBoost, LightGBM, Random Forest, and a Neural Network prototype), with the boosting models (XGBoost/LightGBM) achieving strong AUC (up to 0.949) and balanced precision/recall performance.
- **API Backend:** A FastAPI backend serves predictions. It processes incoming JSON data (transaction and identity tables), applies preprocessing and feature engineering, and returns a fraud probability.
- **Frontend:** A simple Streamlit-based frontend allows users to input data and view predictions, demonstrating an end-to-end solution.
- **Dockerization & Deployment:** The application is containerized using Docker and deployed on Google Cloud Run, making it available online: https://fraud-detection-frontend-x2ugjgse3q-uc.a.run.app
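As a rough illustration of the preprocessing steps listed above, here is a minimal sketch of time-feature extraction and email-domain grouping. The column names `TransactionDT` and `P_emaildomain` come from the IEEE-CIS dataset; the function names and the provider mapping are illustrative assumptions, and the project's actual logic lives in data_processing.py and feature_engineering.py.

```python
import pandas as pd

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive hour-of-day and day-of-week from the TransactionDT offset (seconds)."""
    df = df.copy()
    df["hour"] = (df["TransactionDT"] // 3600) % 24
    df["dayofweek"] = (df["TransactionDT"] // (3600 * 24)) % 7
    return df

def group_email_domain(df: pd.DataFrame, col: str = "P_emaildomain") -> pd.DataFrame:
    """Collapse raw email domains into a small set of provider groups."""
    df = df.copy()
    provider_map = {
        "gmail.com": "google",
        "googlemail.com": "google",
        "yahoo.com": "yahoo",
        "ymail.com": "yahoo",
        "hotmail.com": "microsoft",
        "outlook.com": "microsoft",
    }
    df[col + "_group"] = df[col].map(provider_map).fillna("other")
    return df
```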
## Features

- **Robust Data Processing:** Handles a variety of feature types (numerical, categorical, binary) and performs extensive feature engineering.
- **Modeling:** Implements gradient boosting models (XGBoost and LightGBM) with competitive performance and an initial Random Forest baseline.
- **End-to-End Pipeline:** From data ingestion to API-based inference, ensuring consistency across training and production (a rough sketch of the wiring follows this list).
- **Interactive Frontend:** A Streamlit-based UI for demoing predictions interactively.
- **Production-Ready Deployment:** Dockerized application deployed on Google Cloud Run.
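The end-to-end wiring can be pictured roughly as follows: the /predict endpoint rebuilds the raw tables from the request, reuses the training-time preprocessing, and scores with the trained booster. This is only a sketch; the helper names in the comments and the response key are hypothetical, and the real entry point is app/backend/main.py (see the uvicorn command under Local Development).

```python
import pandas as pd
from fastapi import FastAPI

app = FastAPI()

@app.post("/predict")
def predict(payload: dict) -> dict:
    # Rebuild the two raw tables from the request body.
    transaction = pd.DataFrame([payload["transaction_table"]])
    identity = pd.DataFrame([payload["identity_table"]])

    # In the real service, the same preprocessing/feature engineering used at
    # training time would run here, followed by the trained model, e.g.:
    #   features = engineer_features(merge_tables(transaction, identity))
    #   probability = float(model.predict_proba(features)[:, 1][0])
    probability = 0.0  # placeholder so this sketch runs standalone

    return {"fraud_probability": probability}
```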
## Project Files & Notebooks

The project is organized into several key files and folders to facilitate development and experimentation:

- **data_processing.py**: Responsible for loading, merging, and orchestrating data processing, along with related functions.
- **feature_engineering.py**: Contains methods for encoding, transforming features, and other feature engineering techniques.
- **EDA.ipynb**: A Jupyter notebook for Exploratory Data Analysis (EDA) of the raw data before processing.
- **FeatureEngineering.ipynb**: Explores feature engineering in detail, including close-up analysis of individual features and the feature importances from the applied models.
- **ModelDevelopment.ipynb**: Notebook for training various models and comparing their performance.
- **helpers.py**: A collection of helper functions used throughout the project (a hedged sketch of the model save/load helpers follows this list).
- **models/**: Contains model-related Python files and a config.py file that stores model configurations.
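helpers.py is described later (under Testing) as covering model saving/loading and evaluation. Below is a minimal sketch of what such helpers might look like, assuming joblib for persistence; the function names are hypothetical, not the repository's exact API.

```python
from pathlib import Path

import joblib

def save_model(model, path: str) -> None:
    """Persist a fitted model to disk, creating parent directories as needed."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, path)

def load_model(path: str):
    """Restore a previously saved model."""
    return joblib.load(path)
```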
## Installation

- **Clone the Repository:**

  ```bash
  git clone https://github.com/elnurisg/ieee-fraud-detection.git
  ```

- **Set Up Virtual Environment:**

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```

- **Install Dependencies:**

  ```bash
  pip install --upgrade pip
  pip install -r requirements.txt
  ```

## Local Development

**Running the Backend:**
Navigate to the project root and run:

```bash
uvicorn app.backend.main:app --reload
```

The API will be available at http://localhost:8000.

**Running the Frontend:**
Navigate to the app/frontend directory and run:

```bash
cd app/frontend
streamlit run app.py
```

This opens a browser window with the Streamlit app.

**Running Tests:**
From the project root, run:

```bash
pytest
```

## API Usage

### Endpoints
- **GET /**
  Returns a welcome message.
- **GET /health**
  Health check endpoint that returns the status of the API.
- **POST /predict**
  Accepts a JSON payload with two keys: `transaction_table` and `identity_table` (see the example request after this list).

  Example payload:

  ```
  {
      "transaction_table": { ... },
      "identity_table": { ... }
  }
  ```

  Response: Returns the predicted fraud probability.
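An example request from Python, assuming the backend is running locally on port 8000. The two tables are left empty here because their fields follow the IEEE-CIS schema (e.g., TransactionAmt, card1, ...) and are elided in the payload above; fill them in with real values before calling.

```python
import requests

payload = {
    "transaction_table": {},  # populate with transaction fields from the IEEE-CIS schema
    "identity_table": {},     # populate with identity fields
}

response = requests.post("http://localhost:8000/predict", json=payload)
response.raise_for_status()
print(response.json())  # the predicted fraud probability
```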
## Dockerization & Deployment

**Dockerization:**
The project is containerized using a Dockerfile located at the root of the repository. To build and run locally:

```bash
docker build -t fraud-api .
docker run -p 8000:8000 fraud-api
```

**Deployment:**
The application is deployed on Google Cloud Run. Use the provided deploy.sh script to build, push, and deploy your container:

```bash
bash deploy.sh
```

## Testing

Unit tests are written using pytest and are located in the tests/ directory. They cover:
- Data processing and merging
- Feature engineering functions
- Helper utilities for model saving/loading and evaluation
- API endpoints using FastAPI’s TestClient (a minimal example is sketched below)
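Here is a minimal example of the API-test style mentioned in the last bullet, using FastAPI's TestClient. The import path follows the uvicorn command above (app.backend.main:app); the test names and assertions are illustrative, not the repository's actual tests.

```python
# tests/test_api_example.py (illustrative)
from fastapi.testclient import TestClient

from app.backend.main import app

client = TestClient(app)

def test_root_returns_welcome_message():
    response = client.get("/")
    assert response.status_code == 200

def test_health_check():
    response = client.get("/health")
    assert response.status_code == 200
```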
To run the tests, execute:

```bash
pytest
```

## Future Improvements
- **Model Tuning & Ensembling:** Further optimize hyperparameters and possibly ensemble multiple models (e.g., stacking XGBoost and LightGBM).
- **Advanced Feature Engineering:** Explore additional feature interactions, frequency encoding, and domain-specific transformations (a frequency-encoding sketch follows this list).
- **Neural Network Models:** Experiment with MLPs or more advanced neural architectures for tabular data.
- **Enhanced Frontend:** Expand the Streamlit app with more interactive visualizations and a more user-friendly interface.
- **CI/CD & Monitoring:** Implement CI/CD (e.g., with GitHub Actions) and integrate monitoring/logging for production readiness.
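For the frequency-encoding idea above, a minimal pandas sketch (the function is not part of the current codebase, and the column name in the usage comment is just an example from the IEEE-CIS schema):

```python
import pandas as pd

def frequency_encode(train: pd.DataFrame, test: pd.DataFrame, col: str) -> None:
    """Add a column recording how often each category occurs in the training set."""
    freq = train[col].value_counts(normalize=True)
    train[col + "_freq"] = train[col].map(freq)
    test[col + "_freq"] = test[col].map(freq).fillna(0)

# Usage example (column name is illustrative):
# frequency_encode(train_df, test_df, "card1")
```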