This project was completed for the CS506 Spring 2024 Midterm, which was hosted on Kaggle as a private competition. The task was to detect fraudulent credit card transactions using machine learning. The dataset was highly imbalanced (~0.4% fraud), so it required careful feature engineering and model selection.
Kaggle Competition: CS506 Midterm 2024
Achievement: Placed in the Top 20 among all participants.
-
Exploratory Data Analysis (EDA)
- Identified extreme class imbalance.
- Visualized geographical clusters of fraud using geopandas.
- Found a strong correlation between fraud and high transaction amounts.
-
Feature Engineering
- Added age and distance (Haversine) features.
- Created average recent spend and fraudulent_day indicators.
- Applied k-means clustering to user and merchant locations.
- Performed label encoding and feature pruning using correlation analysis.
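The age and Haversine distance features listed above can be sketched as follows. This is a minimal illustration, not the code used in the notebooks. It assumes the distance is measured between a cardholder location and a merchant location, and that age is taken at transaction time. The function names `haversine_km` and `age_in_years` and the sample coordinates are hypothetical.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def age_in_years(date_of_birth, on_date):
    """Whole years elapsed between date_of_birth and on_date."""
    had_birthday = (on_date.month, on_date.day) >= (date_of_birth.month, date_of_birth.day)
    return on_date.year - date_of_birth.year - (0 if had_birthday else 1)


# Toy values (assumed): a cardholder in Boston and a merchant in New York.
distance = haversine_km(42.3601, -71.0589, 40.7128, -74.0060)
age = age_in_years(date(1990, 6, 15), date(2024, 3, 1))
print(f"distance_km={distance:.1f}, age={age}")
```

In a pandas workflow the same formula is usually vectorized with NumPy over the latitude and longitude columns rather than applied row by row.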
-
Modeling & Evaluation
- Tested Decision Tree, XGBoost, and KNN classifiers.
- Selected the Decision Tree: it performed best of the three and offered interpretability and robustness on the imbalanced data.
- Used GridSearchCV for hyperparameter tuning.
- Verified model stability across multiple validation splits.
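The tuning and stability check described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual configuration: the data is a synthetic imbalanced set from `make_classification`, and the parameter grid, the F1 scoring metric, and the fold counts are all assumed for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the competition data (assumption): about 2% positives.
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=0,
)

# Hypothetical search space; the grid used in the project is not documented here.
param_grid = {
    "max_depth": [3, 5, 8],
    "min_samples_leaf": [1, 5, 20],
    "class_weight": [None, "balanced"],
}

# Stratified folds keep the rare fraud class represented in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid, scoring="f1", cv=cv,
)
search.fit(X, y)

# Stability check: rescore the best configuration on differently shuffled splits.
scores_by_seed = []
for seed in (1, 2, 3):
    alt_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(search.best_estimator_, X, y, scoring="f1", cv=alt_cv)
    scores_by_seed.append(scores.mean())

print(search.best_params_)
print(scores_by_seed)
```

Mean scores that stay close across the reshuffled splits suggest the chosen hyperparameters are not tuned to one particular partition. Because the same data was used for the search, these scores are optimistic compared with a truly held-out set.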
-
Results
- Achieved a Top 20 rank on the Kaggle leaderboard.
- Maintained consistent performance across unseen validation sets.
-
Repository Contents
- explore.ipynb – Exploratory Data Analysis and visualizations.
- starter_code.ipynb – Initial setup and model experimentation.
- U48519832_Midterm_Report.pdf – Detailed report with methodology and findings.
- README.md – This file summarizing the project.
-
Key Learnings
- Handling highly imbalanced datasets is challenging and requires thoughtful feature engineering.
- Geographical features can add predictive power if transformed meaningfully.
- Decision Trees provided interpretable and robust performance for this task.
- Author: Mohit Sai Gutha
© 2024 Mohit Sai Gutha | CS506 Midterm Project