This repository contains multiple unsupervised anomaly detection techniques applied to various datasets. The goal is to detect outliers or anomalies in data using clustering, model-based, and statistical approaches.
The project is organized into three main directories:
- `clustering-based/` – Clustering-based anomaly detection
- `model-based/` – Model-based anomaly detection
- `statistical-methods/` – Statistical anomaly detection
Each directory contains notebooks covering the full workflow: data preprocessing, model implementation, evaluation, visualization, and saving results.
- Clustering-Based Anomaly Detection
- Model-Based Anomaly Detection
- Statistical Methods for Anomaly Detection
- Dependencies
This folder implements anomaly detection using unsupervised clustering algorithms.
Technique:
- Form clusters of similar data points using KMeans
- Compute distance from cluster centroids
- Points whose centroid distance falls in the top 5% (above the 95th percentile) are marked as anomalies
Dataset:
- Mall Customer Segmentation Dataset (features: Annual Income, Spending Score)
Steps Performed:
- Load dataset
- Scale numeric features
- Determine optimal cluster count using the Elbow Method
- Fit KMeans (k=5)
- Compute distances from centroids and determine anomaly threshold (95th percentile)
- Mark anomalies
- Visualize anomalies vs normal points
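The steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the synthetic data frame stands in for the Mall Customers features, and the column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for the Mall Customers features (column names assumed)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Annual Income (k$)": rng.normal(60, 20, 200),
    "Spending Score (1-100)": rng.normal(50, 25, 200),
})

# Scale numeric features
X = StandardScaler().fit_transform(df)

# Fit KMeans with k=5 (chosen via the Elbow Method in the notebook)
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

# Distance of each point to its own cluster centroid
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Points above the 95th-percentile distance are flagged as anomalies
threshold = np.percentile(dist, 95)
df["anomaly"] = dist > threshold
print("anomalies flagged:", df["anomaly"].sum())
```

The same `df["anomaly"]` mask can then drive the scatterplot coloring (red for anomalies, blue for normal points).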
Visualizations:
- Elbow curve for optimal cluster selection
- Scatterplot showing anomalies (red) vs normal points (blue)
Technique:
- DBSCAN detects dense regions as clusters and sparse regions as anomalies
- Works with irregularly shaped clusters and varying densities
Dataset:
- UCI Wholesale Customers Dataset (features: spending on Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen)
Steps Performed:
- Load and clean dataset
- Scale features using `StandardScaler`
- Plot a K-distance graph to estimate `eps`
- Apply DBSCAN (`eps ≈ 1.2`, `min_samples = 5`)
- Label points with cluster = -1 as anomalies
- Reduce to 2D using PCA
- Visualize anomalies vs normal points
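A minimal sketch of these steps, using synthetic 6-feature data in place of the Wholesale Customers dataset (the `eps ≈ 1.2` value is the one the notebook estimates from its K-distance graph; it would differ on other data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

# Two dense blobs plus a few sparse outliers (stand-in for the real features)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, (100, 6)),
    rng.normal(5, 0.3, (100, 6)),
    rng.uniform(-10, 15, (5, 6)),
])
X = StandardScaler().fit_transform(X)

# eps estimated from a K-distance graph in the notebook
db = DBSCAN(eps=1.2, min_samples=5).fit(X)

# DBSCAN labels noise points as -1; treat those as anomalies
anomalies = db.labels_ == -1
print("anomalies found:", anomalies.sum())

# Project to 2D with PCA for the anomaly-vs-normal scatterplot
X2 = PCA(n_components=2).fit_transform(X)
```

Note that DBSCAN needs no anomaly threshold: the `-1` noise label falls out of the density clustering itself.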
Advantages of DBSCAN:
- Automatically detects outliers
- Works for non-linear cluster shapes
- Handles irregular density patterns
Next Projects:
- Gaussian Mixture Model (GMM) Anomaly Detection
This folder includes unsupervised model-based techniques applied to the Credit Card Fraud Detection dataset (Kaggle).
Dataset Statistics:
| Metric | Value |
|---|---|
| Total rows | 284,807 |
| Fraud (Class=1) | 492 |
| Legit (Class=0) | 284,315 |
| Fraud Rate | ≈ 0.1727% |
Implemented Models:
Isolation Forest
- Goal: Isolate anomalies in extremely imbalanced datasets
- Steps:
  - Drop the target column, standardize features
  - Compute contamination from the true fraud ratio
  - Train `IsolationForest`
  - Predict anomalies
  - Evaluate with a confusion matrix, classification report, and PCA visualization
  - Save results
- Conclusion: Provides a solid unsupervised baseline; useful when labels are scarce.
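The Isolation Forest steps can be sketched like this; a small synthetic imbalanced frame stands in for the Kaggle credit-card data, and the feature values are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest

# Synthetic imbalanced data: many legit rows, a few extreme fraud rows
rng = np.random.default_rng(1)
n_legit, n_fraud = 2000, 4
legit = pd.DataFrame(rng.normal(0, 1, (n_legit, 5)))
fraud = pd.DataFrame(rng.normal(8, 1, (n_fraud, 5)))
df = pd.concat([legit, fraud], ignore_index=True)
df["Class"] = [0] * n_legit + [1] * n_fraud

# Drop the target, standardize features
X = StandardScaler().fit_transform(df.drop(columns="Class"))

# Contamination taken from the true fraud ratio
contamination = df["Class"].mean()

iso = IsolationForest(contamination=contamination, random_state=42).fit(X)
pred = (iso.predict(X) == -1).astype(int)  # -1 means anomaly; recode to 1

print("flagged:", pred.sum(),
      "| true frauds caught:", pred[df["Class"] == 1].sum())
```

Deriving `contamination` from the known fraud ratio is only possible because this dataset happens to be labeled; in a truly unlabeled setting it becomes a tuning parameter.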
Local Outlier Factor (LOF)
- Goal: Detect anomalies using local density deviation
- Steps:
  - Drop duplicates, separate features/labels, standardize features
  - Fit `LocalOutlierFactor` with contamination
  - Predict anomalies and compute LOF scores
  - Evaluate with a classification report and ROC-AUC
  - Visualize the LOF score distribution, PCA scatter, and ROC curve
  - Save results
- Limitations:
  - Fails on high-dimensional, extremely imbalanced data where anomalies are not density-based
  - Scores are training-only when `novelty=False`
- Conclusion: Useful only when anomalies are local density outliers.
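A compact sketch of the LOF steps on synthetic data (the real notebook runs on the credit-card features after dropping duplicates; sizes and values here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

# Dense normal cluster plus a small far-away group of anomalies
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (500, 4)), rng.normal(6, 0.5, (5, 4))])
y = np.array([0] * 500 + [1] * 5)  # 1 = anomaly

X = StandardScaler().fit_transform(X)

# With novelty=False (the default), fit_predict scores the training data only
lof = LocalOutlierFactor(n_neighbors=20, contamination=5 / 505)
pred = (lof.fit_predict(X) == -1).astype(int)

# negative_outlier_factor_: more negative means more anomalous, so negate it
scores = -lof.negative_outlier_factor_
print("ROC-AUC:", roc_auc_score(y, scores))
```

This also makes the `novelty=False` limitation concrete: `negative_outlier_factor_` exists only for the points the model was fitted on, so scoring new data would require refitting or `novelty=True`.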
This folder implements classical statistical anomaly detection techniques, including:
- Z-Score Based Detection (`zscore_outlier_detection.ipynb`) – Detects points beyond a threshold number of standard deviations from the mean
- IQR-Based Detection (`iqr_outlier_detection.ipynb`) – Detects points outside the interquartile range (1.5×IQR rule)
- Rolling Statistics for Time Series (`rolling_stats_anomaly_detection.ipynb`) – Detects anomalies as deviations from moving averages and rolling standard deviations
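The z-score and IQR rules are simple enough to sketch in a few lines; the series below is a toy example, and the z-score cutoff of 2 is one common choice for the threshold:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95, 12, 10, 11])  # 95 is an outlier

# Z-score rule: flag points more than 2 standard deviations from the mean
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 2]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(list(z_outliers), list(iqr_outliers))  # both rules flag 95
```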
Workflow:
- Load and clean dataset
- Compute statistical metrics
- Apply thresholds to flag anomalies
- Visualize anomalies with plots
- Save results
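As a concrete instance of this workflow for time series, a rolling-statistics sketch on synthetic data (the window size of 20 and the 3-sigma band are assumed values, not the notebook's):

```python
import numpy as np
import pandas as pd

# Synthetic time series: a sine wave with noise and one injected spike
rng = np.random.default_rng(3)
ts = pd.Series(np.sin(np.linspace(0, 20, 200)) + rng.normal(0, 0.1, 200))
ts.iloc[120] += 4  # inject one spike

# Compute rolling statistics
window = 20
roll_mean = ts.rolling(window, center=True).mean()
roll_std = ts.rolling(window, center=True).std()

# Flag points that leave the rolling mean ± 3*std band
anomaly = (ts - roll_mean).abs() > 3 * roll_std
print("anomalous indices:", list(ts.index[anomaly]))
```

The rolling mean tracks the seasonal shape of the series, so only deviations from the *local* level are flagged, which plain z-scores on the raw values would miss.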
Advantages:
- Simple and interpretable
- Fast for univariate and low-dimensional datasets
- Useful for exploratory analysis and baseline detection
All projects rely on standard Python data science libraries:
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
Additional dependencies per project may include:
- `scikit-learn` for clustering, Isolation Forest, and LOF
- `PCA` for dimensionality reduction
- `StandardScaler` for feature scaling
This repository provides a comprehensive collection of anomaly detection approaches, covering:
| Approach Type | Techniques Implemented | Use Case & Notes |
|---|---|---|
| Clustering-Based | KMeans, DBSCAN | Detect anomalies by distance from cluster centroids or density deviations; effective for multidimensional datasets with structure |
| Model-Based | Isolation Forest, LOF | Unsupervised detection in imbalanced datasets; Isolation Forest is robust, LOF shows limitations on high-dimensional rare events |
| Statistical Methods | Z-Score, IQR, Rolling Statistics | Fast, interpretable, univariate or low-dimensional datasets; good for baselines |
This repository can serve as a reference and hands-on practice resource for understanding anomaly detection from multiple perspectives: clustering, model-based, and statistical approaches.