
Customer segmentation pipeline comparing K-Means vs GMM clustering on £6.7M retail data. 10 RFM segments, Hopkins-validated clustering, 61 tests, CI/CD. Built with scikit-learn.


leelesemann-sys/rfm-customer-segmentation


RFM Customer Segmentation - ML-Driven Approach


End-to-end customer analytics pipeline that segments 4,290 customers from £6.7M in retail transactions using rule-based RFM scoring and unsupervised clustering. Includes a systematic comparison of K-Means vs Gaussian Mixture Model across multiple preprocessing strategies.


Business Impact

| Metric | Value |
|---|---|
| Total revenue analyzed | £6.7M across 394k transactions |
| Champions identified | 1,127 customers generating £4.4M (65%) |
| Revenue at risk | £575k from 512 customers (At Risk plus Can't Lose Them) |
| Actionable segments | 10, each with a tailored marketing strategy |

What Makes This Project Different

Most RFM analyses on this dataset stop at K-Means with default settings. This project goes further:

  1. Algorithm comparison: K-Means vs GMM, systematically benchmarked
  2. Preprocessing comparison: log-transform vs Yeo-Johnson power transform
  3. Statistical validation: a Hopkins statistic of 0.956 indicates a strong clustering tendency before any algorithm is run
  4. Production-ready code: a reusable RFMPipeline class, not just a notebook
  5. Tested and automated: 61 unit tests, GitHub Actions CI across Python 3.10-3.12
  6. Iterative development: v1.0 baseline, then v2.0 adding the multi-algorithm comparison through a documented PR
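
To make the Hopkins check in item 3 concrete, here is a minimal sketch of one common formulation of the statistic. It is not taken from the repository: the function name `hopkins_statistic`, the 10% sample fraction, and the synthetic data are all assumptions for illustration. Under this convention, values near 0.5 suggest uniformly scattered data and values close to 1 suggest a clustering tendency.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def hopkins_statistic(X, sample_frac=0.1, seed=0):
    """Hypothetical helper: Hopkins statistic as sum(u) / (sum(u) + sum(w))."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = max(1, int(sample_frac * n))
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # w: distance from m sampled real points to their nearest *other* real point
    # (column 0 is the point itself at distance 0, so take column 1).
    idx = rng.choice(n, size=m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

    # u: distance from m uniform points in the data's bounding box
    # to the nearest real point.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]

    return u.sum() / (u.sum() + w.sum())


# Synthetic demo data (not the retail dataset): three tight blobs vs. pure noise.
rng = np.random.default_rng(1)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
clustered = np.vstack([rng.normal(c, 0.05, size=(200, 2)) for c in centers])
noise = rng.uniform(0.0, 5.0, size=(600, 2))

h_clustered = hopkins_statistic(clustered)
h_noise = hopkins_statistic(noise)
print(f"clustered: {h_clustered:.3f}  uniform: {h_noise:.3f}")
```

Some references define the statistic the other way round (values near 0 meaning clustered), so it is worth checking which convention a given implementation uses before comparing numbers.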

Quick Start

git clone https://github.com/leelesemann-sys/rfm-customer-segmentation.git
cd rfm-customer-segmentation
pip install -r requirements.txt

python run_pipeline.py                    # Run full pipeline with defaults
python run_pipeline.py --k 5             # Try different cluster counts

Results

RFM Segments (Rule-Based)

[Figure: RFM Segment Overview]

| Priority | Segment | Customers | Revenue | Recommended Action |
|---|---|---|---|---|
| High | Champions | 1,127 | £4.4M | VIP programs, loyalty rewards |
| High | At Risk | 453 | £508k | Win-back campaigns, 20% discount |
| High | Can't Lose Them | 59 | £67k | Personal outreach, account managers |
| Medium | Loyal Customers | 802 | £994k | Upselling, cross-sell |
| Medium | New Customers | 136 | £44k | Onboarding, next purchase incentive |
| Low | Lost | 798 | £294k | Low-cost reactivation only |
| Low | Hibernating | 399 | £179k | Mass email campaigns |

The table lists seven of the 10 segments.
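
Rule-based segmentation of this kind usually comes down to a small lookup over the 1-5 R/F/M scores. The sketch below shows the general shape only. The function `assign_segment` and every threshold in it are hypothetical and are not the rules used by this project. The segment names come from the table above, with "Other" standing in for segments not listed there.

```python
def assign_segment(r: int, f: int, m: int) -> str:
    """Map 1-5 R/F/M scores to a segment name (illustrative thresholds only)."""
    if r >= 4 and f >= 4 and m >= 4:
        return "Champions"          # recent, frequent, high spend
    if r <= 2 and f >= 4 and m >= 4:
        return "Can't Lose Them"    # formerly top customers who have gone quiet
    if r <= 2 and f >= 3:
        return "At Risk"            # used to buy regularly, not seen lately
    if r >= 3 and f >= 3:
        return "Loyal Customers"
    if r >= 4 and f == 1:
        return "New Customers"      # recent first purchase
    if r == 1 and f <= 2:
        return "Lost"
    if r == 2 and f <= 2:
        return "Hibernating"
    return "Other"                  # placeholder for segments not shown here


for scores in [(5, 5, 5), (1, 5, 5), (2, 3, 2), (5, 1, 1), (1, 1, 1)]:
    print(scores, "->", assign_segment(*scores))
```

Rules are checked top to bottom, so the order matters: the more specific high-value conditions come before the broader ones.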

Algorithm Comparison

[Figure: Algorithm Comparison]

| Algorithm | Transform | Silhouette | Davies-Bouldin | Winner? |
|---|---|---|---|---|
| K-Means | log | 0.380 | 0.857 | Best |
| K-Means | Yeo-Johnson | 0.338 | 1.019 | |
| GMM | Yeo-Johnson | 0.197 | 1.768 | |
| GMM | log | 0.112 | 1.851 | |

Key insight: Contrary to Shobayo et al. (2023), who found GMM superior (Silhouette 0.80 vs 0.62), K-Means outperforms GMM on this dataset on both metrics (higher Silhouette, lower Davies-Bouldin). The log-transform makes the RFM features approximately spherical, which is the cluster shape K-Means assumes. GMM's flexibility (elliptical clusters) adds complexity without improving separation.
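
A comparison grid like the one above can be sketched with scikit-learn as follows. This is an illustrative setup, not the repository's code: the synthetic data, the use of `log1p` followed by `StandardScaler` for the log variant, K=4, and the random seeds are all assumptions. Scores on this synthetic data say nothing about the figures reported in the table.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Synthetic, right-skewed stand-in for RFM features (not the retail data).
rng = np.random.default_rng(42)
rfm = np.column_stack([
    rng.integers(1, 366, size=500).astype(float),  # recency in days
    rng.geometric(0.3, size=500).astype(float),    # purchase count
    rng.lognormal(6.0, 1.0, size=500),             # total spend
])

# Two preprocessing variants; PowerTransformer standardizes by default.
transforms = {
    "log": StandardScaler().fit_transform(np.log1p(rfm)),
    "yeo-johnson": PowerTransformer(method="yeo-johnson").fit_transform(rfm),
}

# Two clustering models with the same number of clusters.
models = {
    "K-Means": lambda: KMeans(n_clusters=4, n_init=10, random_state=42),
    "GMM": lambda: GaussianMixture(n_components=4, random_state=42),
}

results = []
for model_name, make_model in models.items():
    for transform_name, X in transforms.items():
        labels = make_model().fit_predict(X)
        results.append({
            "algorithm": model_name,
            "transform": transform_name,
            "silhouette": silhouette_score(X, labels),          # higher is better
            "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
        })

for row in sorted(results, key=lambda r: -r["silhouette"]):
    print(row)
```

Both metrics are computed in the transformed feature space, so scores are only comparable across algorithms within the same transform.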

K-Means Clusters (K=4)

[Figure: K-Means Comparison]

| Cluster | Size | Avg. Recency | Avg. Purchases | Avg. Spend |
|---|---|---|---|---|
| Inactive | 921 | 260 days | 1 | £356 |
| Regular | 1,341 | 59 days | 1 | £359 |
| VIP Regulars | 1,434 | 47 days | 4 | £1,442 |
| Super VIPs | 594 | 19 days | 15 | £6,457 |
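
A cluster profile table of this shape can be produced by fitting on the transformed features and then averaging the raw values per cluster. The sketch below assumes hypothetical column names (`recency`, `frequency`, `monetary`) and synthetic data. The business labels such as "Super VIPs" would be assigned by hand after inspecting the profile.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic customer table (not the retail data); column names are assumptions.
rng = np.random.default_rng(0)
rfm = pd.DataFrame({
    "recency": rng.integers(1, 366, size=300),
    "frequency": rng.integers(1, 20, size=300),
    "monetary": rng.lognormal(6.0, 1.0, size=300).round(2),
})

# Cluster on log-transformed, standardized features...
X = StandardScaler().fit_transform(np.log1p(rfm))
rfm["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# ...but describe each cluster in the original, interpretable units.
profile = (
    rfm.groupby("cluster")
    .agg(
        size=("recency", "size"),
        avg_recency=("recency", "mean"),
        avg_purchases=("frequency", "mean"),
        avg_spend=("monetary", "mean"),
    )
    .round(1)
)
print(profile)
```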

Dashboards

[Figures: Executive Summary, K-Means Elbow]


Methodology

Raw Data (541k rows)
    │
    ├── 1. Data Cleaning ──────────── 394k transactions retained (72.7%)
    ├── 2. RFM Aggregation ────────── 4,290 customer profiles
    ├── 3. Quintile Scoring ───────── R/F/M scores (1-5 scale)
    ├── 4. Rule-Based Segments ────── 10 business segments
    ├── 5. Hopkins Statistic ──────── 0.956 (clustering validated)
    ├── 6. Preprocessing ──────────── log-transform vs Yeo-Johnson
    ├── 7. Algorithm Comparison ───── K-Means vs GMM (4 combinations)
    └── 8. Best Model ─────────────── K-Means + log (Silhouette: 0.380)
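
Steps 2 and 3 of the flow above (RFM aggregation and quintile scoring) can be sketched in pandas as shown here. The column names, the snapshot date convention (one day after the last transaction), and the rank-based tie handling are assumptions for illustration, not details confirmed by the repository.

```python
import numpy as np
import pandas as pd

# Synthetic transactions (not the retail data); column names are hypothetical.
rng = np.random.default_rng(1)
n = 200
tx = pd.DataFrame({
    "customer_id": rng.integers(1, 21, size=n),
    "invoice_no": np.arange(n),
    "invoice_date": pd.Timestamp("2011-01-01")
    + pd.to_timedelta(rng.integers(0, 340, size=n), unit="D"),
    "amount": rng.uniform(5, 200, size=n).round(2),
})

# Step 2: one row per customer with recency (days), frequency, monetary.
snapshot = tx["invoice_date"].max() + pd.Timedelta(days=1)
rfm = tx.groupby("customer_id").agg(
    recency=("invoice_date", lambda d: (snapshot - d.max()).days),
    frequency=("invoice_no", "nunique"),
    monetary=("amount", "sum"),
)


def quintile_score(values: pd.Series, higher_is_better: bool = True) -> pd.Series:
    """Score 1-5 by quintile; ranking first avoids duplicate bin edges on ties."""
    labels = [1, 2, 3, 4, 5] if higher_is_better else [5, 4, 3, 2, 1]
    ranks = values.rank(method="first")
    return pd.qcut(ranks, q=5, labels=labels).astype(int)


# Step 3: low recency is good, so its scale is reversed.
rfm["r_score"] = quintile_score(rfm["recency"], higher_is_better=False)
rfm["f_score"] = quintile_score(rfm["frequency"])
rfm["m_score"] = quintile_score(rfm["monetary"])
print(rfm.head())
```

Ranking with `method="first"` breaks ties by row order, which keeps the five bins equal in size but means customers with identical values can land in adjacent quintiles.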

Tech Stack

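| Category | Tools |
|---|---|
| Language | Python 3.11 |
| Data | pandas, numpy |
| ML | scikit-learn (K-Means, GMM, Yeo-Johnson); Hopkins statistic in the pipeline code (not a scikit-learn built-in) |
| Visualization | matplotlib, seaborn |
| Testing | pytest (61 tests), pytest-cov |
| CI/CD | GitHub Actions (Python 3.10, 3.11, 3.12) |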

Project Structure

rfm-customer-segmentation/
├── src/
│   ├── __init__.py
│   └── rfm_pipeline.py               # Reusable pipeline class (K-Means, GMM, Hopkins)
├── notebooks/
│   ├── 01_data_exploration.ipynb      # EDA & data cleaning
│   └── 02_rfm_analysis.ipynb         # RFM scoring & clustering
├── tests/
│   ├── conftest.py                    # Shared test fixtures (50 synthetic customers)
│   └── test_pipeline.py              # 61 unit tests across 10 test classes
├── visualizations/                    # 7 publication-ready PNGs
├── data/
│   └── online_retail_clean.csv.zip   # Cleaned dataset (394k transactions)
├── run_pipeline.py                    # CLI entrypoint (full pipeline + all visualizations)
├── .github/workflows/test.yml         # CI: tests + coverage badge
└── requirements.txt

Dataset

Source: UCI Machine Learning Repository, Online Retail
Period: Dec 2010 to Dec 2011 (12.4 months)
Size: 541,909 transactions | 4,290 unique customers | UK-based (89.1%)


Version History

| Version | What changed | PR |
|---|---|---|
| v1.0 | Baseline: RFM + K-Means, 36 tests, CI | -- |
| v2.0 | +GMM, +Yeo-Johnson, +Hopkins, 61 tests | #1 |

Future Enhancements

  • Predictive CLV model (Random Forest / XGBoost)
  • Churn prediction classifier
  • Real-time segmentation API (FastAPI)
  • Interactive dashboard (Streamlit or Power BI)

Author

Lee Christian Lesemann
Azure AI Engineer | Customer Analytics Consultant
Previous: Sanofi, CSL Behring, Abbott, Teva Pharmaceuticals, IQVIA

LinkedIn


License

MIT License -- see LICENSE for details
