End-to-end customer analytics pipeline that segments 4,290 customers from £6.7M in retail transactions using rule-based RFM scoring and unsupervised clustering. Includes a systematic comparison of K-Means vs Gaussian Mixture Model across multiple preprocessing strategies.
| Metric | Value |
|---|---|
| Total revenue analyzed | £6.7M across 394k transactions |
| Champions identified | 1,127 customers generating £4.4M (65%) |
| Revenue at risk | £575k from 512 at-risk customers |
| Actionable segments | 10 with tailored marketing strategies |
Most RFM analyses on this dataset stop at K-Means with default settings. This project goes further:
- Algorithm comparison — K-Means vs GMM, systematically benchmarked
- Preprocessing comparison — log-transform vs Yeo-Johnson power transform
- Statistical validation — Hopkins statistic (0.956) confirms the data is genuinely clusterable before any algorithm is run
- Production-ready code — Reusable `RFMPipeline` class, not just a notebook
- Tested and automated — 61 unit tests, GitHub Actions CI across Python 3.10-3.12
- Iterative development — v1.0 baseline, then v2.0 with multi-algorithm comparison via documented PR
git clone https://github.com/leelesemann-sys/rfm-customer-segmentation.git
cd rfm-customer-segmentation
pip install -r requirements.txt
python run_pipeline.py # Run full pipeline with defaults
python run_pipeline.py --k 5          # Try different cluster counts

| Priority | Segment | Customers | Revenue | Recommended Action |
|---|---|---|---|---|
| High | Champions | 1,127 | £4.4M | VIP programs, loyalty rewards |
| High | At Risk | 453 | £508k | Win-back campaigns, 20% discount |
| High | Can't Lose Them | 59 | £67k | Personal outreach, account managers |
| Medium | Loyal Customers | 802 | £994k | Upselling, cross-sell |
| Medium | New Customers | 136 | £44k | Onboarding, next purchase incentive |
| Low | Lost | 798 | £294k | Low-cost reactivation only |
| Low | Hibernating | 399 | £179k | Mass email campaigns |
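
The rule-based segments above come from quintile-scored RFM values. A minimal sketch, assuming the UCI Online Retail column names and illustrative cut-offs (the pipeline's actual rules define 10 segments):

```python
import pandas as pd

# Cleaned transactions shipped with the repo; pandas reads the zipped CSV directly
df = pd.read_csv("data/online_retail_clean.csv.zip", parse_dates=["InvoiceDate"])
df["Revenue"] = df["Quantity"] * df["UnitPrice"]

# One row per customer: days since last purchase, number of invoices, total spend
snapshot = df["InvoiceDate"].max() + pd.Timedelta(days=1)
rfm = df.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("Revenue", "sum"),
)

# Quintile scores on a 1-5 scale: recent buyers score high on R, heavy buyers on F/M
rfm["R"] = pd.qcut(rfm["Recency"], 5, labels=[5, 4, 3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["Frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["M"] = pd.qcut(rfm["Monetary"], 5, labels=[1, 2, 3, 4, 5]).astype(int)

def segment(row):
    # Illustrative thresholds only; the real mapping covers 10 segments
    if row.R >= 4 and row.F >= 4:
        return "Champions"
    if row.R <= 2 and row.F >= 4:
        return "Can't Lose Them"
    if row.R <= 2 and row.F >= 2:
        return "At Risk"
    if row.R >= 4 and row.F == 1:
        return "New Customers"
    if row.R <= 2:
        return "Hibernating"
    return "Loyal Customers"

rfm["Segment"] = rfm.apply(segment, axis=1)
print(rfm["Segment"].value_counts())
```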
| Algorithm | Transform | Silhouette | Davies-Bouldin | Winner? |
|---|---|---|---|---|
| K-Means | log | 0.380 | 0.857 | Best |
| K-Means | Yeo-Johnson | 0.338 | 1.019 | |
| GMM | Yeo-Johnson | 0.197 | 1.768 | |
| GMM | log | 0.112 | 1.851 | |
Key insight: Contrary to Shobayo et al. (2023), who found GMM superior (Silhouette 0.80 vs 0.62), K-Means outperforms GMM on this dataset. The log transform makes the clusters in RFM space approximately spherical, which is exactly what K-Means assumes; GMM's extra flexibility (elliptical clusters) adds complexity without improving separation.
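
The 2x2 grid above can be reproduced along these lines, continuing from the `rfm` frame in the earlier sketch; k=4, the scaler, and the seed are assumptions (the repo's `run_pipeline.py` exposes `--k`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score

X = rfm[["Recency", "Frequency", "Monetary"]].to_numpy(dtype=float)

# Two preprocessing strategies: log1p + standardize vs Yeo-Johnson (standardizes by default)
transforms = {
    "log": StandardScaler().fit_transform(np.log1p(X)),
    "yeo-johnson": PowerTransformer(method="yeo-johnson").fit_transform(X),
}

for t_name, Xt in transforms.items():
    for m_name, model in [
        ("K-Means", KMeans(n_clusters=4, n_init=10, random_state=42)),
        ("GMM", GaussianMixture(n_components=4, random_state=42)),
    ]:
        labels = model.fit_predict(Xt)
        print(f"{m_name:8s} {t_name:12s} "
              f"silhouette={silhouette_score(Xt, labels):.3f}  "
              f"davies-bouldin={davies_bouldin_score(Xt, labels):.3f}")
```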
| Cluster | Size | Avg. Recency | Avg. Purchases | Avg. Spend |
|---|---|---|---|---|
| Inactive | 921 | 260 days | 1 | £356 |
| Regular | 1,341 | 59 days | 1 | £359 |
| VIP Regulars | 1,434 | 47 days | 4 | £1,442 |
| Super VIPs | 594 | 19 days | 15 | £6,457 |
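
Profiles like the table above can be read off by averaging the raw RFM values per label of the winning configuration (continuing from the comparison sketch; k=4 is assumed from the four profiles listed):

```python
from sklearn.cluster import KMeans

# Fit the winning configuration (K-Means on log-scaled features), then profile on raw values
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(transforms["log"])
profile = rfm.assign(cluster=labels).groupby("cluster").agg(
    size=("Recency", "size"),
    avg_recency_days=("Recency", "mean"),
    avg_purchases=("Frequency", "mean"),
    avg_spend=("Monetary", "mean"),
).round(1)
print(profile)
```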
Raw Data (541k rows)
│
├── 1. Data Cleaning ──────────── 394k transactions retained (72.7%)
├── 2. RFM Aggregation ────────── 4,290 customer profiles
├── 3. Quintile Scoring ───────── R/F/M scores (1-5 scale)
├── 4. Rule-Based Segments ────── 10 business segments
├── 5. Hopkins Statistic ──────── 0.956 (clustering validated)
├── 6. Preprocessing ──────────── log-transform vs Yeo-Johnson
├── 7. Algorithm Comparison ───── K-Means vs GMM (4 combinations)
└── 8. Best Model ─────────────── K-Means + log (Silhouette: 0.380)
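
The Hopkins check in step 5 compares nearest-neighbour distances from random synthetic points against those from real points: values near 0.5 indicate random data, values near 1 indicate strong cluster structure. A minimal sketch using scikit-learn's `NearestNeighbors` (sample fraction and seed are illustrative and may differ from the pipeline's implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, sample_frac=0.1, random_state=42):
    """~0.5 for uniform random data, -> 1.0 for strongly clustered data."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = max(1, int(sample_frac * n))

    # m synthetic points drawn uniformly inside the bounding box of X
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    # m real points sampled from X without replacement
    sample = X[rng.choice(n, size=m, replace=False)]

    nn = NearestNeighbors(n_neighbors=2).fit(X)
    u = nn.kneighbors(uniform, n_neighbors=1)[0].sum()       # uniform -> nearest real point
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1].sum()  # real -> nearest *other* real point
    return u / (u + w)

# e.g. hopkins_statistic(transforms["log"]) on the log-scaled RFM features
```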
| Category | Tools |
|---|---|
| Language | Python 3.11 |
| Data | pandas, numpy |
| ML | scikit-learn (K-Means, GMM, Hopkins, Yeo-Johnson) |
| Visualization | matplotlib, seaborn |
| Testing | pytest (61 tests), pytest-cov |
| CI/CD | GitHub Actions (Python 3.10, 3.11, 3.12) |
rfm-customer-segmentation/
├── src/
│ ├── __init__.py
│ └── rfm_pipeline.py # Reusable pipeline class (K-Means, GMM, Hopkins)
├── notebooks/
│ ├── 01_data_exploration.ipynb # EDA & data cleaning
│ └── 02_rfm_analysis.ipynb # RFM scoring & clustering
├── tests/
│ ├── conftest.py # Shared test fixtures (50 synthetic customers)
│ └── test_pipeline.py # 61 unit tests across 10 test classes
├── visualizations/ # 7 publication-ready PNGs
├── data/
│ └── online_retail_clean.csv.zip # Cleaned dataset (394k transactions)
├── run_pipeline.py # CLI entrypoint (full pipeline + all visualizations)
├── .github/workflows/test.yml # CI: tests + coverage badge
└── requirements.txt
- Source: UCI Machine Learning Repository — Online Retail
- Period: Dec 2010 -- Dec 2011 (12.4 months)
- Size: 541,909 transactions | 4,290 unique customers | UK-based (89.1%)
| Version | What changed | PR |
|---|---|---|
| v1.0 | Baseline: RFM + K-Means, 36 tests, CI | -- |
| v2.0 | +GMM, +Yeo-Johnson, +Hopkins, 61 tests | #1 |
- Predictive CLV model (Random Forest / XGBoost)
- Churn prediction classifier
- Real-time segmentation API (FastAPI)
- Interactive dashboard (Streamlit or Power BI)
Lee Christian Lesemann
Azure AI Engineer | Customer Analytics Consultant
Previous: Sanofi, CSL Behring, Abbott, Teva Pharmaceuticals, IQVIA
MIT License -- see LICENSE for details




