RFM Customer Segmentation - ML-Driven Approach

End-to-end customer analytics pipeline that segments 4,290 customers from £6.7M in retail transactions using rule-based RFM scoring and unsupervised clustering. Includes a systematic comparison of K-Means vs Gaussian Mixture Models (GMM) across two preprocessing strategies (log-transform and Yeo-Johnson).

Business Impact

Metric	Value
Total revenue analyzed	£6.7M across 394k transactions
Champions identified	1,127 customers generating £4.4M (65%)
Revenue at risk	£575k from 512 customers (At Risk + Can't Lose Them segments)
Actionable segments	10 with tailored marketing strategies

What Makes This Project Different

Most RFM analyses on this dataset stop at K-Means with default settings. This project goes further:

Algorithm comparison: K-Means vs GMM, systematically benchmarked
Preprocessing comparison: log-transform vs Yeo-Johnson power transform
Statistical validation: a Hopkins statistic of 0.956 indicates a strong cluster tendency, checked before any clustering algorithm is run
Production-ready code: reusable RFMPipeline class, not just a notebook
Tested and automated: 61 unit tests, GitHub Actions CI across Python 3.10-3.12
Iterative development: v1.0 baseline, then v2.0 with multi-algorithm comparison via documented PR

Quick Start

git clone https://github.com/leelesemann-sys/rfm-customer-segmentation.git
cd rfm-customer-segmentation
pip install -r requirements.txt

python run_pipeline.py                    # Run full pipeline with defaults
python run_pipeline.py --k 5             # Try different cluster counts

Results

RFM Segments (Rule-Based)

Priority	Segment	Customers	Revenue	Recommended Action
High	Champions	1,127	£4.4M	VIP programs, loyalty rewards
High	At Risk	453	£508k	Win-back campaigns, 20% discount
High	Can't Lose Them	59	£67k	Personal outreach, account managers
Medium	Loyal Customers	802	£994k	Upselling, cross-sell
Medium	New Customers	136	£44k	Onboarding, next purchase incentive
Low	Lost	798	£294k	Low-cost reactivation only
Low	Hibernating	399	£179k	Mass email campaigns
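
To make the rule-based step concrete, here is a minimal sketch of quintile scoring followed by a segment lookup. It is not the repository's code: the column names (`recency`, `frequency`, `monetary`), the helper names `score_rfm` and `assign_segment`, and the thresholds are assumptions chosen for illustration, and only a handful of segments are covered.

```python
import numpy as np
import pandas as pd


def score_rfm(rfm: pd.DataFrame) -> pd.DataFrame:
    """Add 1-5 quintile scores R, F, M to a frame with recency/frequency/monetary."""
    out = rfm.copy()
    for col, score, labels in [
        ("recency", "R", [5, 4, 3, 2, 1]),   # fewer days since last order is better
        ("frequency", "F", [1, 2, 3, 4, 5]),
        ("monetary", "M", [1, 2, 3, 4, 5]),
    ]:
        # Ranking first breaks ties, so qcut never sees duplicate bin edges
        # (common when many customers have frequency == 1).
        ranked = out[col].rank(method="first")
        out[score] = pd.qcut(ranked, 5, labels=labels).astype(int)
    return out


def assign_segment(row: pd.Series) -> str:
    """Map R and F scores to a segment name (hypothetical thresholds)."""
    r, f = row["R"], row["F"]
    if r >= 4 and f >= 4:
        return "Champions"
    if r <= 2 and f >= 4:
        return "At Risk"
    if r >= 4 and f <= 2:
        return "New Customers"
    if r <= 2 and f <= 2:
        return "Lost"
    return "Other"


# Synthetic customers, used only to exercise the functions
rng = np.random.default_rng(0)
rfm = pd.DataFrame({
    "recency": rng.integers(1, 365, 200),
    "frequency": rng.poisson(2, 200) + 1,
    "monetary": rng.lognormal(6, 1, 200).round(2),
})
scored = score_rfm(rfm)
scored["segment"] = scored.apply(assign_segment, axis=1)
print(scored["segment"].value_counts())
```

Ranking before `pd.qcut` is a design choice: it guarantees five equally sized score groups even when the raw values are heavily tied, at the cost of splitting tied customers across adjacent scores.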

Algorithm Comparison

Algorithm	Transform	Silhouette	Davies-Bouldin	Winner?
K-Means	log	0.380	0.857	Best
K-Means	Yeo-Johnson	0.338	1.019
GMM	Yeo-Johnson	0.197	1.768
GMM	log	0.112	1.851

Key insight: Shobayo et al. (2023) found GMM superior (Silhouette 0.80 vs 0.62), but on this dataset K-Means outperforms GMM under both transforms. A likely explanation is that the log-transform makes the RFM features roughly spherical, which matches the cluster shape K-Means assumes. GMM's extra flexibility (elliptical clusters) then adds complexity without improving separation.
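
The comparison grid above can be reproduced in outline with scikit-learn. The sketch below runs the same four algorithm/transform combinations on synthetic, right-skewed data, so its scores will not match the table; the data generator, `fit_labels`, and the choice of `k=4` are assumptions for illustration, not the project's pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for RFM features: strictly positive and right-skewed
X_raw = np.column_stack([
    rng.exponential(90, 500) + 1,   # recency in days
    rng.poisson(3, 500) + 1,        # number of orders
    rng.lognormal(6, 1, 500),       # total spend
])

# Both variants end up standardized (PowerTransformer standardizes by default)
transforms = {
    "log": StandardScaler().fit_transform(np.log1p(X_raw)),
    "Yeo-Johnson": PowerTransformer(method="yeo-johnson").fit_transform(X_raw),
}


def fit_labels(algo: str, X: np.ndarray, k: int = 4) -> np.ndarray:
    """Return hard cluster labels for the named algorithm."""
    if algo == "K-Means":
        return KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    return GaussianMixture(n_components=k, random_state=42).fit_predict(X)


results = []
for t_name, X in transforms.items():
    for algo in ("K-Means", "GMM"):
        labels = fit_labels(algo, X)
        results.append((
            algo,
            t_name,
            silhouette_score(X, labels),       # higher is better
            davies_bouldin_score(X, labels),   # lower is better
        ))

for algo, t_name, sil, db in sorted(results, key=lambda r: -r[2]):
    print(f"{algo:8s} {t_name:12s} silhouette={sil:.3f} davies_bouldin={db:.3f}")
```

Both metrics are computed in the transformed feature space, so scores are only comparable between algorithms that share a transform; comparing across transforms, as the table does, is a looser guide.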

K-Means Clusters (K=4)

Cluster	Size	Avg. Recency	Avg. Purchases	Avg. Spend
Inactive	921	260 days	1	£356
Regular	1,341	59 days	1	£359
VIP Regulars	1,434	47 days	4	£1,442
Super VIPs	594	19 days	15	£6,457
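
A cluster profile like the one above is a per-cluster aggregate of the untransformed RFM values. The following sketch shows one way to build it with pandas; the data is synthetic and the column names are assumed, so the numbers it prints are unrelated to the table.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic RFM frame (hypothetical column names)
rfm = pd.DataFrame({
    "recency": rng.exponential(90, 400) + 1,
    "frequency": rng.poisson(3, 400) + 1,
    "monetary": rng.lognormal(6, 1, 400),
})

# Cluster on log-transformed, standardized features
X = StandardScaler().fit_transform(np.log1p(rfm[["recency", "frequency", "monetary"]]))
rfm["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Profile each cluster on the original scale so the averages stay interpretable
profile = (
    rfm.groupby("cluster")
    .agg(
        size=("recency", "size"),
        avg_recency=("recency", "mean"),
        avg_purchases=("frequency", "mean"),
        avg_spend=("monetary", "mean"),
    )
    .sort_values("avg_spend")
    .round(1)
)
print(profile)
```

Cluster ids from K-Means are arbitrary, so names such as "Inactive" or "Super VIPs" have to be assigned after inspecting a profile like this one.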

Dashboards

Methodology

Raw Data (541k rows)
    │
    ├── 1. Data Cleaning ──────────── 394k transactions retained (72.7%)
    ├── 2. RFM Aggregation ────────── 4,290 customer profiles
    ├── 3. Quintile Scoring ───────── R/F/M scores (1-5 scale)
    ├── 4. Rule-Based Segments ────── 10 business segments
    ├── 5. Hopkins Statistic ──────── 0.956 (clustering validated)
    ├── 6. Preprocessing ──────────── log-transform vs Yeo-Johnson
    ├── 7. Algorithm Comparison ───── K-Means vs GMM (4 combinations)
    └── 8. Best Model ─────────────── K-Means + log (Silhouette: 0.380)
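
Step 5 relies on the Hopkins statistic, which compares nearest-neighbour distances of uniformly generated points with those of sampled real points. The sketch below is one simple formulation (plain distances, uniform points drawn from the data's bounding box) in the convention where values near 1 suggest clustering and values near 0.5 suggest no structure. The function name `hopkins` and its parameters are assumptions; the repository's implementation may differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def hopkins(X, m=None, seed=0):
    """Hopkins statistic: about 0.5 for uniform data, near 1 for clustered data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = m or max(1, n // 10)          # sample size; 10% of the data by default
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # w: distance from sampled real points to their nearest *other* real point
    # (column 0 is the point itself at distance 0, so take column 1)
    idx = rng.choice(n, size=m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

    # u: distance from uniform random points in the bounding box
    # to their nearest real point
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(U, n_neighbors=1)[0][:, 0]

    return u.sum() / (u.sum() + w.sum())


rng = np.random.default_rng(1)
blobs = np.vstack([
    rng.normal(0.0, 0.05, size=(250, 2)),
    rng.normal(5.0, 0.05, size=(250, 2)),
])
uniform = rng.uniform(0, 1, size=(500, 2))

h_clustered = hopkins(blobs)
h_uniform = hopkins(uniform)
print(f"clustered: {h_clustered:.3f}, uniform: {h_uniform:.3f}")
```

Some references define the statistic the other way round (values near 0 meaning clustered), so it is worth checking the convention before comparing numbers across sources.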

Tech Stack

Category	Tools
Language	Python 3.11
Data	pandas, numpy
ML	scikit-learn (K-Means, GMM, Yeo-Johnson); Hopkins statistic via the pipeline class (not a scikit-learn function)
Visualization	matplotlib, seaborn
Testing	pytest (61 tests), pytest-cov
CI/CD	GitHub Actions (Python 3.10, 3.11, 3.12)

Project Structure

rfm-customer-segmentation/
├── src/
│   ├── __init__.py
│   └── rfm_pipeline.py               # Reusable pipeline class (K-Means, GMM, Hopkins)
├── notebooks/
│   ├── 01_data_exploration.ipynb      # EDA & data cleaning
│   └── 02_rfm_analysis.ipynb         # RFM scoring & clustering
├── tests/
│   ├── conftest.py                    # Shared test fixtures (50 synthetic customers)
│   └── test_pipeline.py              # 61 unit tests across 10 test classes
├── visualizations/                    # 7 publication-ready PNGs
├── data/
│   └── online_retail_clean.csv.zip   # Cleaned dataset (394k transactions)
├── run_pipeline.py                    # CLI entrypoint (full pipeline + all visualizations)
├── .github/workflows/test.yml         # CI: tests + coverage badge
└── requirements.txt

Dataset

Source: UCI Machine Learning Repository, Online Retail dataset
Period: Dec 2010 -- Dec 2011 (12.4 months)
Size: 541,909 transactions | 4,290 unique customers | UK-based (89.1%)
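
For readers starting from the raw UCI file, the cleaning and RFM aggregation steps might look roughly like this. The column names are those of the Online Retail dataset; the cleaning rules (drop rows without a customer id, drop cancellations and non-positive quantities or prices) and the snapshot date (one day after the last invoice) are assumptions, and `build_rfm` is a hypothetical helper rather than part of this repository.

```python
import pandas as pd


def build_rfm(tx: pd.DataFrame) -> pd.DataFrame:
    """Aggregate transaction rows into one recency/frequency/monetary row per customer."""
    tx = tx.dropna(subset=["CustomerID"])
    # Cancelled invoices carry a "C" prefix in this dataset
    tx = tx[~tx["InvoiceNo"].astype(str).str.startswith("C")]
    tx = tx[(tx["Quantity"] > 0) & (tx["UnitPrice"] > 0)].copy()
    tx["Revenue"] = tx["Quantity"] * tx["UnitPrice"]

    # Assumed reference date: the day after the last recorded invoice
    snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

    return tx.groupby("CustomerID").agg(
        recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
        frequency=("InvoiceNo", "nunique"),
        monetary=("Revenue", "sum"),
    )


# Tiny hand-made sample in the dataset's column layout
tx = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "C536367", "536368"],
    "Quantity": [6, 2, 1, -1, 3],
    "UnitPrice": [2.5, 3.0, 10.0, 10.0, 4.0],
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01", "2010-12-01", "2011-03-15", "2011-03-16", "2011-12-09"]
    ),
    "CustomerID": [17850, 17850, 17850, 17850, 13047],
})
rfm = build_rfm(tx)
print(rfm)
```

Counting distinct invoice numbers rather than rows keeps frequency tied to orders, since one invoice spans many line items in this dataset.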

Version History

Version	What changed	PR
v1.0	Baseline: RFM + K-Means, 36 tests, CI	--
v2.0	+GMM, +Yeo-Johnson, +Hopkins, 61 tests	#1

Future Enhancements

Predictive CLV model (Random Forest / XGBoost)
Churn prediction classifier
Real-time segmentation API (FastAPI)
Interactive dashboard (Streamlit or Power BI)

Author

Lee Christian Lesemann
Azure AI Engineer | Customer Analytics Consultant
Previous: Sanofi, CSL Behring, Abbott, Teva Pharmaceuticals, IQVIA

License

MIT License -- see LICENSE for details
