Releases: leelesemann-sys/rfm-customer-segmentation
v2.0 — Multi-Algorithm Comparison
v2.0: Multi-Algorithm Comparison (K-Means vs GMM)
Systematic comparison of clustering algorithms and preprocessing methods.
What's new
- GMM clustering (`cluster_gmm()`): Gaussian Mixture Model with BIC/AIC model selection
- Yeo-Johnson transform: power transform as an alternative to log1p
- Hopkins statistic: validates clustering tendency before running algorithms (0.956)
- `compare_algorithms()`: runs all 4 combinations in one call
- Algorithm comparison dashboard: new 7th visualization
- 61 unit tests (up from 36 in v1.0)
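
For readers unfamiliar with the Hopkins statistic listed above, here is a minimal NumPy sketch. It assumes the convention where values near 0.5 indicate uniformly scattered data and values near 1 indicate strong clustering tendency; the release notes do not state which convention or implementation the pipeline uses. The function name `hopkins_statistic` and its parameters are hypothetical and not taken from the repository.

```python
import numpy as np


def hopkins_statistic(X, sample_size=50, seed=0):
    """Hopkins statistic: about 0.5 for uniform data, near 1 for clustered data.

    Simplified variant that sums raw distances; some formulations raise each
    distance to the power of the data dimension before summing.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = min(sample_size, n - 1)

    # w: distance from each of m sampled real points to its nearest *other* real point.
    idx = rng.choice(n, size=m, replace=False)
    dist_real = np.linalg.norm(X[idx, None, :] - X[None, :, :], axis=2)
    dist_real[np.arange(m), idx] = np.inf  # exclude each point's distance to itself
    w = dist_real.min(axis=1)

    # u: distance from each of m uniform random points (drawn inside the data's
    # bounding box) to its nearest real point.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = np.linalg.norm(uniform[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    return u.sum() / (u.sum() + w.sum())
```

Under this convention, a value such as the reported 0.956 would suggest the data is far from uniformly distributed, which is the precondition for clustering to be meaningful. The brute-force distance matrix keeps the sketch dependency-free but scales poorly; a nearest-neighbor index would be a more practical choice on large datasets.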
Algorithm Comparison Results
| Algorithm | Transform | Silhouette | Davies-Bouldin |
|---|---|---|---|
| K-Means | log | 0.380 | 0.857 |
| K-Means | Yeo-Johnson | 0.338 | 1.019 |
| GMM | log | 0.112 | 1.851 |
| GMM | Yeo-Johnson | 0.197 | 1.768 |
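
A comparison of this shape can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the repository's `compare_algorithms()`: the function name `compare_on`, the choice of `k=4` for both algorithms, and the fixed seed are all hypothetical, and the scores it produces on other data will differ from the table above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import PowerTransformer, StandardScaler


def compare_on(rfm, k=4, seed=42):
    """Score K-Means and GMM on log1p- and Yeo-Johnson-transformed features.

    Assumes `rfm` is a 2-D array of non-negative values (log1p needs inputs > -1).
    Returns a list of (algorithm, transform, silhouette, davies_bouldin) tuples.
    """
    transformed = {
        "log": StandardScaler().fit_transform(np.log1p(rfm)),
        # PowerTransformer standardizes its output by default (standardize=True).
        "Yeo-Johnson": PowerTransformer(method="yeo-johnson").fit_transform(rfm),
    }
    # Factories so that every combination gets a freshly initialized model.
    models = {
        "K-Means": lambda: KMeans(n_clusters=k, n_init=10, random_state=seed),
        "GMM": lambda: GaussianMixture(n_components=k, random_state=seed),
    }
    rows = []
    for algo, make_model in models.items():
        for name, X in transformed.items():
            labels = make_model().fit_predict(X)
            rows.append(
                (algo, name, silhouette_score(X, labels), davies_bouldin_score(X, labels))
            )
    return rows
```

Higher Silhouette and lower Davies-Bouldin values both indicate better-separated clusters, which is how the four rows in the table should be read.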
Key Finding
K-Means with a log transform remains the best approach for this dataset: it has the highest Silhouette score (0.380) and the lowest Davies-Bouldin index (0.857) of the four combinations. Unlike some published results (Shobayo et al., 2023), GMM does not improve cluster quality here, likely because log-transformed RFM features already favor spherical clusters, which is K-Means' strength.
Changelog vs v1.0
- `src/rfm_pipeline.py`: +258 lines (GMM, Yeo-Johnson, Hopkins, compare_algorithms)
- `tests/test_pipeline.py`: +189 lines (25 new tests)
- `run_pipeline.py`: +136 lines (comparison pipeline + visualization)
- New: `visualizations/7_algorithm_comparison.png`
v1.0 — RFM + K-Means Baseline
v1.0: Rule-Based RFM + K-Means Baseline
First complete version of the customer segmentation pipeline.
What's included
- 10 RFM segments via rule-based quintile scoring (R/F/M 1-5 scale)
- K-Means clustering (K=4, Silhouette Score: 0.380) with log-transform + StandardScaler
- Reusable `RFMPipeline` class (`src/rfm_pipeline.py`) with full type hints and docstrings
- 36 unit tests with pytest, GitHub Actions CI across Python 3.10-3.12
- 82% test coverage with auto-generated badge
- CLI entrypoint (`run_pipeline.py`) for reproducible pipeline execution
- 6 publication-ready visualizations
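
As a rough illustration of the quintile scoring mentioned in the list above, the following pandas sketch assigns 1-5 scores to each RFM dimension. The function name `score_rfm` and the column names `recency`, `frequency`, and `monetary` are assumptions for this example and may not match `RFMPipeline`; the rules that map score combinations onto the 10 segments are not given in these release notes and are not reproduced here.

```python
import pandas as pd


def score_rfm(rfm: pd.DataFrame) -> pd.DataFrame:
    """Add 1-5 quintile scores R, F, M to a copy of the input frame.

    Assumes hypothetical columns: recency (days since last purchase),
    frequency (number of orders), and monetary (total spend).
    """
    scored = rfm.copy()
    ascending = [1, 2, 3, 4, 5]
    # Ranking with method="first" breaks ties, so qcut never sees duplicate bin edges.
    # Recency is reversed: fewer days since the last purchase earns a higher score.
    scored["R"] = pd.qcut(
        scored["recency"].rank(method="first"), 5, labels=ascending[::-1]
    ).astype(int)
    scored["F"] = pd.qcut(
        scored["frequency"].rank(method="first"), 5, labels=ascending
    ).astype(int)
    scored["M"] = pd.qcut(
        scored["monetary"].rank(method="first"), 5, labels=ascending
    ).astype(int)
    return scored
```

Ranking before binning is one common way to handle heavily tied values such as a frequency of 1. The trade-off is that customers with identical raw values can land in adjacent score bins.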
Key findings
| Metric | Value |
|---|---|
| Total revenue | £6.7M |
| Champions | 1,127 customers (£4.4M, 65.4%) |
| At-risk revenue | £575k (512 customers) |
| K-Means Silhouette | 0.380 |
Known limitations (addressed in v2.0)
- Only K-Means clustering (no algorithm comparison)
- Simple log-transform (no Yeo-Johnson/Box-Cox)
- No clustering tendency test (Hopkins Statistic)
- No outlier handling strategy for clustering