Releases: leelesemann-sys/rfm-customer-segmentation

v2.0 — Multi-Algorithm Comparison

12 Feb 02:12
ef3b365

v2.0: Multi-Algorithm Comparison (K-Means vs GMM)

Systematic comparison of clustering algorithms and preprocessing methods.

What's new

  • GMM clustering (cluster_gmm()): Gaussian Mixture Model with BIC/AIC model selection
  • Yeo-Johnson transform: power transform offered as an alternative to log1p
  • Hopkins statistic: checks clustering tendency before any algorithm is run (0.956 on this dataset)
  • compare_algorithms(): runs all 4 algorithm/transform combinations (K-Means and GMM, each with log and Yeo-Johnson) in one call
  • Algorithm comparison dashboard: new 7th visualization
  • 61 unit tests (up from 36 in v1.0)
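
To make the clustering tendency check concrete, here is a minimal, dependency-free sketch of a Hopkins statistic. It is not the project's implementation: the function name `hopkins_statistic`, its parameters, and the toy data are assumptions for illustration. It uses the convention in which values near 0.5 indicate uniformly scattered data and values near 1 indicate strong clustering tendency; some libraries report the complement instead.

```python
import math
import random


def hopkins_statistic(points, sample_size, seed=0):
    """Estimate the Hopkins statistic H for a list of equal-length tuples.

    Convention used here: H near 0.5 means no structure, H near 1 means clusters.
    """
    rng = random.Random(seed)
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    # w: distance from each sampled real point to its nearest *other* real point.
    sampled_idx = rng.sample(range(len(points)), sample_size)
    w = [
        min(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
        for i in sampled_idx
    ]

    # u: distance from uniform random points (drawn inside the data's
    # bounding box) to their nearest real point.
    uniform = [
        tuple(rng.uniform(lows[d], highs[d]) for d in range(dims))
        for _ in range(sample_size)
    ]
    u = [min(math.dist(p, q) for q in points) for p in uniform]

    return sum(u) / (sum(u) + sum(w))


# Toy data (hypothetical): two tight blobs versus uniformly scattered points.
data_rng = random.Random(42)
clustered = [(data_rng.gauss(0, 0.05), data_rng.gauss(0, 0.05)) for _ in range(100)]
clustered += [(data_rng.gauss(5, 0.05), data_rng.gauss(5, 0.05)) for _ in range(100)]
scattered = [(data_rng.uniform(0, 5), data_rng.uniform(0, 5)) for _ in range(200)]

h_clustered = hopkins_statistic(clustered, sample_size=20)
h_scattered = hopkins_statistic(scattered, sample_size=20)
print(f"clustered: {h_clustered:.3f}, scattered: {h_scattered:.3f}")
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is fine for a sketch but would likely be replaced by a spatial index on a real customer table.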

Algorithm Comparison Results

Algorithm  Transform    Silhouette  Davies-Bouldin
K-Means    log          0.380       0.857
K-Means    Yeo-Johnson  0.338       1.019
GMM        log          0.112       1.851
GMM        Yeo-Johnson  0.197       1.768
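
For readers unfamiliar with the two metrics in this table: the Silhouette score ranges from -1 to 1 and higher is better, while the Davies-Bouldin index is non-negative and lower is better. The sketch below computes both from scratch on a tiny hypothetical dataset. The function names and data are assumptions for illustration; the project may well rely on library implementations such as scikit-learn's `silhouette_score` and `davies_bouldin_score` instead.

```python
import math
from collections import defaultdict


def _group(points, labels):
    """Collect points by cluster label."""
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    return clusters


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = _group(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)


def davies_bouldin(points, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    clusters = _group(points, labels)
    keys = list(clusters)
    centroids = {
        k: tuple(sum(c) / len(clusters[k]) for c in zip(*clusters[k])) for k in keys
    }
    scatter = {
        k: sum(math.dist(p, centroids[k]) for p in clusters[k]) / len(clusters[k])
        for k in keys
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in keys
            if j != i
        )
        for i in keys
    ]
    return sum(worst) / len(worst)


# Hypothetical data: two well-separated groups, labelled sensibly and badly.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1]

sil_good, sil_bad = silhouette(points, good_labels), silhouette(points, bad_labels)
db_good, db_bad = davies_bouldin(points, good_labels), davies_bouldin(points, bad_labels)
print(f"good labels: silhouette={sil_good:.3f}, davies_bouldin={db_good:.3f}")
print(f"bad labels:  silhouette={sil_bad:.3f}, davies_bouldin={db_bad:.3f}")
```

Read against the table, a higher Silhouette together with a lower Davies-Bouldin value (as in the K-Means + log row) points to tighter, better-separated clusters.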

Key Finding

K-Means + log-transform remains the best approach for this dataset, with the highest Silhouette score (0.380) and the lowest Davies-Bouldin index (0.857) of the four combinations. Unlike some published results (Shobayo et al., 2023), GMM does not improve cluster quality here, likely because the log-transformed RFM features already form roughly spherical clusters, which is the case K-Means handles best.

Changelog vs v1.0

  • src/rfm_pipeline.py: +258 lines (GMM, Yeo-Johnson, Hopkins, compare_algorithms)
  • tests/test_pipeline.py: +189 lines (25 new tests)
  • run_pipeline.py: +136 lines (comparison pipeline + visualization)
  • New: visualizations/7_algorithm_comparison.png

v1.0 — RFM + K-Means Baseline

12 Feb 02:03

v1.0: Rule-Based RFM + K-Means Baseline

First complete version of the customer segmentation pipeline.

What's included

  • 10 RFM segments via rule-based quintile scoring (R/F/M 1-5 scale)
  • K-Means clustering (K=4, Silhouette Score: 0.380) with log-transform + StandardScaler
  • Reusable RFMPipeline class (src/rfm_pipeline.py) with full type hints and docstrings
  • 36 unit tests with pytest, GitHub Actions CI across Python 3.10-3.12
  • 82% test coverage with auto-generated badge
  • CLI entrypoint (run_pipeline.py) for reproducible pipeline execution
  • 6 publication-ready visualizations
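
As a rough illustration of the rule-based quintile scoring mentioned in the list above, the sketch below assigns each customer a 1-5 score for recency, frequency, and monetary value by rank-based quintile. Everything here is an assumption for illustration: the helper names, the sample customers, and especially the two segment rules in `label()`, which are hypothetical and do not reproduce the project's 10 segment definitions.

```python
def quintile_scores(values, higher_is_better=True):
    """Assign each value a 1-5 score by rank-based quintile.

    Ties are broken by input order, so equal values can land in
    neighbouring quintiles when they straddle a boundary.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0] * n
    for rank, idx in enumerate(order):
        score = rank * 5 // n + 1  # 1 for the lowest fifth, 5 for the highest
        scores[idx] = score if higher_is_better else 6 - score
    return scores


def label(r, f, m):
    """Hypothetical segment rules, for illustration only."""
    if r >= 4 and f >= 4 and m >= 4:
        return "Champions"
    if r <= 2 and f >= 3:
        return "At risk"
    return "Other"


# Hypothetical customers: recency in days, order count, total spend.
ids = ["C01", "C02", "C03", "C04", "C05", "C06", "C07", "C08", "C09", "C10"]
recency = [2, 40, 7, 200, 15, 90, 365, 30, 5, 120]
frequency = [25, 3, 14, 1, 9, 2, 1, 6, 18, 4]
monetary = [3200.0, 150.0, 1800.0, 40.0, 700.0, 95.0, 20.0, 400.0, 2500.0, 210.0]

# A recent purchase is good, so low recency values get the high scores.
r_scores = quintile_scores(recency, higher_is_better=False)
f_scores = quintile_scores(frequency)
m_scores = quintile_scores(monetary)

rfm = {cid: (r, f, m) for cid, r, f, m in zip(ids, r_scores, f_scores, m_scores)}
segments = {cid: label(*scores) for cid, scores in rfm.items()}

for cid in ids:
    print(cid, rfm[cid], segments[cid])
```

Inverting the recency score keeps all three dimensions on the same "5 is best" scale, which is what lets simple threshold rules combine them.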

Key findings

Metric              Value
Total revenue       £6.7M
Champions           1,127 customers (£4.4M, 65.4%)
At-risk revenue     £575k (512 customers)
K-Means Silhouette  0.380

Known limitations (addressed in v2.0)

  • Only K-Means clustering (no algorithm comparison)
  • Simple log-transform (no Yeo-Johnson/Box-Cox)
  • No clustering tendency test (Hopkins statistic)
  • No outlier handling strategy for clustering
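
To show how the Yeo-Johnson transform named in the limitations relates to the plain log-transform, here is a minimal sketch of the transform for a fixed parameter. The function name `yeo_johnson`, the fixed `lam` values, and the sample spend figures are assumptions for illustration. In practice the parameter is usually estimated from the data (for example by maximum likelihood in library implementations) rather than chosen by hand.

```python
import math


def yeo_johnson(x, lam):
    """Yeo-Johnson power transform of a single value for a fixed lambda."""
    if x >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(x)  # lambda = 0 reduces to log(1 + x)
        return ((x + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-x)  # lambda = 2 case for negative inputs
    return -(((-x + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


# Hypothetical monetary values with a heavy right tail.
monetary = [20.0, 150.0, 3200.0]

log_transformed = [math.log1p(v) for v in monetary]
yj_lambda_zero = [yeo_johnson(v, 0.0) for v in monetary]
yj_lambda_half = [yeo_johnson(v, 0.5) for v in monetary]

print("log1p:        ", [round(v, 3) for v in log_transformed])
print("YJ, lambda=0: ", [round(v, 3) for v in yj_lambda_zero])
print("YJ, lambda=.5:", [round(v, 3) for v in yj_lambda_half])
```

For non-negative inputs, log1p is the lambda = 0 special case of Yeo-Johnson, so the two transforms differ only when the fitted parameter moves away from zero. Unlike Box-Cox, Yeo-Johnson also accepts zero and negative values.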