This repository contains the official code for the Master's thesis "Enhancing Financial Client Segmentation Models through Time-Series Clustering".
In this work, we propose an unsupervised, graph-based framework for client segmentation, developed and tested on simulated customer and transaction data. Account behavior is modeled through time series derived from financial data, from which conditional dependencies are inferred using the Sparse QUadratic approximation for Inverse Covariance (SQUIC) and SQUIC-Fit algorithms. These dependencies are represented as graphs, and multiple clustering methods are applied to uncover communities of behaviorally similar clients. The evaluation shows that the framework can generate client segments in settings where only aggregate balance data are available, simulating scenarios in which detailed transaction records cannot be accessed. The approach also scales across datasets of increasing size, indicating potential for application in realistic financial contexts.
Client segmentation is a fundamental task in financial services, enabling institutions to tailor products, enhance customer satisfaction and strengthen risk management. Traditional segmentation approaches often rely on static demographic or firmographic attributes, which fail to capture the behavioral diversity of clients. To address this limitation, this thesis advances customer segmentation methodologies by leveraging innovative techniques in time-series clustering.
- AMLSim datasets with 100, 1K, 10K, and 100K users can be downloaded from here:
  - AMLSim100: 100 clients and 10,000 transactions;
  - AMLSim1K: 1,000 clients and 100,000 transactions;
  - AMLSim10K: 10,000 clients and 1,000,000 transactions;
  - AMLSim100K: 100,000 clients and 10,000,000 transactions.
- Data generated with the PaySim tool can be downloaded from here:
  - PaySim100: 111 clients and 12,492 transactions;
  - PaySim1K: 1,026 clients and 103,884 transactions;
  - PaySim10K: 10,284 clients and 1,100,726 transactions;
  - PaySim100K: 102,249 clients and 10,900,690 transactions.
- Install SQUIC
Follow the SQUIC User Manual to install the SQUIC library.
- Install all required dependencies
Python 3.10 or higher is required. Install the dependencies with:
    pip install -r requirements.txt
- Download the Dataset
Use the links provided in the Data Sources section to download the datasets.
- Set Up the Folder Structure
In the root directory of the project, create a folder named Datasets with the following structure:
    FinancialCrimeModels
    ├── experiment1.py
    ├── experiment2_paysim.py
    ├── ...
    ├── Datasets
        ├── AMLSim
            ├── 100 users (containing the csv files)
            ├── 1K users (containing the csv files)
            ├── 10K users (containing the csv files)
            ├── 100K users (containing the csv files)
        ├── PaySim
            ├── 100 users (containing the csv files)
            ├── 1K users (containing the csv files)
            ├── 10K users (containing the csv files)
            ├── 100K users (containing the csv files)
Each subfolder (e.g., 100, 1K, etc.) should contain the corresponding .csv files from the dataset.
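Before running the experiments, it can help to confirm that the layout is in place. The following is a minimal sketch, not part of the repository: the helper name is hypothetical, and it assumes the subfolders are named literally as in the tree above ("100 users", "1K users", ...). If your folders are named "100", "1K", and so on, adjust SIZES accordingly.

```python
from pathlib import Path

# Assumption: folder names are taken literally from the tree above.
SOURCES = ("AMLSim", "PaySim")
SIZES = ("100 users", "1K users", "10K users", "100K users")


def missing_dataset_dirs(root="."):
    """Return expected dataset folders that are absent or hold no .csv file."""
    missing = []
    for source in SOURCES:
        for size in SIZES:
            folder = Path(root) / "Datasets" / source / size
            if not folder.is_dir() or not any(folder.glob("*.csv")):
                missing.append(folder)
    return missing


if __name__ == "__main__":
    # Run from the project root; prints nothing when every folder is ready.
    for folder in missing_dataset_dirs():
        print(f"missing or empty: {folder}")
```

Only the dataset sizes you intend to run need to be present, so a non-empty report is not necessarily an error.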
- Run the Program
You can run client segmentation with multiple clustering methods by executing:
    python experiment1.py
To run client segmentation with Spectral Clustering on PaySim (fixed number of clusters), execute:
    python experiment2_paysim.py
A corresponding Jupyter Notebook (.ipynb) is provided for each experiment, as graph visualization with the Cosmograph tool is supported only in notebooks and not in .py files.
- Input the Dataset Name
When prompted, input the dataset name using the format Name_Dimension, for example AMLSim_100 or PaySim_100 (the input is not case sensitive).
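As an illustration of the Name_Dimension format, the sketch below shows one way such an input could be validated and normalized. The helper `parse_dataset_name` is hypothetical and the scripts may handle the prompt differently. The accepted names and dimensions are taken from the datasets listed earlier.

```python
# Hypothetical helper, not taken from the repository.
SOURCES = {"amlsim": "AMLSim", "paysim": "PaySim"}
DIMENSIONS = {"100", "1K", "10K", "100K"}


def parse_dataset_name(raw):
    """Split 'Name_Dimension' into (name, dimension), ignoring case."""
    name, sep, dimension = raw.strip().partition("_")
    name, dimension = SOURCES.get(name.lower()), dimension.upper()
    if not sep or name is None or dimension not in DIMENSIONS:
        raise ValueError(
            f"expected e.g. 'AMLSim_100' or 'PaySim_1K', got {raw!r}"
        )
    return name, dimension


print(parse_dataset_name("amlsim_10k"))  # ('AMLSim', '10K')
```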
- View Results
For experiment1.py (AMLSim/PaySim – Multiple Clustering Methods), the output will look like:
    For lambda = 0.01:
    louvain: PDensity = 0.3, Q = 0.3, nCluster = 4, nIsolated = 0
    leiden: PDensity = 0.33, Q = 0.27, nCluster = 5, nIsolated = 0
    dbscan: PDensity = 0.23, Q = -0.0, nCluster = 1, nIsolated = 0
    spectral: PDensity = 0.23, Q = -0.0, nCluster = 1, nIsolated = 0
    ...
  - PDensity: Average density between clusters
  - Q: Modularity score
  - nCluster: Number of clusters detected
  - nIsolated: Number of isolated nodes
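For readers unfamiliar with the modularity score Q, the sketch below computes the standard Newman modularity of an undirected, unweighted graph using only the standard library. It is meant as a reference for interpreting the numbers, under the assumption that Q follows this common definition. The experiments may rely on a library implementation or a weighted variant, and the function name and example graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to its community label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    intra = defaultdict(int)   # edges with both endpoints in the community
    degree = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            intra[community_of[u]] += 1
    # Q = sum over communities of (intra edge share - expected share).
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Hypothetical example: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
print(modularity(edges, two_groups))
print(modularity(edges, one_group))
```

Under this definition, placing every node in a single community always gives Q = 0, which is consistent with the sample lines above where nCluster = 1.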
For experiment2_paysim, the output will look like:
    For lambda = 0.6 : 'nCluster': 2, 'ARI': -0.03, 'f1': 0.58
    For lambda = 0.5 : 'nCluster': 2, 'ARI': 1.0, 'f1': 1.0
    For lambda = 0.4 : 'nCluster': 2, 'ARI': 1.0, 'f1': 1.0
    ...
  - ARI: Adjusted Rand Index
  - f1: F1 Score
  - nCluster: Number of clusters (should be 2)
A plot is also generated to illustrate how the metrics evolve with different regularization parameters.
When running the Jupyter Notebooks, an additional visualization of the graph is provided, with communities highlighted in different colors.
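To help interpret the ARI values reported by experiment2_paysim, here is a small standard-library sketch of the Adjusted Rand Index based on the usual contingency-table formula. It is an illustration rather than the code used in the experiments, which may call a library routine instead. The function name and the example labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pair counts from the contingency table and its row/column totals.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Same partition with swapped label names: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# Partition that cuts across the true groups: worse than chance.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

The score does not depend on how clusters are named, equals 1.0 for identical partitions, and can drop below zero when agreement is worse than expected by chance, as in the lambda = 0.6 line above.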






