M.Sc. Thesis Repo - Enhancing Financial Client Segmentation Models through Time-Series Clustering

cricci3/FinancialCrimeModels

Enhancing Financial Client Segmentation Models through Time-Series Clustering

This repository contains the official code for the Master's thesis "Enhancing Financial Client Segmentation Models through Time-Series Clustering".

[Images: timeseries, leiden_02]

In this work, we propose a framework that generates client segments when only aggregate balance data are available, simulating scenarios in which detailed transaction records cannot be accessed. Using simulated customer and transaction data, the study develops an unsupervised, graph-based framework for client segmentation. Account behavior is modeled through time series derived from financial data, from which conditional dependencies are inferred using the Sparse QUadratic approximation for Inverse Covariance (SQUIC) and SQUIC-Fit algorithms. These dependencies are represented as graphs, and multiple clustering methods are applied to uncover communities of behaviorally similar clients. The evaluation shows that the framework produces client segments under this restriction and scales across datasets of increasing size, indicating potential for application in realistic financial contexts.
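The step from time series to a dependency graph can be illustrated with a small sketch. It is not the SQUIC or SQUIC-Fit algorithm: as a stand-in, it inverts a ridge-regularized sample correlation matrix with NumPy and keeps an edge wherever the absolute partial correlation exceeds a threshold. The function name `partial_correlation_graph` and the parameters `ridge` and `threshold` are hypothetical and do not come from the repository.

```python
import numpy as np


def partial_correlation_graph(X, ridge=0.1, threshold=0.2):
    """Build an undirected dependency graph from account time series.

    X has shape (n_timesteps, n_accounts). This is a simplified stand-in
    for sparse inverse covariance estimation (NOT SQUIC): it inverts a
    ridge-regularized correlation matrix and thresholds the result.
    Returns a set of edges (i, j) with i < j.
    """
    # Standardize each account's series so scales are comparable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    S = np.cov(Z, rowvar=False)
    n = S.shape[0]
    # The ridge term keeps the matrix invertible when accounts outnumber timesteps.
    theta = np.linalg.inv(S + ridge * np.eye(n))
    # Partial correlation between i and j: -theta_ij / sqrt(theta_ii * theta_jj).
    d = np.sqrt(np.diag(theta))
    pcorr = -theta / np.outer(d, d)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if abs(pcorr[i, j]) > threshold
    }


# Synthetic demo: two groups of three accounts, each driven by its own factor.
rng = np.random.default_rng(0)
factors = rng.normal(size=(2000, 2))
noise = 0.5 * rng.normal(size=(2000, 6))
X_demo = np.repeat(factors, 3, axis=1) + noise
edges_demo = partial_correlation_graph(X_demo)
```

In this synthetic setting the edges should connect only accounts that share a factor, so a community detection method applied to the graph would see two groups.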

Motivation & contributions

Client segmentation is a fundamental task in financial services, enabling institutions to tailor products, enhance customer satisfaction and strengthen risk management. Traditional segmentation approaches often rely on static demographic or firmographic attributes, which fail to capture the behavioral diversity of clients. To address this limitation, this thesis advances customer segmentation methodologies by leveraging innovative techniques in time-series clustering.

Data Sources

  • AMLSim datasets with 100, 1K, 10K, and 100K users can be downloaded from here:

    • AMLSim100: 100 clients and 10,000 transactions;
    • AMLSim1K: 1,000 clients and 100,000 transactions;
    • AMLSim10K: 10,000 clients and 1,000,000 transactions;
    • AMLSim100K: 100,000 clients and 10,000,000 transactions.
  • Data generated with the PaySim tool can be downloaded from here:

    • PaySim100: 111 clients and 12,492 transactions;
    • PaySim1K: 1,026 clients and 103,884 transactions;
    • PaySim10K: 10,284 clients and 1,100,726 transactions;
    • PaySim100K: 102,249 clients and 10,900,690 transactions.

How to Run

  1. Install SQUIC

    Follow the SQUIC User Manual to install the SQUIC library.

  2. Install all required dependencies

    Python 3.10 or higher is required, then install all required dependencies with:

    pip install -r requirements.txt
    
  3. Download the Dataset
    Use the links provided in the Data Sources section to download the datasets.

  4. Set Up the Folder Structure
    In the root directory of the project, create a folder named Datasets with the following structure:

    FinancialCrimeModels
    ├── experiment1.py
    ├── experiment2_paysim.py
    ├── ...
    └── Datasets
        ├── AMLSim
        │   ├── 100 users (containing the .csv files)
        │   ├── 1K users (containing the .csv files)
        │   ├── 10K users (containing the .csv files)
        │   └── 100K users (containing the .csv files)
        └── PaySim
            ├── 100 users (containing the .csv files)
            ├── 1K users (containing the .csv files)
            ├── 10K users (containing the .csv files)
            └── 100K users (containing the .csv files)
    

    Each subfolder (e.g., 100 users, 1K users) should contain the corresponding .csv files from the dataset.
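    Before running the experiments, the layout can be checked with a short script. This is a sketch, not part of the repository: the helper `missing_dataset_folders` is hypothetical, and it assumes the subfolders are named literally `100 users`, `1K users`, and so on, as in the tree above.

```python
from pathlib import Path

SOURCES = ("AMLSim", "PaySim")
SIZES = ("100", "1K", "10K", "100K")


def missing_dataset_folders(root):
    """Return the expected dataset folders that are absent or hold no .csv file.

    Folder names follow the tree in this README ("<size> users"); adjust
    them if your local layout differs.
    """
    root = Path(root)
    missing = []
    for source in SOURCES:
        for size in SIZES:
            folder = root / "Datasets" / source / f"{size} users"
            # A folder counts as ready only if it exists and contains a .csv file.
            if not folder.is_dir() or not any(folder.glob("*.csv")):
                missing.append(folder.relative_to(root).as_posix())
    return missing
```

    Calling `missing_dataset_folders(".")` from the project root returns an empty list once every dataset is in place.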

  5. Run the Program
    You can run client segmentation with multiple clustering methods by executing:

    python experiment1.py

    To run client segmentation with Spectral Clustering on PaySim (fixed number of clusters), execute:

    python experiment2_paysim.py

    A corresponding Jupyter Notebook (.ipynb) is provided for each experiment, as graph visualization with the Cosmograph tool is supported only in notebooks and not in .py files.

  6. Input the Dataset Name
    When prompted, enter the dataset name in the format Name_Dimension, for example AMLSim_100 or PaySim_100 (the input is case-insensitive).
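    The Name_Dimension format can be validated as in the sketch below. The function `parse_dataset_name` is hypothetical and may differ from how the scripts parse the prompt; it also assumes that the dimension part takes the values 100, 1K, 10K, and 100K, matching the datasets listed in Data Sources.

```python
KNOWN_SOURCES = {"amlsim": "AMLSim", "paysim": "PaySim"}
KNOWN_SIZES = {"100", "1K", "10K", "100K"}


def parse_dataset_name(raw):
    """Split 'Name_Dimension' into (source, size), ignoring letter case.

    Raises ValueError when the name or the dimension is not recognized.
    """
    name, sep, size = raw.strip().partition("_")
    source = KNOWN_SOURCES.get(name.lower())
    size = size.upper()
    if not sep or source is None or size not in KNOWN_SIZES:
        raise ValueError(f"Expected Name_Dimension such as AMLSim_100, got {raw!r}")
    return source, size
```

    For example, `parse_dataset_name("paysim_10k")` yields the pair `("PaySim", "10K")`, which can then be mapped to the matching folder under Datasets.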

  7. View Results
    For experiment1.py (AMLSim/PaySim – Multiple Clustering Methods), the output will look like:

    For lambda = 0.01:
        louvain:  PDensity = 0.3, Q = 0.3,  nCluster = 4, nIsolated = 0
        leiden:   PDensity = 0.33, Q = 0.27, nCluster = 5, nIsolated = 0
        dbscan:   PDensity = 0.23, Q = -0.0, nCluster = 1, nIsolated = 0
        spectral:   PDensity = 0.23, Q = -0.0, nCluster = 1, nIsolated = 0
     ...
    
    • PDensity: Average density between clusters
    • Q: Modularity score
    • nCluster: Number of clusters detected
    • nIsolated: Number of isolated nodes
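    To make Q and nIsolated concrete, the sketch below computes the standard Newman modularity of a partition and counts nodes without edges, using only the Python standard library. It is an illustration for unweighted graphs, not the code used in experiment1.py, and it does not cover PDensity. The function names are hypothetical.

```python
def modularity(edges, communities):
    """Newman modularity Q of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs; communities: list of sets of nodes.
    Q = sum over communities of  L_c / m - (d_c / (2m))^2,
    where m is the edge count, L_c the number of edges inside community c,
    and d_c the total degree of its nodes.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    comm_of = {node: idx for idx, comm in enumerate(communities) for node in comm}
    internal = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        degree[comm_of[u]] += 1
        degree[comm_of[v]] += 1
        if comm_of[u] == comm_of[v]:
            internal[comm_of[u]] += 1
    return sum(
        internal[c] / m - (degree[c] / (2 * m)) ** 2
        for c in range(len(communities))
    )


def count_isolated(nodes, edges):
    """Number of nodes that appear in no edge."""
    touched = {node for edge in edges for node in edge}
    return sum(1 for node in nodes if node not in touched)


# Two triangles joined by a single bridge edge, plus one node (6) with no edges.
demo_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_two = modularity(demo_edges, [{0, 1, 2}, {3, 4, 5}])
q_one = modularity(demo_edges, [{0, 1, 2, 3, 4, 5}])
```

    When all nodes fall into one community, the formula gives Q = 0, which is consistent with the rows above where nCluster = 1.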

    For experiment2_paysim.py, the output will look like:

    For lambda = 0.6 : 'nCluster': 2, 'ARI': -0.03, 'f1': 0.58
    For lambda = 0.5 : 'nCluster': 2, 'ARI': 1.0, 'f1': 1.0
    For lambda = 0.4 : 'nCluster': 2, 'ARI': 1.0, 'f1': 1.0
    ...
    
    • ARI: Adjusted Rand Index
    • f1: F1 Score
    • nCluster: Number of clusters (should be 2)

    A plot is also generated to illustrate how the metrics evolve with different regularization parameters.

    [Image: plot ARI_F1_squic-fit]
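    The ARI and F1 values reported above can be reproduced on toy labels with the standard-library sketch below. It is an illustration, not the evaluation code of experiment2_paysim.py. Because cluster ids are arbitrary, the helper `best_f1_two_clusters` takes the better of the two possible label assignments; that alignment rule is an assumption, and all function names are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    n = len(labels_true)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def f1_binary(labels_true, labels_pred, positive=1):
    """F1 score of the positive class for binary labels."""
    pairs = list(zip(labels_true, labels_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_two_clusters(labels_true, labels_pred):
    """F1 under the better of the two mappings from cluster ids {0, 1} to classes."""
    flipped = [1 - p for p in labels_pred]
    return max(f1_binary(labels_true, labels_pred), f1_binary(labels_true, flipped))
```

    ARI is invariant to how clusters are numbered, so it needs no alignment step; F1 does, which is why the second metric goes through `best_f1_two_clusters`.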

When running the Jupyter Notebooks, an additional visualization of the graph is provided, with communities highlighted in different colors.

[Images: dbscan_02, louvain_02, cluster_03]
