This is the official repository of the paper *Systematic benchmarking of clustering methods on complete and incomplete multi-modal data*.
We present a comprehensive roadmap of recommendations for guiding an optimal clustering analysis of multi-modal datasets, tailored to the specific characteristics of the data. Although clustering is one of the most common data analysis tasks in multi-modal settings, no clear guidelines exist on how to choose the best algorithm to partition a given dataset into clusters. To fill this gap, we conducted a systematic empirical benchmarking study on 20 multi-modal datasets, both fully observed and with modality-wise missing data. More than a million clustering models were generated and evaluated from multiple perspectives (ground-truth label accuracy, cluster stability, robustness to missing data, cluster structure, and computational efficiency) using diverse and robust metrics. Our findings highlight that IMSR and SNF deliver the best overall performance on complete datasets, while NEMO, PIMVC, and IMSR are the most effective methods on incomplete multi-modal data. All results are publicly available through an interactive web resource: https://mmcbench.netlify.app.
To reproduce our results, please follow the steps below.
We used three different programming languages in our benchmarking:

- **MATLAB/Octave**, version 6.4.0. Required packages: `statistics`, `control`.
- **R**, version 4.1.2. Required package: `nnTensor==1.2.0`.
- **Python**, version 3.10. Install all Python dependencies listed in `requirements.txt` one by one using `pip install`.
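The one-by-one installation can also be scripted. Below is a minimal, illustrative sketch (not part of the repository) that turns each line of `requirements.txt` into a separate `pip install` command, so a single failing package does not abort the rest:

```python
import sys

def pip_commands(requirement_lines):
    """Build one `pip install` command per requirement,
    skipping blank lines and comments."""
    return [
        [sys.executable, "-m", "pip", "install", pkg]
        for pkg in (line.strip() for line in requirement_lines)
        if pkg and not pkg.startswith("#")
    ]

# Example with a hypothetical requirements list; run each command
# with subprocess.run(cmd, check=True) to actually install.
for cmd in pip_commands(["numpy\n", "# a comment\n", "", "pandas==2.0.0\n"]):
    print(" ".join(cmd))
```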
To obtain and organize the datasets:
- Follow the instructions, links, and citations in the paper to download all datasets.
- In the `dataset` folder, create a subfolder for each dataset (use the dataset names listed in `aux_data/dataset_table.csv` as the subfolder names) and place the dataset files there, using the naming convention `{dataset_name}_{n_modality}.csv` for each modality and `{dataset_name}_y.csv` for the target. For example: `dataset/nutrimouse/nutrimouse_0.csv`.
- You can find examples of the expected folder and file structure in the datasets directory.
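As a sanity check before running the benchmarks, the naming convention above can be encoded in a small helper that enumerates the files one dataset folder should contain. This is a minimal, illustrative sketch: the function and the modality count are assumptions, not repository code.

```python
def expected_files(root, dataset_name, n_modalities):
    """Paths implied by the naming convention: one CSV per modality
    ({name}_{i}.csv) plus the target file ({name}_y.csv)."""
    paths = [
        f"{root}/{dataset_name}/{dataset_name}_{i}.csv"
        for i in range(n_modalities)
    ]
    paths.append(f"{root}/{dataset_name}/{dataset_name}_y.csv")
    return paths

# Example for nutrimouse, assuming two modalities:
print(expected_files("dataset", "nutrimouse", 2))
# → ['dataset/nutrimouse/nutrimouse_0.csv',
#    'dataset/nutrimouse/nutrimouse_1.csv',
#    'dataset/nutrimouse/nutrimouse_y.csv']
```

Checking each returned path with `os.path.exists` before launching the evaluation scripts catches misnamed files early.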
To generate the masks that simulate incomplete multi-modal data:

```
python src/scripts/generating_indxs.py -save_results
```

To measure computational resources and time:
```
python src/scripts/time_evaluation.py -save_results
```

To evaluate the algorithms on the complete datasets:

```
python src/scripts/complete_algorithms_evaluation.py -save_results -Python -Matlab -R -DL
```

To evaluate them on the incomplete datasets:

```
python src/scripts/incomplete_algorithms_evaluation.py -save_results -Python -Matlab -R -DL
```

Once the results are generated, extract the benchmarking metrics with:

```
python src/scripts/bench_metrics.py
```

If you find this project useful for your work, please cite our paper.
BibTeX entry: