This is the official repository of the paper *Systematic benchmarking of clustering methods on complete and incomplete multi-modal data*.
We present a comprehensive roadmap of recommendations for guiding an optimal clustering analysis of multi-modal datasets, tailored to the specific characteristics of the data. Although clustering is one of the most common data analysis tasks in multi-modal settings, no clear guidelines exist on how to choose the best algorithm to partition a given dataset into clusters. To fill this gap, we conducted a systematic empirical benchmarking study on 20 multi-modal datasets, both fully observed and with modality-wise missing data. More than a million clustering models were generated and evaluated from multiple perspectives (ground-truth label accuracy, cluster stability, robustness to missing data, cluster structure, and computational efficiency) using diverse and robust metrics. Our findings highlight that IMSR and SNF deliver the best overall performance on complete datasets, while NEMO, PIMVC, and IMSR are the most effective methods on incomplete multi-modal data. All results are publicly available through an interactive web resource: https://mmcbench.netlify.app.
To reproduce our results, please follow the steps below.
We used three different programming languages in our benchmarking:

- **MATLAB/Octave**, version 6.4.0. Required packages: `statistics`, `control`.
- **R**, version 4.1.2. Required package: `nnTensor==1.2.0`.
- **Python**, version 3.10. Install all Python dependencies listed in `requirements.txt` one by one using `pip install`.
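The one-by-one installation can also be scripted. Below is a minimal, illustrative sketch (not part of the repository) that turns each line of `requirements.txt` into a separate `pip install` command, so a single failing package does not abort the rest:

```python
import sys

def pip_commands(requirement_lines):
    """Build one `pip install` command per requirement,
    skipping blank lines and comments."""
    return [
        [sys.executable, "-m", "pip", "install", pkg]
        for pkg in (line.strip() for line in requirement_lines)
        if pkg and not pkg.startswith("#")
    ]

# Example with a hypothetical requirements list; run each command
# with subprocess.run(cmd, check=True) to actually install.
for cmd in pip_commands(["numpy\n", "# a comment\n", "", "pandas==2.0.0\n"]):
    print(" ".join(cmd))
```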
To obtain and organize the datasets:
- Follow the instructions, links, and citations in the paper to download all datasets.
- In the `dataset` folder, create a subfolder for each dataset (use the dataset names listed in `aux_data/dataset_table.csv` as the subfolder names) and place the dataset files there, using the naming convention `{dataset_name}_{n_modality}.csv` for each modality and `{dataset_name}_y.csv` for the target. For example: `dataset/nutrimouse/nutrimouse_0.csv`.
- You can find examples of the expected folder and file structure in the datasets directory.
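As a sanity check before running the benchmarks, the naming convention above can be encoded in a small helper that enumerates the files one dataset folder should contain. This is a minimal, illustrative sketch: the function and the modality count are assumptions, not repository code.

```python
def expected_files(root, dataset_name, n_modalities):
    """Paths implied by the naming convention: one CSV per modality
    ({name}_{i}.csv) plus the target file ({name}_y.csv)."""
    paths = [
        f"{root}/{dataset_name}/{dataset_name}_{i}.csv"
        for i in range(n_modalities)
    ]
    paths.append(f"{root}/{dataset_name}/{dataset_name}_y.csv")
    return paths

# Example for nutrimouse, assuming two modalities:
print(expected_files("dataset", "nutrimouse", 2))
# → ['dataset/nutrimouse/nutrimouse_0.csv',
#    'dataset/nutrimouse/nutrimouse_1.csv',
#    'dataset/nutrimouse/nutrimouse_y.csv']
```

Checking each returned path with `os.path.exists` before launching the evaluation scripts catches misnamed files early.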
To generate the masks that simulate incomplete multi-modal data:

```
python src/scripts/generating_indxs.py -save_results
```

To measure computational resources and time:
```
python src/scripts/time_evaluation.py -save_results
```

To evaluate the algorithms on the complete datasets:

```
python src/scripts/complete_algorithms_evaluation.py -save_results -Python -Matlab -R -DL
```

To evaluate them on the incomplete datasets:

```
python src/scripts/incomplete_algorithms_evaluation.py -save_results -Python -Matlab -R -DL
```

Once the results are generated, extract the benchmarking metrics with:

```
python src/scripts/bench_metrics.py
```

If you find this project useful for your work, please cite our paper.
BibTeX entry: