Systematic benchmarking of clustering methods on complete and incomplete multi-modal data

This is the official repository of the paper: Systematic benchmarking of clustering methods on complete and incomplete multi-modal data.

Abstract

We present a comprehensive roadmap of recommendations for guiding an optimal clustering analysis of multi-modal datasets, tailored to the specific characteristics of the data. Although clustering is one of the most common data analysis tasks in multi-modal settings, no clear guidelines exist on how to choose the best algorithm to partition a given dataset into clusters. To fill this gap, we conducted a systematic empirical benchmarking study using 20 multi-modal datasets, both fully observed and with modality-wise missing data. More than a million clustering models were generated and evaluated from multiple perspectives (ground-truth label accuracy, cluster stability, robustness to missing data, cluster structure, and computational efficiency) using diverse and robust metrics. Our findings highlight that IMSR and SNF deliver the best overall performance on complete datasets, while NEMO, PIMVC, and IMSR were the most effective methods when working with incomplete multi-modal data. All results have been made publicly available through an interactive web resource: https://mmcbench.netlify.app.

Reproducibility

To reproduce our results, please follow the steps below.

Set up the environment

We used three different programming languages in our benchmarking:

Octave

  • Version: 6.4.0
  • Required packages: statistics, control.

R

  • Version: 4.1.2
  • Required package: nnTensor==1.2.0.

Python

  • Version: 3.10
  • Install the Python dependencies listed in requirements.txt one by one using pip install.

Download the datasets

To obtain and organize the datasets:

  • Follow the instructions, links, and citations in the paper to download all datasets.
  • In the dataset folder, create a subfolder for each dataset (named after the dataset names listed in aux_data/dataset_table.csv) and place the dataset files inside it. Use the naming convention {dataset_name}_{n_modality}.csv for each modality and {dataset_name}_y.csv for the target. For example: dataset/nutrimouse/nutrimouse_0.csv.
  • You can find examples of the expected folder and file structure in the datasets directory; a minimal layout check is also sketched below.
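As a convenience, the following sketch (not part of the repository) walks the expected layout and reports missing pieces. It assumes the root folder is named dataset, as in the example path above, and that aux_data/dataset_table.csv has a column named dataset holding the dataset names; adjust both to match the actual files.

```python
import csv
from pathlib import Path

DATASET_ROOT = Path("dataset")  # root folder, per the example path above
TABLE = "aux_data/dataset_table.csv"

with open(TABLE, newline="") as f:
    # Column name "dataset" is an assumption; check the CSV header.
    names = [row["dataset"] for row in csv.DictReader(f)]

for name in names:
    folder = DATASET_ROOT / name
    # Modality files: {name}_0.csv, {name}_1.csv, ...
    modalities = sorted(folder.glob(f"{name}_[0-9]*.csv"))
    target = folder / f"{name}_y.csv"
    ok = folder.is_dir() and modalities and target.exists()
    print(f"{'OK' if ok else 'MISSING':8s} {name}: "
          f"{len(modalities)} modality file(s), "
          f"target {'found' if target.exists() else 'absent'}")
```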

Generate missing data masks

These masks simulate incomplete multi-modal data:

python src/scripts/generating_indxs.py -save_results
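For intuition, here is a minimal sketch of what a modality-wise missingness mask can look like. generating_indxs.py is the authoritative implementation; the sampling scheme below (independent dropout per sample and modality, with every sample kept in at least one modality) is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_modalities, missing_rate = 10, 3, 0.3

# mask[i, m] == False means sample i is unobserved in modality m
mask = rng.random((n_samples, n_modalities)) >= missing_rate

# Ensure every sample is observed in at least one modality
for i in np.flatnonzero(~mask.any(axis=1)):
    mask[i, rng.integers(n_modalities)] = True

print(mask.astype(int))
```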

Evaluate runtime performance

To measure computational resources and time:

python src/scripts/time_evaluation.py -save_results
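The script automates this across methods; the sketch below only illustrates the kind of measurement involved (wall-clock time and peak traced memory around a single model fit). KMeans is a stand-in here, not one of the benchmarked multi-modal methods.

```python
import time
import tracemalloc

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((500, 20))

tracemalloc.start()
t0 = time.perf_counter()
KMeans(n_clusters=3, n_init=10).fit(X)  # stand-in clustering call
elapsed = time.perf_counter() - t0
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"runtime: {elapsed:.2f} s, peak traced memory: {peak / 1e6:.1f} MB")
```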

Run the benchmark

For complete multi-modal data:

python src/scripts/complete_algorithms_evaluation.py -save_results -Python -Matlab -R -DL

For incomplete multi-modal data:

python src/scripts/incomplete_algorithms_evaluation.py -save_results -Python -Matlab -R -DL

Compute metrics

Once the results are generated, extract the benchmarking metrics with:

python src/scripts/bench_metrics.py
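For reference, ground-truth label accuracy in this setting is typically scored with permutation-invariant measures such as the adjusted Rand index and normalized mutual information. The snippet below illustrates metrics of that kind; it is not the exact set computed by bench_metrics.py (see the paper for that).

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]  # same partition, relabeled

print("ARI:", adjusted_rand_score(y_true, y_pred))  # 1.0: identical partitions
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
```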

Citing the paper

If you find this project useful for your research, please cite our paper.

BibTeX entry:
