SLURM Array Tutorial: Running Parameterized Experiments on a Cluster

Welcome! This repository demonstrates how to run parameterized experiments on a cluster using SLURM arrays. You'll learn how to structure your code, generate experiment configurations, and launch jobs efficiently—ideal for machine learning, data science, or any batch computation.


Table of Contents

  • Overview
  • Best Practices
  • How It Works
  • Step-by-Step Usage
  • Example Scripts
  • Tips & Troubleshooting
  • Credits

Overview

SLURM arrays allow you to run many jobs in parallel, each with different parameters. This is perfect for hyperparameter sweeps or large-scale experiments.

Key idea:

  • Store experiment parameters in a CSV file.
  • Each SLURM array job reads one row (using its array index) and runs with those parameters.
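
For example, a CSV for a small learning-rate/batch-size sweep might look like the following (illustrative values only); array task 0 runs row 0, task 1 runs row 1, and so on:

model_name,batch_size,learning_rate
resnet50,32,0.001
resnet50,64,0.001
resnet50,32,0.0001
resnet50,64,0.0001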

Best Practices

  • Parameterize everything: Use argument parsers for all experiment settings.
  • Save default configs: Store your parser's defaults for reproducibility.
  • Generate CSVs programmatically: Avoid manual editing—use scripts!
  • Keep code modular: Separate parsing, experiment creation, and job logic.
  • Log outputs and errors: Use SLURM's logging features for debugging.
  • Version your experiment configs: Keep CSVs and scripts under version control.

How It Works

  1. Argument Parser:
    Define all experiment parameters in a Python parser.

  2. CSV Generation:
    Create a CSV where each row is a unique experiment configuration.

  3. SLURM Array Job:
    Each job reads its row from the CSV (using $SLURM_ARRAY_TASK_ID) and runs with those parameters.
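
The core of step 3 fits in a few lines; a minimal sketch of the mechanism (assuming pandas is installed and study_exp_main.csv sits in the working directory):

import os
import pandas as pd

task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])          # set by SLURM for each array task
params = pd.read_csv("study_exp_main.csv").iloc[task_id]  # one row = one experiment
print(dict(params))

This repository passes the index as a --SLURM_ARRAY_TASK_ID flag instead of reading the environment variable directly; see step 3 below.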


Step-by-Step Usage

1. Argument Parsing

Define your experiment parameters in main_parser():

import argparse

def main_parser():
    parser = argparse.ArgumentParser(description="Training Script")
    parser.add_argument("--model_name", type=str, default="resnet50")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=1e-3)
    # ... add more parameters ...
    parser.add_argument("--SLURM_ARRAY_TASK_ID", type=int, default=-1)
    parser.add_argument("--df_name", type=str, default="study_exp_main")
    return parser.parse_args()
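
As a quick local check, the script can be run with its defaults; a sketch (file and function names follow the snippets in this README):

# e.g. at the bottom of main_cpu.py
from parser import main_parser

if __name__ == "__main__":
    args = main_parser()
    print(f"model={args.model_name}, batch_size={args.batch_size}, lr={args.learning_rate}")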

2. Generating Experiment Configurations

Use the ExperimentCreator class to generate a CSV of experiment settings:

from parser import main_parser, return_args_parser_exp
from utils import load_pickle_dict

class ExperimentCreator:
    # ... see code for full implementation ...
    def create_study_experiment(self, config, output_path='./study_exp_main.csv'):
        # Validates and saves experiment configs to CSV
        pass

def main():
    num_samples = 100
    config = {
        'learning_rate': [0.01 * i for i in range(1, num_samples + 1)],
        'batch_size': [2**(2+i) for i in range(1, num_samples + 1)],
    }
    experiment_creator = ExperimentCreator(num_samples=num_samples, parser_type='main')
    experiment_creator.create_study_experiment(config=config)

if __name__ == "__main__":
    main()
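
The full class lives in the repository; below is a simplified sketch of the behaviour create_study_experiment needs (assumptions: pandas is available and defaults is the dict of parser defaults, e.g. vars(main_parser())):

import pandas as pd

def create_study_experiment_sketch(config, defaults, num_samples, output_path='./study_exp_main.csv'):
    """Build one row per experiment: swept values from config, everything else from the defaults."""
    rows = []
    for i in range(num_samples):
        row = dict(defaults)                      # start from the parser defaults
        for key, values in config.items():
            assert len(values) == num_samples, f"{key} must provide {num_samples} values"
            row[key] = values[i]                  # overwrite the swept parameters
        rows.append(row)
    pd.DataFrame(rows).to_csv(output_path, index=False)

Row order matters: row i is picked up by array task i, so avoid reordering the CSV after jobs have been submitted.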

3. Running Locally vs. on Cluster

Use a wrapper to load parameters from the CSV if running on the cluster:

import pandas as pd
from argparse import Namespace

def return_args_parser_exp(save_dict_=True, parser=None, name='main'):
    args = parser()
    if save_dict_:
        # save_dict is a small helper from this repository; it stores the parser defaults for reproducibility
        save_dict(vars(args), f'./default_config_dict_{name}')
    if args.SLURM_ARRAY_TASK_ID != -1:
        # On the cluster: replace the defaults with the CSV row matching this array task
        df = pd.read_csv(f'./{args.df_name}.csv')
        dict_record = df.to_dict('index')[args.SLURM_ARRAY_TASK_ID]
        dict_record["SLURM_ARRAY_TASK_ID"] = args.SLURM_ARRAY_TASK_ID
        args = Namespace(**dict_record)
    return args
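
In the entry scripts the wrapper replaces the direct main_parser() call; a sketch of how main_cpu.py / main_gpu.py would use it (names follow the imports shown above):

from parser import main_parser, return_args_parser_exp

if __name__ == "__main__":
    args = return_args_parser_exp(save_dict_=True, parser=main_parser, name='main')
    # args now holds either the local defaults (task ID -1) or the CSV row for this array task
    print(args)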

4. SLURM Array Job Scripts

CPU Example:

#!/bin/bash
#SBATCH -p Serveurs-CPU
#SBATCH -J cpu_array_test
#SBATCH -c 4
#SBATCH --mem 8000
#SBATCH --error log/cpu_array_test%A_%a.txt
#SBATCH --output log/cpu_array_test%A_%a.out
#SBATCH --array=[0-49]%25

srun singularity run /path/to/image.sif python /path/to/main_cpu.py --SLURM_ARRAY_TASK_ID $SLURM_ARRAY_TASK_ID --df_name study_exp_main
echo "Job $SLURM_JOB_ID with array ID $SLURM_ARRAY_TASK_ID has completed."

GPU Example:

#!/bin/bash
#SBATCH -p GPU48Go
#SBATCH --gres=gpu
#SBATCH -J gpu_array_test
#SBATCH -c 4
#SBATCH --mem 8000
#SBATCH --error log/gpu_array_test%A_%a.txt
#SBATCH --output log/gpu_array_test%A_%a.out
#SBATCH --array=[0-50]%5

srun singularity run --nv /path/to/image.sif python /path/to/main_gpu.py --SLURM_ARRAY_TASK_ID $SLURM_ARRAY_TASK_ID --df_name study_exp_main
echo "Job $SLURM_JOB_ID with array ID $SLURM_ARRAY_TASK_ID has completed."

Note:

  • Change the singularity image and script paths to match your cluster setup.
  • Adjust SLURM parameters (-p, --mem, etc.) as needed.
  • The %25 / %5 suffix on --array limits how many array tasks run at the same time.
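
To keep the --array range in sync with the CSV (see the tips below), you can compute it from the file; a small helper sketch assuming pandas:

import pandas as pd

n_rows = len(pd.read_csv("study_exp_main.csv"))
print(f"--array=0-{n_rows - 1}")  # indices are 0-based, matching df.to_dict('index')

Pasting the printed range into the #SBATCH line (or passing it to sbatch directly) avoids tasks that try to read past the end of the CSV.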

Example Scripts

  • main_cpu.py: Example CPU computation using parsed arguments.
  • main_gpu.py: Example GPU computation using parsed arguments.
  • sbatch_job_create.sh: Script to generate the experiment CSV.

Tips & Troubleshooting

  • Check your CSV: Make sure the number of rows matches your SLURM array range.
  • Log everything: Use SLURM's --output and --error flags, and create the log/ directory before submitting; SLURM will not create it for you.
  • Test locally: Run with --SLURM_ARRAY_TASK_ID -1 before submitting to the cluster.
  • Singularity images: Ensure your Python environment and dependencies are included.
  • Cluster-specific settings: Always adapt SLURM flags to your cluster's requirements.

Credits

Created for a YouTube tutorial by [your channel name].
Feel free to fork, adapt, and share!


Happy experimenting! 🚀
