A complete example of distributed training on Amazon SageMaker using Bring Your Own Container (BYOC) with PyTorch Lightning and multi-node GPU training.
This project demonstrates how to:
- Build a custom Docker container for distributed training
- Use PyTorch Lightning with SageMaker's distributed training capabilities
- Train an autoencoder model on MNIST dataset across multiple GPUs/nodes
- Handle SageMaker's distributed training configuration automatically
- Deploy and manage training jobs with hyperparameter tuning
The solution uses:
- PyTorch Lightning for distributed training orchestration
- NVIDIA CUDA 11.8 base image for GPU support
- SageMaker Training Jobs for managed infrastructure
- Amazon ECR for container registry
- DDP (Distributed Data Parallel) strategy for multi-GPU training
The repository layout:

```
├── train.py              # Main training script with Lightning model
├── Dockerfile            # Container definition
├── training_script.ipynb # Complete workflow notebook
└── README.md             # This file
```
The project implements a simple autoencoder using PyTorch Lightning:
- Input: MNIST images (28x28 pixels)
- Encoder: 784 → 128 → 3 (bottleneck)
- Decoder: 3 → 128 → 784
- Loss: MSE reconstruction loss
- Optimizer: Adam
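As a sketch of what the Lightning model in train.py can look like (class and variable names here are illustrative, not necessarily the repo's actual identifiers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L


class LitAutoEncoder(L.LightningModule):
    """Minimal MNIST autoencoder matching the dimensions above."""

    def __init__(self, learning_rate=0.001):
        super().__init__()
        self.learning_rate = learning_rate
        # Encoder: 784 -> 128 -> 3 bottleneck
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3))
        # Decoder: 3 -> 128 -> 784 reconstruction
        self.decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28))

    def training_step(self, batch, batch_idx):
        x, _ = batch                  # labels are unused for reconstruction
        x = x.view(x.size(0), -1)     # flatten 28x28 images to 784 features
        x_hat = self.decoder(self.encoder(x))
        loss = F.mse_loss(x_hat, x)   # MSE reconstruction loss
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.learning_rate)
```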
The training script:

- Automatically detects SageMaker's distributed training configuration
- Handles multi-node, multi-GPU training
- Configures NCCL for efficient GPU communication
- Falls back to local training when not in SageMaker environment
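A minimal sketch of that detection logic, assuming the standard SageMaker resource config location (train.py's actual implementation may differ in its details):

```python
import json
import os

import torch

RESOURCE_CONFIG = "/opt/ml/input/config/resourceconfig.json"


def configure_distributed():
    """Derive the node topology from SageMaker's resource config,
    falling back to a single node outside of SageMaker."""
    if os.path.exists(RESOURCE_CONFIG):
        with open(RESOURCE_CONFIG) as f:
            resource_config = json.load(f)
        hosts = resource_config["hosts"]
        num_nodes = len(hosts)
        os.environ["MASTER_ADDR"] = hosts[0]
        os.environ["MASTER_PORT"] = "50001"
        os.environ["NODE_RANK"] = str(hosts.index(resource_config["current_host"]))
    else:
        num_nodes = 1  # local run: no resource config present
    num_gpus = max(torch.cuda.device_count(), 1)
    os.environ["WORLD_SIZE"] = str(num_nodes * num_gpus)
    return num_nodes, num_gpus
```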
The script supports these configurable hyperparameters:
- `epochs`: Number of training epochs (default: 5)
- `learning-rate`: Learning rate (default: 0.001)
- `batch-size`: Batch size (default: 32)
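With a raw BYOC container, SageMaker delivers hyperparameters as strings in `/opt/ml/input/config/hyperparameters.json` (alongside the resource config used below). A minimal sketch of reading them, assuming train.py takes this route rather than CLI flags:

```python
import json
import os

HYPERPARAMS_FILE = "/opt/ml/input/config/hyperparameters.json"


def load_hyperparameters():
    """Read hyperparameters from SageMaker's config file; all values
    arrive as strings and need casting. Defaults apply for local runs."""
    params = {}
    if os.path.exists(HYPERPARAMS_FILE):
        with open(HYPERPARAMS_FILE) as f:
            params = json.load(f)
    return {
        "epochs": int(params.get("epochs", 5)),
        "learning_rate": float(params.get("learning-rate", 0.001)),
        "batch_size": int(params.get("batch-size", 32)),
    }
```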
The script follows SageMaker's container contract:

- Reads training data from `/opt/ml/input/data/train`
- Saves model artifacts to `/opt/ml/model`
- Outputs logs to `/opt/ml/output/data`
- Handles SageMaker's resource configuration automatically
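For example, the data and artifact paths could be wired up like this (a sketch; it assumes the MNIST files were staged into the train channel in torchvision's expected layout):

```python
import os

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

TRAIN_DIR = "/opt/ml/input/data/train"  # SageMaker mounts the 'train' channel here
MODEL_DIR = "/opt/ml/model"             # contents are packaged into model.tar.gz on S3

# download=False: the data must already be present in the channel
dataset = datasets.MNIST(TRAIN_DIR, train=True, download=False,
                         transform=transforms.ToTensor())
train_loader = DataLoader(dataset, batch_size=32, num_workers=2)

# After training, persist weights where SageMaker will pick them up, e.g.:
# torch.save(model.state_dict(), os.path.join(MODEL_DIR, "model.pt"))
```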
Prerequisites:

- AWS CLI configured with appropriate permissions
- Docker installed and running
- SageMaker execution role with necessary permissions
Build the container and push it to Amazon ECR:

```bash
# Set your AWS account details
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=$(aws configure get region)
REPO_NAME="sagemaker-pytorch-lightning-distributed-training"

# Create ECR repository
aws ecr create-repository --repository-name $REPO_NAME --region $REGION

# Build and push Docker image
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
docker build -t $REPO_NAME .
docker tag $REPO_NAME:latest $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest
docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest
```

Alternatively, use the provided Jupyter notebook (training_script.ipynb), which includes:
- ECR repository setup
- Container build and push
- SageMaker training job configuration
- Both managed and local mode examples
- Hyperparameter tuning setup
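For the managed path, the training job configuration presumably resembles the following sketch (instance type, channel URI, and hyperparameter values are illustrative; `account_id`, `region`, and `repo` are assumed to be defined earlier, as in the local-mode example below):

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri=f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:latest",
    role=sagemaker.get_execution_role(),
    instance_count=2,               # two nodes for multi-node DDP
    instance_type="ml.p3.8xlarge",  # 4 GPUs per node
    hyperparameters={
        "epochs": 5,
        "learning-rate": 0.001,
        "batch-size": 32,
    },
    sagemaker_session=session,
)

# Launch against a training channel staged in S3
estimator.fit({"train": "s3://your-bucket/mnist/train/"})
```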
For local development and testing:

```python
from sagemaker.local import LocalSession
from sagemaker.estimator import Estimator

sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}

estimator = Estimator(
    image_uri=f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:latest",
    role="arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001",
    instance_count=1,
    instance_type='local',
    sagemaker_session=sagemaker_session
)

estimator.fit({'train': "file://./data/"}, wait=True)
```

The training script automatically configures distributed training when running on SageMaker:
```python
# Automatic configuration from SageMaker resource config
os.environ["MASTER_ADDR"] = resource_config['hosts'][0]
os.environ["MASTER_PORT"] = "50001"
os.environ["NODE_RANK"] = str(resource_config['hosts'].index(resource_config['current_host']))
os.environ['WORLD_SIZE'] = str(len(resource_config['hosts']) * torch.cuda.device_count())

trainer = L.Trainer(
    accelerator='auto',
    devices=num_core,       # GPUs per node
    num_nodes=num_node,     # Number of nodes
    strategy='ddp',         # Distributed Data Parallel
    max_epochs=epochs,
    default_root_dir='/opt/ml/output/data'
)
```

The notebook includes an example of hyperparameter tuning:
```python
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    'learning-rate': ContinuousParameter(0.0001, 0.01),
    'batch-size': CategoricalParameter([16, 32, 64, 128])
}

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='Train Loss',
    objective_type='Minimize',  # lower reconstruction loss is better
    # Custom containers must expose metrics via a log-scraping regex;
    # this pattern is an assumption about the script's log format.
    metric_definitions=[{'Name': 'Train Loss', 'Regex': 'train_loss=([0-9\\.]+)'}],
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=10,
    max_parallel_jobs=2
)
```

To monitor a running job:

- Training metrics are logged via Lightning's built-in logging
- SageMaker captures stdout/stderr for CloudWatch Logs
- Model artifacts are automatically saved to S3
- Use `estimator.logs()` to view training progress
Common issues:

- NCCL Connection Issues: Ensure `NCCL_SOCKET_IFNAME` is set correctly
- Invalid Rank Errors: Verify `devices` and `num_nodes` match your instance configuration
- Permission Errors: Check that the SageMaker execution role has the necessary S3 and ECR permissions
Set environment variables for debugging:
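For example, NCCL diagnostics can be turned on through the estimator's `environment` parameter (a sketch; `eth0` is an assumption, so check the actual network interface for your instance type, and `image_uri`, `role`, and `session` are placeholders):

```python
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=2,
    instance_type="ml.p3.8xlarge",
    # Forwarded to the training container as environment variables
    environment={
        "NCCL_DEBUG": "INFO",          # verbose NCCL logging in CloudWatch
        "NCCL_SOCKET_IFNAME": "eth0",  # assumption: adjust to the actual NIC
    },
    sagemaker_session=session,
)
```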
For quick smoke tests, a minimal hyperparameter set keeps runs short:

```python
hyperparameters = {
    'epochs': 1,
    'learning-rate': 0.001,
    'batch-size': 32
}
```

Cost optimization tips:

- Use Spot instances for training jobs to reduce costs (see the sketch below)
- Consider `ml.p3.2xlarge` for single-GPU training
- Use `ml.p3.8xlarge` or `ml.p3.16xlarge` for multi-GPU training
- Implement early stopping to avoid unnecessary training time
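Spot usage is a small change on the estimator (a sketch; the time caps and checkpoint URI are illustrative):

```python
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,  # run on spare capacity at a discount
    max_run=3600,             # cap on training time, in seconds
    max_wait=7200,            # cap on spot wait + training; must be >= max_run
    checkpoint_s3_uri="s3://your-bucket/checkpoints/",  # survive interruptions
    sagemaker_session=session,
)
```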
To contribute:

- Fork the repository
- Create a feature branch
- Make your changes
- Test with both local and SageMaker modes
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.