SageMaker BYOC Distributed Training

A complete example of distributed training on Amazon SageMaker using Bring Your Own Container (BYOC) with PyTorch Lightning and multi-node GPU training.

Overview

This project demonstrates how to:

  • Build a custom Docker container for distributed training
  • Use PyTorch Lightning with SageMaker's distributed training capabilities
  • Train an autoencoder model on MNIST dataset across multiple GPUs/nodes
  • Handle SageMaker's distributed training configuration automatically
  • Deploy and manage training jobs with hyperparameter tuning

Architecture

The solution uses:

  • PyTorch Lightning for distributed training orchestration
  • NVIDIA CUDA 11.8 base image for GPU support
  • SageMaker Training Jobs for managed infrastructure
  • Amazon ECR for container registry
  • DDP (Distributed Data Parallel) strategy for multi-GPU training

Project Structure

```
├── train.py                    # Main training script with Lightning model
├── Dockerfile                  # Container definition
├── training_script.ipynb       # Complete workflow notebook
└── README.md                   # This file
```

Model Details

The project implements a simple autoencoder using PyTorch Lightning:

  • Input: MNIST images (28x28 pixels)
  • Encoder: 784 → 128 → 3 (bottleneck)
  • Decoder: 3 → 128 → 784
  • Loss: MSE reconstruction loss
  • Optimizer: Adam
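The encoder/decoder stack above can be sketched in plain PyTorch as follows (the actual `train.py` wraps this in a `LightningModule`; the variable names here are illustrative):

```python
import torch
from torch import nn

# Encoder: 784 -> 128 -> 3 (bottleneck)
encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3))
# Decoder: 3 -> 128 -> 784 (reconstruction)
decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28))

x = torch.randn(32, 28 * 28)             # a batch of flattened MNIST images
z = encoder(x)                           # 3-dimensional bottleneck codes
x_hat = decoder(z)                       # reconstructed images
loss = nn.functional.mse_loss(x_hat, x)  # MSE reconstruction loss
```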

Key Features

Distributed Training Support

  • Automatically detects SageMaker distributed configuration
  • Handles multi-node, multi-GPU training
  • Configures NCCL for efficient GPU communication
  • Falls back to local training when not in SageMaker environment
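The detect-and-fall-back logic can be sketched roughly like this (the path and key names follow the standard SageMaker training-container contract; the local fallback values are illustrative):

```python
import json
import os

RESOURCE_CONFIG_PATH = "/opt/ml/input/config/resourceconfig.json"

def get_cluster_info(path=RESOURCE_CONFIG_PATH):
    """Return (hosts, current_host); fall back to a single local node
    when not running inside a SageMaker training container."""
    if os.path.exists(path):
        with open(path) as f:
            cfg = json.load(f)
        return cfg["hosts"], cfg["current_host"]
    return ["localhost"], "localhost"

hosts, current_host = get_cluster_info()
node_rank = hosts.index(current_host)  # 0 on the master node
```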

Hyperparameter Configuration

Supports configurable hyperparameters:

  • epochs: Number of training epochs (default: 5)
  • learning-rate: Learning rate (default: 0.001)
  • batch-size: Batch size (default: 32)
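SageMaker passes hyperparameters to the container as command-line flags, so they map naturally onto `argparse`; a minimal sketch matching the flag names and defaults above:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=5)
parser.add_argument("--learning-rate", type=float, default=0.001)
parser.add_argument("--batch-size", type=int, default=32)

# Empty argv -> defaults; in the container SageMaker supplies the real flags
args = parser.parse_args([])
```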

SageMaker Integration

  • Reads training data from /opt/ml/input/data/train
  • Saves model artifacts to /opt/ml/model
  • Outputs logs to /opt/ml/output/data
  • Handles SageMaker's resource configuration automatically
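These paths are also exposed by SageMaker as `SM_*` environment variables, so a common pattern is to read the variable with the contract path as fallback (a sketch, assuming the standard container conventions):

```python
import os

# Standard SageMaker training-container paths, with env-var overrides
TRAIN_DIR = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
MODEL_DIR = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
OUTPUT_DIR = os.environ.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data")
```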

Quick Start

Prerequisites

  • AWS CLI configured with appropriate permissions
  • Docker installed and running
  • SageMaker execution role with necessary permissions

1. Build and Push Container

```shell
# Set your AWS account details
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=$(aws configure get region)
REPO_NAME="sagemaker-pytorch-lightning-distributed-training"

# Create ECR repository
aws ecr create-repository --repository-name $REPO_NAME --region $REGION

# Build and push Docker image
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

docker build -t $REPO_NAME .
docker tag $REPO_NAME:latest $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest
docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest
```

2. Run Training Job

Use the provided Jupyter notebook (training_script.ipynb) which includes:

  • ECR repository setup
  • Container build and push
  • SageMaker training job configuration
  • Both managed and local mode examples
  • Hyperparameter tuning setup

3. Local Testing

For local development and testing:

```python
from sagemaker.local import LocalSession
from sagemaker.estimator import Estimator

sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}

estimator = Estimator(
    image_uri=f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:latest",
    # Replace with your own SageMaker execution role ARN
    role="arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001",
    instance_count=1,
    instance_type='local',
    sagemaker_session=sagemaker_session
)

estimator.fit({'train': "file://./data/"}, wait=True)
```

Training Configuration

Multi-Node Setup

The training script automatically configures distributed training when running on SageMaker:

```python
# Automatic configuration from SageMaker's resource config
# (resource_config is parsed from /opt/ml/input/config/resourceconfig.json)
os.environ["MASTER_ADDR"] = resource_config['hosts'][0]
os.environ["MASTER_PORT"] = "50001"
os.environ["NODE_RANK"] = str(resource_config['hosts'].index(resource_config['current_host']))
os.environ['WORLD_SIZE'] = str(len(resource_config['hosts']) * torch.cuda.device_count())
```

Lightning Trainer Configuration

```python
import lightning as L

trainer = L.Trainer(
    accelerator='auto',
    devices=num_core,      # GPUs per node
    num_nodes=num_node,    # number of nodes
    strategy='ddp',        # Distributed Data Parallel
    max_epochs=epochs,
    default_root_dir='/opt/ml/output/data'
)
```

Hyperparameter Tuning

The notebook includes an example of hyperparameter tuning:

```python
from sagemaker.tuner import HyperparameterTuner
from sagemaker.parameter import CategoricalParameter, ContinuousParameter

hyperparameter_ranges = {
    'learning-rate': ContinuousParameter(0.0001, 0.01),
    'batch-size': CategoricalParameter([16, 32, 64, 128])
}

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='Train Loss',  # must match a metric captured by the estimator's metric_definitions
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=10,
    max_parallel_jobs=2
)
```

Monitoring and Logging

  • Training metrics are logged via Lightning's built-in logging
  • SageMaker captures stdout/stderr for CloudWatch Logs
  • Model artifacts are automatically saved to S3
  • Use estimator.logs() to view training progress
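To surface the loss as a SageMaker metric (e.g. for the tuner's `objective_metric_name`), the estimator needs a `metric_definitions` regex that matches your log lines; the pattern and log format below are assumptions, so adjust them to what your training job actually prints:

```python
import re

# Hypothetical: matches Lightning log lines such as "train_loss=0.1234"
metric_definitions = [
    {"Name": "Train Loss", "Regex": r"train_loss=([0-9.]+)"}
]

line = "Epoch 4: train_loss=0.1234"
match = re.search(metric_definitions[0]["Regex"], line)
```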

Troubleshooting

Common Issues

  1. NCCL Connection Issues: Ensure NCCL_SOCKET_IFNAME is set correctly
  2. Invalid Rank Errors: Verify devices and num_nodes match your instance configuration
  3. Permission Errors: Check SageMaker execution role has necessary S3 and ECR permissions
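When diagnosing NCCL problems, turning on NCCL's own logging before the trainer starts is usually the quickest way to see what is failing; the interface name below is only an example and depends on your instance type:

```python
import os

os.environ["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging to stdout
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # example interface; verify on the instance
```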

Debug Mode

Use a minimal hyperparameter set for a fast end-to-end debug run:

```python
hyperparameters = {
    'epochs': 1,
    'learning-rate': 0.001,
    'batch-size': 32
}
```

Cost Optimization

  • Use Spot instances for training jobs to reduce costs
  • Consider ml.p3.2xlarge for single-GPU training
  • Use ml.p3.8xlarge or ml.p3.16xlarge for multi-GPU training
  • Implement early stopping to avoid unnecessary training time
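Managed Spot Training is enabled through a few extra `Estimator` keyword arguments; a sketch of the relevant settings (the timeout values are illustrative):

```python
# Extra keyword arguments for sagemaker.estimator.Estimator to use Spot capacity
spot_kwargs = {
    "use_spot_instances": True,
    "max_run": 3600,   # maximum training seconds
    "max_wait": 7200,  # total seconds including Spot wait; must be >= max_run
}
```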

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test with both local and SageMaker modes
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.
