A complete example of distributed training on Amazon SageMaker using Bring Your Own Container (BYOC) with PyTorch Lightning and multi-node GPU training.
This project demonstrates how to:
- Build a custom Docker container for distributed training
- Use PyTorch Lightning with SageMaker's distributed training capabilities
- Train an autoencoder model on MNIST dataset across multiple GPUs/nodes
- Handle SageMaker's distributed training configuration automatically
- Deploy and manage training jobs with hyperparameter tuning
The solution uses:
- PyTorch Lightning for distributed training orchestration
- NVIDIA CUDA 11.8 base image for GPU support
- SageMaker Training Jobs for managed infrastructure
- Amazon ECR for container registry
- DDP (Distributed Data Parallel) strategy for multi-GPU training
The repository layout:

```
├── train.py              # Main training script with Lightning model
├── Dockerfile            # Container definition
├── training_script.ipynb # Complete workflow notebook
└── README.md             # This file
```
The project implements a simple autoencoder using PyTorch Lightning:
- Input: MNIST images (28x28 pixels)
- Encoder: 784 → 128 → 3 (bottleneck)
- Decoder: 3 → 128 → 784
- Loss: MSE reconstruction loss
- Optimizer: Adam
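As a sketch of what the Lightning model in train.py can look like (class and variable names here are illustrative, not necessarily the repo's actual identifiers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L


class LitAutoEncoder(L.LightningModule):
    """Minimal MNIST autoencoder matching the dimensions above."""

    def __init__(self, learning_rate=0.001):
        super().__init__()
        self.learning_rate = learning_rate
        # Encoder: 784 -> 128 -> 3 bottleneck
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3))
        # Decoder: 3 -> 128 -> 784 reconstruction
        self.decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28))

    def training_step(self, batch, batch_idx):
        x, _ = batch                  # labels are unused for reconstruction
        x = x.view(x.size(0), -1)     # flatten 28x28 images to 784 features
        x_hat = self.decoder(self.encoder(x))
        loss = F.mse_loss(x_hat, x)   # MSE reconstruction loss
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.learning_rate)
```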
The training script:

- Automatically detects SageMaker's distributed training configuration
- Handles multi-node, multi-GPU training
- Configures NCCL for efficient GPU communication
- Falls back to local training when not in SageMaker environment
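A minimal sketch of that detection logic, assuming the standard SageMaker resource config location (train.py's actual implementation may differ in its details):

```python
import json
import os

import torch

RESOURCE_CONFIG = "/opt/ml/input/config/resourceconfig.json"


def configure_distributed():
    """Derive the node topology from SageMaker's resource config,
    falling back to a single node outside of SageMaker."""
    if os.path.exists(RESOURCE_CONFIG):
        with open(RESOURCE_CONFIG) as f:
            resource_config = json.load(f)
        hosts = resource_config["hosts"]
        num_nodes = len(hosts)
        os.environ["MASTER_ADDR"] = hosts[0]
        os.environ["MASTER_PORT"] = "50001"
        os.environ["NODE_RANK"] = str(hosts.index(resource_config["current_host"]))
    else:
        num_nodes = 1  # local run: no resource config present
    num_gpus = max(torch.cuda.device_count(), 1)
    os.environ["WORLD_SIZE"] = str(num_nodes * num_gpus)
    return num_nodes, num_gpus
```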
The script supports these configurable hyperparameters:
- `epochs`: Number of training epochs (default: 5)
- `learning-rate`: Learning rate (default: 0.001)
- `batch-size`: Batch size (default: 32)
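With a raw BYOC container, SageMaker delivers hyperparameters as strings in `/opt/ml/input/config/hyperparameters.json` (alongside the resource config used below). A minimal sketch of reading them, assuming train.py takes this route rather than CLI flags:

```python
import json
import os

HYPERPARAMS_FILE = "/opt/ml/input/config/hyperparameters.json"


def load_hyperparameters():
    """Read hyperparameters from SageMaker's config file; all values
    arrive as strings and need casting. Defaults apply for local runs."""
    params = {}
    if os.path.exists(HYPERPARAMS_FILE):
        with open(HYPERPARAMS_FILE) as f:
            params = json.load(f)
    return {
        "epochs": int(params.get("epochs", 5)),
        "learning_rate": float(params.get("learning-rate", 0.001)),
        "batch_size": int(params.get("batch-size", 32)),
    }
```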
The script follows SageMaker's container contract:

- Reads training data from `/opt/ml/input/data/train`
- Saves model artifacts to `/opt/ml/model`
- Outputs logs to `/opt/ml/output/data`
- Handles SageMaker's resource configuration automatically
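For example, the data and artifact paths could be wired up like this (a sketch; it assumes the MNIST files were staged into the train channel in torchvision's expected layout):

```python
import os

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

TRAIN_DIR = "/opt/ml/input/data/train"  # SageMaker mounts the 'train' channel here
MODEL_DIR = "/opt/ml/model"             # contents are packaged into model.tar.gz on S3

# download=False: the data must already be present in the channel
dataset = datasets.MNIST(TRAIN_DIR, train=True, download=False,
                         transform=transforms.ToTensor())
train_loader = DataLoader(dataset, batch_size=32, num_workers=2)

# After training, persist weights where SageMaker will pick them up, e.g.:
# torch.save(model.state_dict(), os.path.join(MODEL_DIR, "model.pt"))
```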
Prerequisites:

- AWS CLI configured with appropriate permissions
- Docker installed and running
- SageMaker execution role with necessary permissions
Build the container and push it to Amazon ECR:

```bash
# Set your AWS account details
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=$(aws configure get region)
REPO_NAME="sagemaker-pytorch-lightning-distributed-training"

# Create ECR repository
aws ecr create-repository --repository-name $REPO_NAME --region $REGION

# Build and push Docker image
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
docker build -t $REPO_NAME .
docker tag $REPO_NAME:latest $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest
docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest
```

Alternatively, use the provided Jupyter notebook (training_script.ipynb), which includes:
- ECR repository setup
- Container build and push
- SageMaker training job configuration
- Both managed and local mode examples
- Hyperparameter tuning setup
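For the managed path, the training job configuration presumably resembles the following sketch (instance type, channel URI, and hyperparameter values are illustrative; `account_id`, `region`, and `repo` are assumed to be defined earlier, as in the local-mode example below):

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri=f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:latest",
    role=sagemaker.get_execution_role(),
    instance_count=2,               # two nodes for multi-node DDP
    instance_type="ml.p3.8xlarge",  # 4 GPUs per node
    hyperparameters={
        "epochs": 5,
        "learning-rate": 0.001,
        "batch-size": 32,
    },
    sagemaker_session=session,
)

# Launch against a training channel staged in S3
estimator.fit({"train": "s3://your-bucket/mnist/train/"})
```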
For local development and testing:

```python
from sagemaker.local import LocalSession
from sagemaker.estimator import Estimator

sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}

estimator = Estimator(
    image_uri=f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:latest",
    role="arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001",
    instance_count=1,
    instance_type='local',
    sagemaker_session=sagemaker_session
)

estimator.fit({'train': "file://./data/"}, wait=True)
```

The training script automatically configures distributed training when running on SageMaker:
```python
# Automatic configuration from SageMaker resource config
os.environ["MASTER_ADDR"] = resource_config['hosts'][0]
os.environ["MASTER_PORT"] = "50001"
os.environ["NODE_RANK"] = str(resource_config['hosts'].index(resource_config['current_host']))
os.environ['WORLD_SIZE'] = str(len(resource_config['hosts']) * torch.cuda.device_count())

trainer = L.Trainer(
    accelerator='auto',
    devices=num_core,       # GPUs per node
    num_nodes=num_node,     # Number of nodes
    strategy='ddp',         # Distributed Data Parallel
    max_epochs=epochs,
    default_root_dir='/opt/ml/output/data'
)
```

The notebook includes an example of hyperparameter tuning:
```python
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    'learning-rate': ContinuousParameter(0.0001, 0.01),
    'batch-size': CategoricalParameter([16, 32, 64, 128])
}

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='Train Loss',
    objective_type='Minimize',  # lower reconstruction loss is better
    # Custom containers must expose metrics via a log-scraping regex;
    # this pattern is an assumption about the script's log format.
    metric_definitions=[{'Name': 'Train Loss', 'Regex': 'train_loss=([0-9\\.]+)'}],
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=10,
    max_parallel_jobs=2
)
```

To monitor a running job:

- Training metrics are logged via Lightning's built-in logging
- SageMaker captures stdout/stderr for CloudWatch Logs
- Model artifacts are automatically saved to S3
- Use `estimator.logs()` to view training progress
Common issues:

- NCCL Connection Issues: Ensure `NCCL_SOCKET_IFNAME` is set correctly
- Invalid Rank Errors: Verify `devices` and `num_nodes` match your instance configuration
- Permission Errors: Check that the SageMaker execution role has the necessary S3 and ECR permissions
Set environment variables for debugging:
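For example, NCCL diagnostics can be turned on through the estimator's `environment` parameter (a sketch; `eth0` is an assumption, so check the actual network interface for your instance type, and `image_uri`, `role`, and `session` are placeholders):

```python
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=2,
    instance_type="ml.p3.8xlarge",
    # Forwarded to the training container as environment variables
    environment={
        "NCCL_DEBUG": "INFO",          # verbose NCCL logging in CloudWatch
        "NCCL_SOCKET_IFNAME": "eth0",  # assumption: adjust to the actual NIC
    },
    sagemaker_session=session,
)
```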
For quick smoke tests, a minimal hyperparameter set keeps runs short:

```python
hyperparameters = {
    'epochs': 1,
    'learning-rate': 0.001,
    'batch-size': 32
}
```

Cost optimization tips:

- Use Spot instances for training jobs to reduce costs (see the sketch below)
- Consider `ml.p3.2xlarge` for single-GPU training
- Use `ml.p3.8xlarge` or `ml.p3.16xlarge` for multi-GPU training
- Implement early stopping to avoid unnecessary training time
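Spot usage is a small change on the estimator (a sketch; the time caps and checkpoint URI are illustrative):

```python
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,  # run on spare capacity at a discount
    max_run=3600,             # cap on training time, in seconds
    max_wait=7200,            # cap on spot wait + training; must be >= max_run
    checkpoint_s3_uri="s3://your-bucket/checkpoints/",  # survive interruptions
    sagemaker_session=session,
)
```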
To contribute:

- Fork the repository
- Create a feature branch
- Make your changes
- Test with both local and SageMaker modes
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.