Deploy DeepSeek R1 distillation models on Amazon SageMaker with vLLM on AWS Inferentia2 chips, packaged as a custom Docker container.
This project provides a custom vLLM inference container for deploying DeepSeek R1 Distillation models on Amazon SageMaker Endpoints using AWS Inferentia2 instances. The solution leverages vLLM's Neuron support for high-performance inference on AWS's custom silicon.
| Distillation Model | Base Model | Recommended Instance |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | ml.inf2.xlarge |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | ml.inf2.8xlarge |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | ml.inf2.8xlarge |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | ml.inf2.8xlarge |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | ml.inf2.24xlarge |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | ml.inf2.48xlarge/ml.trn1.32xlarge |
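The table above can be encoded as a small lookup helper so a deployment script fails fast on an unsupported model. This is an illustrative sketch (the dictionary is copied from the table; `recommended_instance` is not part of the project):

```python
# Recommended instance type per distillation model, copied from the table above.
RECOMMENDED_INSTANCE = {
    "DeepSeek-R1-Distill-Qwen-1.5B": "ml.inf2.xlarge",
    "DeepSeek-R1-Distill-Qwen-7B": "ml.inf2.8xlarge",
    "DeepSeek-R1-Distill-Llama-8B": "ml.inf2.8xlarge",
    "DeepSeek-R1-Distill-Qwen-14B": "ml.inf2.8xlarge",
    "DeepSeek-R1-Distill-Qwen-32B": "ml.inf2.24xlarge",
    "DeepSeek-R1-Distill-Llama-70B": "ml.inf2.48xlarge",  # or ml.trn1.32xlarge
}

def recommended_instance(model_id: str) -> str:
    """Return the recommended instance type, accepting 'org/name' HF ids too."""
    name = model_id.split("/")[-1]
    try:
        return RECOMMENDED_INSTANCE[name]
    except KeyError:
        raise ValueError(f"No recommended instance for {model_id}") from None
```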
The solution consists of:
- FastAPI Application (`main.py`): Handles SageMaker inference requests
- vLLM Engine: Optimized for Neuron inference on Inferentia2
- Docker Container: Custom container with vLLM Neuron support
- SageMaker Integration: Compatible with SageMaker hosting requirements
- AWS Account with SageMaker access
- Inferentia2 Quota: Request quota increase for ml.inf2.* instances if needed
- ECR Repository: Create a repository named `sagemaker-neuron-container`
- S3 Bucket: Create a bucket for storing model weights (name must contain "sagemaker")
- SageMaker Notebook: Recommended ml.t3.large with 100GB storage
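The "sagemaker" naming rule above exists because the default SageMaker execution role grants S3 access to buckets whose name contains that string; checking it up front avoids a late permissions failure. A minimal sketch (`validate_bucket_name` is an illustrative helper, not part of the project):

```python
def validate_bucket_name(bucket: str) -> bool:
    """The default SageMaker execution role can access buckets whose name
    contains 'sagemaker'; the S3 bucket prerequisite above relies on that."""
    return "sagemaker" in bucket.lower()
```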
```bash
# Download and extract vLLM installation files
cd ~/SageMaker
wget https://zz-common.s3.us-east-1.amazonaws.com/tmp/install.tar
tar -xvf install.tar
cd install

# Clone vLLM repository
git clone https://github.com/vllm-project/vllm --branch v0.6.1.post2 --single-branch

# Copy customized files
cp arg_utils.py ./vllm/vllm/engine/
cp setup.py ./vllm/
cp neuron.py ./vllm/vllm/model_executor/model_loader/
```

```bash
cd ~/SageMaker

# Login to the AWS Deep Learning Containers ECR registry
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com

# Build container (takes ~10 minutes)
docker build -t sagemaker-neuron-container:deepseek .

# Replace with your AWS Account ID
account_id=<Your AWS Account ID>

# Login to your own ECR registry, then tag and push the image
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin ${account_id}.dkr.ecr.us-west-2.amazonaws.com
docker tag sagemaker-neuron-container:deepseek ${account_id}.dkr.ecr.us-west-2.amazonaws.com/sagemaker-neuron-container:deepseek
docker push ${account_id}.dkr.ecr.us-west-2.amazonaws.com/sagemaker-neuron-container:deepseek
```

```python
# Install required packages
!pip install --upgrade huggingface_hub

# Download model (example: DeepSeek-R1-Distill-Qwen-7B)
from huggingface_hub import snapshot_download

model_id = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-7B'
snapshot_download(repo_id=model_id, local_dir="./models/" + model_id)

# Upload to S3
local_path = "./models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/"
s3_bucket_name = "<YOUR BUCKET NAME>"
s3_path = f"s3://{s3_bucket_name}/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/"
!aws s3 sync {local_path} {s3_path}
```
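Before syncing to S3 it can be worth confirming the download actually completed; this sketch only checks for files that every Hugging Face transformer snapshot ships (`config.json` and the tokenizer config) and is not part of the project:

```python
import os

# Files every Hugging Face transformer snapshot ships; checked before upload.
REQUIRED = ("config.json", "tokenizer_config.json")

def missing_files(present: set[str]) -> list[str]:
    """Return required files absent from a snapshot's file listing."""
    return [f for f in REQUIRED if f not in present]

def check_snapshot(local_path: str) -> list[str]:
    """List required files missing from the downloaded model directory."""
    return missing_files(set(os.listdir(local_path)))
```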
```python
import boto3
import sagemaker
import time

name = "sagemaker-vllm-neuron-qwen-7b-inf2"
role = sagemaker.get_execution_role()
sm_client = boto3.client(service_name='sagemaker')
account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.Session().region_name
image_url = f"{account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-neuron-container:deepseek"
s3_bucket_name = "<YOUR BUCKET NAME>"
model_url = f"s3://{s3_bucket_name}/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/"

# Create model
sm_client.create_model(
    ModelName=name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        'Image': image_url,
        "ModelDataSource": {
            "S3DataSource": {
                "S3Uri": model_url,
                "S3DataType": "S3Prefix",
                "CompressionType": "None",
            },
        },
        'Environment': {
            'NUM_CORES': '2',
            'BATCH_SIZE': '8',
            'SEQUENCE_LENGTH': '4096',
        }
    }
)

# Create endpoint config
sm_client.create_endpoint_config(
    EndpointConfigName=name,
    ProductionVariants=[{
        'InstanceType': 'ml.inf2.8xlarge',
        'InitialInstanceCount': 1,
        'ModelName': name,
        'VariantName': 'AllTraffic',
        "VolumeSizeInGB": 100,
        "ModelDataDownloadTimeoutInSeconds": 300,
        "ContainerStartupHealthCheckTimeoutInSeconds": 600
    }]
)

# Create endpoint
sm_client.create_endpoint(
    EndpointName=name,
    EndpointConfigName=name
)
print(f"Endpoint Name: {name}")
```
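`create_endpoint` returns immediately, and an inf2 endpoint can take many minutes to download weights, compile the model, and pass the startup health check, so it helps to poll before invoking. A generic sketch using the standard `DescribeEndpoint` call (not code from the project):

```python
import time

def is_terminal(status: str) -> bool:
    """Endpoint creation has finished once the status is InService or Failed."""
    return status in {"InService", "Failed"}

def wait_for_endpoint(endpoint_name: str, poll_seconds: int = 30) -> str:
    """Poll DescribeEndpoint until creation succeeds or fails."""
    import boto3  # local import so is_terminal stays usable without boto3
    sm = boto3.client("sagemaker")
    status = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    while not is_terminal(status):
        time.sleep(poll_seconds)
        status = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    return status
```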
```python
import boto3
import json

name = "sagemaker-vllm-neuron-qwen-7b-inf2"
smr_client = boto3.client(service_name='sagemaker-runtime')

payload = {
    "inputs": "What is the capital of France?",
    "parameters": {
        "max_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.9,
    }
}

response = smr_client.invoke_endpoint(
    EndpointName=name,
    Body=json.dumps(payload),
    Accept="application/json",
    ContentType="application/json",
)

result = json.loads(response["Body"].read().decode('utf-8'))
print(result)
```

The container exposes two endpoints required by SageMaker:
- `/ping`: Health check endpoint
- `/invocations`: Inference endpoint
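The `/invocations` body follows the schema shown in the invocation example above, so a handler's first step is to split it into the prompt and its sampling parameters. A minimal, framework-agnostic sketch (this is not the project's `main.py`):

```python
import json

def parse_invocation(body: bytes) -> tuple[str, dict]:
    """Split an /invocations request body into prompt and sampling parameters."""
    payload = json.loads(body)
    return payload["inputs"], payload.get("parameters", {})
```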
Configure the model behavior using these environment variables:
- `NUM_CORES`: Number of Neuron cores (default: 2)
- `BATCH_SIZE`: Maximum batch size (default: 8)
- `SEQUENCE_LENGTH`: Maximum sequence length (default: 4096)
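Inside the container these variables would typically be read once at startup, falling back to the documented defaults. A sketch (variable names come from the list above; the parsing helper itself is an assumption, not project code):

```python
import os

def neuron_config() -> dict:
    """Read the container's tuning knobs, using the documented defaults."""
    return {
        "num_cores": int(os.environ.get("NUM_CORES", "2")),
        "batch_size": int(os.environ.get("BATCH_SIZE", "8")),
        "sequence_length": int(os.environ.get("SEQUENCE_LENGTH", "4096")),
    }
```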
- AWS PyTorch Inference Neuronx: `2.1.2-neuronx-py310-sdk2.20.1-ubuntu20.04`
- vLLM: v0.6.1.post2 with Neuron support
- FastAPI: 0.115.4 for API handling
- Transformers: 4.43.2 for model loading
- Optimum Neuron: For Inferentia2 optimization
```
/opt/
├── main.py     # FastAPI application
├── serve       # Startup script
└── ml/model/   # Model files (mounted by SageMaker)
```
```python
import boto3

name = "sagemaker-vllm-neuron-qwen-7b-inf2"
sm_client = boto3.client(service_name='sagemaker')

# Delete in dependency order: endpoint, then endpoint config, then model
sm_client.delete_endpoint(EndpointName=name)
sm_client.delete_endpoint_config(EndpointConfigName=name)
sm_client.delete_model(ModelName=name)
```

- Fully Managed Infrastructure: Automatic scaling and patching
- Auto Scaling: Responds to workload changes
- Monitoring: CloudWatch and CloudTrail integration
- Multiple Deployment Options: Real-time, batch, and async endpoints
- Advanced Features: Inference recommender and shadow testing
- Deploying the DeepSeek R1 Distillation Model using Amazon Inferentia2 - Part Two - AWS China Blog (Primary Reference)
- AWS Neuron Documentation
- vLLM Neuron Installation Guide
- SageMaker Custom Containers
- DeepSeek R1 Models on Hugging Face
This project follows the licensing terms of the underlying components (vLLM, DeepSeek models, etc.).