Deploy DeepSeek R1 distillation models on Amazon SageMaker with vLLM on AWS Inferentia2 chips, packaged as a custom Docker container.
This project provides a custom vLLM inference container for deploying DeepSeek R1 Distillation models on Amazon SageMaker Endpoints using AWS Inferentia2 instances. The solution leverages vLLM's Neuron support for high-performance inference on AWS's custom silicon.
| Distillation Model | Base Model | Recommended Instance |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | ml.inf2.xlarge |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | ml.inf2.8xlarge |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | ml.inf2.8xlarge |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | ml.inf2.8xlarge |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | ml.inf2.24xlarge |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | ml.inf2.48xlarge/ml.trn1.32xlarge |
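The table above can be encoded as a small lookup helper so a deployment script fails fast on an unsupported model. This is an illustrative sketch (the dictionary is copied from the table; `recommended_instance` is not part of the project):

```python
# Recommended instance type per distillation model, copied from the table above.
RECOMMENDED_INSTANCE = {
    "DeepSeek-R1-Distill-Qwen-1.5B": "ml.inf2.xlarge",
    "DeepSeek-R1-Distill-Qwen-7B": "ml.inf2.8xlarge",
    "DeepSeek-R1-Distill-Llama-8B": "ml.inf2.8xlarge",
    "DeepSeek-R1-Distill-Qwen-14B": "ml.inf2.8xlarge",
    "DeepSeek-R1-Distill-Qwen-32B": "ml.inf2.24xlarge",
    "DeepSeek-R1-Distill-Llama-70B": "ml.inf2.48xlarge",  # or ml.trn1.32xlarge
}

def recommended_instance(model_id: str) -> str:
    """Return the recommended instance type, accepting 'org/name' HF ids too."""
    name = model_id.split("/")[-1]
    try:
        return RECOMMENDED_INSTANCE[name]
    except KeyError:
        raise ValueError(f"No recommended instance for {model_id}") from None
```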
The solution consists of:
- FastAPI Application (`main.py`): Handles SageMaker inference requests
- vLLM Engine: Optimized for Neuron inference on Inferentia2
- Docker Container: Custom container with vLLM Neuron support
- SageMaker Integration: Compatible with SageMaker hosting requirements
- AWS Account with SageMaker access
- Inferentia2 Quota: Request quota increase for ml.inf2.* instances if needed
- ECR Repository: Create a repository named `sagemaker-neuron-container`
- S3 Bucket: Create a bucket for storing model weights (name must contain "sagemaker")
- SageMaker Notebook: Recommended ml.t3.large with 100GB storage
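The "sagemaker" naming rule above exists because the default SageMaker execution role grants S3 access to buckets whose name contains that string; checking it up front avoids a late permissions failure. A minimal sketch (`validate_bucket_name` is an illustrative helper, not part of the project):

```python
def validate_bucket_name(bucket: str) -> bool:
    """The default SageMaker execution role can access buckets whose name
    contains 'sagemaker'; the S3 bucket prerequisite above relies on that."""
    return "sagemaker" in bucket.lower()
```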
```bash
# Download and extract vLLM installation files
cd ~/SageMaker
wget https://zz-common.s3.us-east-1.amazonaws.com/tmp/install.tar
tar -xvf install.tar
cd install

# Clone vLLM repository
git clone https://github.com/vllm-project/vllm --branch v0.6.1.post2 --single-branch

# Copy customized files
cp arg_utils.py ./vllm/vllm/engine/
cp setup.py ./vllm/
cp neuron.py ./vllm/vllm/model_executor/model_loader/
```

```bash
cd ~/SageMaker

# Login to the AWS Deep Learning Containers ECR registry
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com

# Build container (takes ~10 minutes)
docker build -t sagemaker-neuron-container:deepseek .

# Replace with your AWS Account ID
account_id=<Your AWS Account ID>

# Login to your own ECR registry, then tag and push the image
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin ${account_id}.dkr.ecr.us-west-2.amazonaws.com
docker tag sagemaker-neuron-container:deepseek ${account_id}.dkr.ecr.us-west-2.amazonaws.com/sagemaker-neuron-container:deepseek
docker push ${account_id}.dkr.ecr.us-west-2.amazonaws.com/sagemaker-neuron-container:deepseek
```

```python
# Install required packages
!pip install --upgrade huggingface_hub

# Download model (example: DeepSeek-R1-Distill-Qwen-7B)
from huggingface_hub import snapshot_download

model_id = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-7B'
snapshot_download(repo_id=model_id, local_dir="./models/" + model_id)

# Upload to S3
local_path = "./models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/"
s3_bucket_name = "<YOUR BUCKET NAME>"
s3_path = f"s3://{s3_bucket_name}/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/"
!aws s3 sync {local_path} {s3_path}
```
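Before syncing to S3 it can be worth confirming the download actually completed; this sketch only checks for files that every Hugging Face transformer snapshot ships (`config.json` and the tokenizer config) and is not part of the project:

```python
import os

# Files every Hugging Face transformer snapshot ships; checked before upload.
REQUIRED = ("config.json", "tokenizer_config.json")

def missing_files(present: set[str]) -> list[str]:
    """Return required files absent from a snapshot's file listing."""
    return [f for f in REQUIRED if f not in present]

def check_snapshot(local_path: str) -> list[str]:
    """List required files missing from the downloaded model directory."""
    return missing_files(set(os.listdir(local_path)))
```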
```python
import boto3
import sagemaker
import time

name = "sagemaker-vllm-neuron-qwen-7b-inf2"
role = sagemaker.get_execution_role()
sm_client = boto3.client(service_name='sagemaker')
account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.Session().region_name
image_url = f"{account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-neuron-container:deepseek"
s3_bucket_name = "<YOUR BUCKET NAME>"
model_url = f"s3://{s3_bucket_name}/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/"

# Create model
sm_client.create_model(
    ModelName=name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        'Image': image_url,
        "ModelDataSource": {
            "S3DataSource": {
                "S3Uri": model_url,
                "S3DataType": "S3Prefix",
                "CompressionType": "None",
            },
        },
        'Environment': {
            'NUM_CORES': '2',
            'BATCH_SIZE': '8',
            'SEQUENCE_LENGTH': '4096',
        }
    }
)

# Create endpoint config
sm_client.create_endpoint_config(
    EndpointConfigName=name,
    ProductionVariants=[{
        'InstanceType': 'ml.inf2.8xlarge',
        'InitialInstanceCount': 1,
        'ModelName': name,
        'VariantName': 'AllTraffic',
        "VolumeSizeInGB": 100,
        "ModelDataDownloadTimeoutInSeconds": 300,
        "ContainerStartupHealthCheckTimeoutInSeconds": 600
    }]
)

# Create endpoint
sm_client.create_endpoint(
    EndpointName=name,
    EndpointConfigName=name
)
print(f"Endpoint Name: {name}")
```
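`create_endpoint` returns immediately, and an inf2 endpoint can take many minutes to download weights, compile the model, and pass the startup health check, so it helps to poll before invoking. A generic sketch using the standard `DescribeEndpoint` call (not code from the project):

```python
import time

def is_terminal(status: str) -> bool:
    """Endpoint creation has finished once the status is InService or Failed."""
    return status in {"InService", "Failed"}

def wait_for_endpoint(endpoint_name: str, poll_seconds: int = 30) -> str:
    """Poll DescribeEndpoint until creation succeeds or fails."""
    import boto3  # local import so is_terminal stays usable without boto3
    sm = boto3.client("sagemaker")
    status = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    while not is_terminal(status):
        time.sleep(poll_seconds)
        status = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    return status
```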
```python
import boto3
import json

name = "sagemaker-vllm-neuron-qwen-7b-inf2"
smr_client = boto3.client(service_name='sagemaker-runtime')

payload = {
    "inputs": "What is the capital of France?",
    "parameters": {
        "max_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.9,
    }
}

response = smr_client.invoke_endpoint(
    EndpointName=name,
    Body=json.dumps(payload),
    Accept="application/json",
    ContentType="application/json",
)

result = json.loads(response["Body"].read().decode('utf-8'))
print(result)
```

The container exposes two endpoints required by SageMaker:
- `/ping`: Health check endpoint
- `/invocations`: Inference endpoint
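The `/invocations` body follows the schema shown in the invocation example above, so a handler's first step is to split it into the prompt and its sampling parameters. A minimal, framework-agnostic sketch (this is not the project's `main.py`):

```python
import json

def parse_invocation(body: bytes) -> tuple[str, dict]:
    """Split an /invocations request body into prompt and sampling parameters."""
    payload = json.loads(body)
    return payload["inputs"], payload.get("parameters", {})
```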
Configure the model behavior using these environment variables:
- `NUM_CORES`: Number of Neuron cores (default: 2)
- `BATCH_SIZE`: Maximum batch size (default: 8)
- `SEQUENCE_LENGTH`: Maximum sequence length (default: 4096)
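Inside the container these variables would typically be read once at startup, falling back to the documented defaults. A sketch (variable names come from the list above; the parsing helper itself is an assumption, not project code):

```python
import os

def neuron_config() -> dict:
    """Read the container's tuning knobs, using the documented defaults."""
    return {
        "num_cores": int(os.environ.get("NUM_CORES", "2")),
        "batch_size": int(os.environ.get("BATCH_SIZE", "8")),
        "sequence_length": int(os.environ.get("SEQUENCE_LENGTH", "4096")),
    }
```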
- AWS PyTorch Inference Neuronx: `2.1.2-neuronx-py310-sdk2.20.1-ubuntu20.04`
- vLLM: v0.6.1.post2 with Neuron support
- FastAPI: 0.115.4 for API handling
- Transformers: 4.43.2 for model loading
- Optimum Neuron: For Inferentia2 optimization
```
/opt/
├── main.py     # FastAPI application
├── serve       # Startup script
└── ml/model/   # Model files (mounted by SageMaker)
```
```python
import boto3

name = "sagemaker-vllm-neuron-qwen-7b-inf2"
sm_client = boto3.client(service_name='sagemaker')

# Delete in dependency order: endpoint, then endpoint config, then model
sm_client.delete_endpoint(EndpointName=name)
sm_client.delete_endpoint_config(EndpointConfigName=name)
sm_client.delete_model(ModelName=name)
```

- Fully Managed Infrastructure: Automatic scaling and patching
- Auto Scaling: Responds to workload changes
- Monitoring: CloudWatch and CloudTrail integration
- Multiple Deployment Options: Real-time, batch, and async endpoints
- Advanced Features: Inference recommender and shadow testing
- Deploying the DeepSeek R1 Distillation Model using Amazon Inferentia2 - Part Two - AWS China Blog (Primary Reference)
- AWS Neuron Documentation
- vLLM Neuron Installation Guide
- SageMaker Custom Containers
- DeepSeek R1 Models on Hugging Face
This project follows the licensing terms of the underlying components (vLLM, DeepSeek models, etc.).