
# vLLM Neuron SageMaker BYOC (Bring Your Own Container)

Deploy DeepSeek R1 Distillation models on Amazon SageMaker using vLLM with AWS Inferentia2 chips through a custom Docker container.

## Overview

This project provides a custom vLLM inference container for deploying DeepSeek R1 Distillation models on Amazon SageMaker Endpoints using AWS Inferentia2 instances. The solution leverages vLLM's Neuron support for high-performance inference on AWS's custom silicon.

## Supported Models

| Distillation Model | Base Model | Recommended Instance |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | ml.inf2.xlarge |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | ml.inf2.8xlarge |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | ml.inf2.8xlarge |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | ml.inf2.8xlarge |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | ml.inf2.24xlarge |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | ml.inf2.48xlarge / ml.trn1.32xlarge |
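For scripted deployments, the table above can be encoded as a small lookup. This is only a sketch; `RECOMMENDED_INSTANCE` and `recommended_instance` are illustrative names, not part of this repository:

```python
# Recommended instance per distillation model (taken from the table above)
RECOMMENDED_INSTANCE = {
    "DeepSeek-R1-Distill-Qwen-1.5B": "ml.inf2.xlarge",
    "DeepSeek-R1-Distill-Qwen-7B": "ml.inf2.8xlarge",
    "DeepSeek-R1-Distill-Llama-8B": "ml.inf2.8xlarge",
    "DeepSeek-R1-Distill-Qwen-14B": "ml.inf2.8xlarge",
    "DeepSeek-R1-Distill-Qwen-32B": "ml.inf2.24xlarge",
    "DeepSeek-R1-Distill-Llama-70B": "ml.inf2.48xlarge",  # or ml.trn1.32xlarge
}

def recommended_instance(model_id: str) -> str:
    """Look up the recommended instance for an ID like 'deepseek-ai/DeepSeek-R1-Distill-Qwen-7B'."""
    return RECOMMENDED_INSTANCE[model_id.split("/")[-1]]
```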

## Architecture

The solution consists of:

- **FastAPI application** (`main.py`): handles SageMaker inference requests
- **vLLM engine**: optimized for Neuron inference on Inferentia2
- **Docker container**: custom image with vLLM Neuron support
- **SageMaker integration**: compatible with SageMaker hosting requirements

## Prerequisites

1. **AWS account** with SageMaker access
2. **Inferentia2 quota**: request a quota increase for `ml.inf2.*` instances if needed
3. **ECR repository**: create a repository named `sagemaker-neuron-container`
4. **S3 bucket**: create a bucket for storing model weights (the name must contain "sagemaker" so the default SageMaker execution role can read it)
5. **SageMaker notebook instance**: `ml.t3.large` with 100 GB of storage is recommended

## Quick Start

### 1. Prepare the Environment

```bash
# Download and extract the customized vLLM installation files
cd ~/SageMaker
wget https://zz-common.s3.us-east-1.amazonaws.com/tmp/install.tar
tar -xvf install.tar
cd install

# Clone the vLLM repository at the pinned release
git clone https://github.com/vllm-project/vllm --branch v0.6.1.post2 --single-branch

# Overlay the customized files onto the vLLM source tree
cp arg_utils.py ./vllm/vllm/engine/
cp setup.py ./vllm/
cp neuron.py ./vllm/vllm/model_executor/model_loader/
```

### 2. Build Docker Image

```bash
cd ~/SageMaker

# Log in to the AWS Deep Learning Containers registry (provides the base image)
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com

# Build the container (takes ~10 minutes)
docker build -t sagemaker-neuron-container:deepseek .
```

### 3. Push to ECR

```bash
# Replace with your AWS account ID
account_id=<Your AWS Account ID>

# Log in to your own ECR registry
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin ${account_id}.dkr.ecr.us-west-2.amazonaws.com

docker tag sagemaker-neuron-container:deepseek ${account_id}.dkr.ecr.us-west-2.amazonaws.com/sagemaker-neuron-container:deepseek
docker push ${account_id}.dkr.ecr.us-west-2.amazonaws.com/sagemaker-neuron-container:deepseek
```

### 4. Upload Model to S3

Run the following in a notebook cell (the `!` lines are shell commands):

```python
# Install the Hugging Face Hub client
!pip install --upgrade huggingface_hub

# Download the model (example: DeepSeek-R1-Distill-Qwen-7B)
from huggingface_hub import snapshot_download

model_id = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-7B'
snapshot_download(repo_id=model_id, local_dir="./models/" + model_id)

# Upload to S3
local_path = "./models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/"
s3_bucket_name = "<YOUR BUCKET NAME>"
s3_path = f"s3://{s3_bucket_name}/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/"

!aws s3 sync {local_path} {s3_path}
```

### 5. Deploy to SageMaker

```python
import boto3
import sagemaker
import time

name = "sagemaker-vllm-neuron-qwen-7b-inf2"
role = sagemaker.get_execution_role()
sm_client = boto3.client(service_name='sagemaker')
account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.Session().region_name

image_url = f"{account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-neuron-container:deepseek"
s3_bucket_name = "<YOUR BUCKET NAME>"
model_url = f"s3://{s3_bucket_name}/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/"

# Create the model, pointing at the custom image and the uncompressed weights in S3
sm_client.create_model(
    ModelName=name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        'Image': image_url,
        "ModelDataSource": {
            "S3DataSource": {
                "S3Uri": model_url,
                "S3DataType": "S3Prefix",
                "CompressionType": "None",
            },
        },
        'Environment': {
            'NUM_CORES': '2',
            'BATCH_SIZE': '8',
            'SEQUENCE_LENGTH': '4096',
        }
    }
)

# Create the endpoint config
sm_client.create_endpoint_config(
    EndpointConfigName=name,
    ProductionVariants=[{
        'InstanceType': 'ml.inf2.8xlarge',
        'InitialInstanceCount': 1,
        'ModelName': name,
        'VariantName': 'AllTraffic',
        "VolumeSizeInGB": 100,
        "ModelDataDownloadTimeoutInSeconds": 300,
        "ContainerStartupHealthCheckTimeoutInSeconds": 600
    }]
)

# Create the endpoint (provisioning is asynchronous)
sm_client.create_endpoint(
    EndpointName=name,
    EndpointConfigName=name
)

print(f"Endpoint Name: {name}")
```
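`create_endpoint` returns immediately while SageMaker provisions the instance in the background, so you may want to block until the endpoint is usable. A minimal polling sketch (`wait_for_endpoint` is a hypothetical helper, not part of this repo):

```python
import time

def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll DescribeEndpoint until the endpoint is InService, raising on failure or timeout."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not InService after {timeout_seconds}s")

# Usage with a real client:
# wait_for_endpoint(boto3.client("sagemaker"), name)
```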

## Usage

### Inference Example

```python
import boto3
import json

name = "sagemaker-vllm-neuron-qwen-7b-inf2"
smr_client = boto3.client(service_name='sagemaker-runtime')

payload = {
    "inputs": "What is the capital of France?",
    "parameters": {
        "max_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.9,
    }
}

response = smr_client.invoke_endpoint(
    EndpointName=name,
    Body=json.dumps(payload),
    Accept="application/json",
    ContentType="application/json",
)

result = json.loads(response["Body"].read().decode('utf-8'))
print(result)
```
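If you call the endpoint often, the request/response plumbing above can be wrapped in a small helper. This is only a sketch; `build_payload` and `invoke` are hypothetical names, not part of this repo:

```python
import json

def build_payload(prompt, **params):
    """Assemble the JSON request body expected by the container."""
    return json.dumps({"inputs": prompt, "parameters": params})

def invoke(smr_client, endpoint_name, prompt, **params):
    """Invoke the endpoint and decode the JSON response body."""
    response = smr_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=build_payload(prompt, **params),
        Accept="application/json",
        ContentType="application/json",
    )
    return json.loads(response["Body"].read().decode("utf-8"))

# Usage with a real client:
# invoke(boto3.client("sagemaker-runtime"), name, "What is the capital of France?", max_tokens=256)
```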

## API Endpoints

The container exposes the two endpoints required by SageMaker:

- `/ping`: health check endpoint
- `/invocations`: inference endpoint
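In essence, the `/invocations` handler has to split the request body into a prompt and sampling parameters before handing them to the vLLM engine. A hypothetical sketch of that parsing step (`parse_invocation` is an illustrative name, not the actual `main.py` code):

```python
import json

def parse_invocation(body: bytes):
    """Split a SageMaker request body into (prompt, sampling parameters)."""
    payload = json.loads(body)
    prompt = payload["inputs"]
    params = payload.get("parameters", {})  # missing parameters fall back to engine defaults
    return prompt, params
```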

## Environment Variables

Configure model behavior with these environment variables:

- `NUM_CORES`: number of Neuron cores (default: 2)
- `BATCH_SIZE`: maximum batch size (default: 8)
- `SEQUENCE_LENGTH`: maximum sequence length (default: 4096)
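A container reading these variables might apply the documented defaults like this (a sketch; `neuron_config` is a hypothetical helper, not the actual `main.py` code):

```python
import os

def neuron_config(env=os.environ):
    """Read the container's tuning knobs, falling back to the documented defaults."""
    return {
        "num_cores": int(env.get("NUM_CORES", "2")),
        "batch_size": int(env.get("BATCH_SIZE", "8")),
        "sequence_length": int(env.get("SEQUENCE_LENGTH", "4096")),
    }
```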

## Container Details

### Base Image

- AWS PyTorch Inference Neuronx: `2.1.2-neuronx-py310-sdk2.20.1-ubuntu20.04`

### Key Components

- **vLLM**: v0.6.1.post2 with Neuron support
- **FastAPI**: 0.115.4 for API handling
- **Transformers**: 4.43.2 for model loading
- **Optimum Neuron**: for Inferentia2 optimization

## File Structure

```
/opt/
├── main.py          # FastAPI application
├── serve            # Startup script
└── ml/model/        # Model files (mounted by SageMaker)
```

## Cleanup

Delete the endpoint, endpoint config, and model when you are done to stop incurring charges:

```python
import boto3

name = "sagemaker-vllm-neuron-qwen-7b-inf2"
sm_client = boto3.client(service_name='sagemaker')

sm_client.delete_endpoint(EndpointName=name)
sm_client.delete_endpoint_config(EndpointConfigName=name)
sm_client.delete_model(ModelName=name)
```

## Benefits of SageMaker Deployment

- **Fully managed infrastructure**: automatic scaling and patching
- **Auto scaling**: responds to workload changes
- **Monitoring**: CloudWatch and CloudTrail integration
- **Multiple deployment options**: real-time, batch, and asynchronous endpoints
- **Advanced features**: Inference Recommender and shadow testing


## License

This project follows the licensing terms of the underlying components (vLLM, DeepSeek models, etc.).
