Welcome to this quickstart guide on integrating Cloudsmith artifact management with your Sagemaker workflows.
You will need a Cloudsmith repository with:
- 2 Python upstreams
- an upstream for the distilgpt2 base model
Install: awscli v2, docker, python3.
Copy the .env-example to .env and fill in:
CLOUDSMITH_NAMESPACE=
CLOUDSMITH_REPO=your-
CLOUDSMITH_USERNAME=
CLOUDSMITH_API_KEY=
AWS_REGION=us-east-1
Before running the scripts below, make sure your current shell session is authenticated with the AWS CLI, and that you are logged in (via docker login) to the Cloudsmith Docker repository you created.
The bootstrap script creates a secret, an S3 bucket, an IAM role, a VPC, and a registry-auth Lambda, then auto-updates .env with the bootstrap output:
bash infrastructure/bootstrap.sh
export DOCKER_BUILDKIT=1
bash scripts/build_and_push_training.sh
bash scripts/build_and_push_inference.sh
python scripts/launch_training_job.py --epochs 1 --instance-type ml.m5.large
Training downloads base model from Cloudsmith HF endpoint, fine‑tunes, then uploads a new revision and records CLOUDSMITH_HF_FINETUNED_REVISION.
python scripts/deploy_inference_endpoint.py \
--endpoint-name cloudsmith-inference-endpoint \
--startup-timeout 180
aws sagemaker-runtime invoke-endpoint \
--endpoint-name cloudsmith-inference-endpoint \
--region $AWS_REGION \
--content-type application/json \
--cli-binary-format raw-in-base64-out \
--body '{"inputs":"Hi"}' /dev/stdout
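The same invocation can be made from Python with boto3's sagemaker-runtime client. The helper below simply mirrors the CLI arguments above (the endpoint name is the one deployed earlier):

```python
import json

def build_invoke_args(endpoint_name, text):
    """Keyword arguments for sagemaker-runtime invoke_endpoint, mirroring the CLI call."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps({"inputs": text}),
    }

# Usage (requires AWS credentials and the deployed endpoint):
# import boto3, os
# client = boto3.client("sagemaker-runtime", region_name=os.environ["AWS_REGION"])
# response = client.invoke_endpoint(**build_invoke_args("cloudsmith-inference-endpoint", "Hi"))
# print(response["Body"].read().decode())
```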
Use infrastructure/cleanup.sh to remove demo resources (endpoint, models/configs, IAM roles, Lambda, VPC + sub-resources, S3 bucket, secret).
Note that some VPC resources may need to be removed manually.
Dry run:
DRY_RUN=true SKIP_CONFIRM=true bash infrastructure/cleanup.sh
Cleanup:
bash infrastructure/cleanup.sh
Use Cloudsmith’s Hugging Face-compatible endpoint (upload and snapshot download only). Upload a model:
from huggingface_hub import HfApi
import os
ns = os.environ['CLOUDSMITH_NAMESPACE']
repo = os.environ['CLOUDSMITH_REPO']
token = os.environ['CLOUDSMITH_API_KEY'] # or fetched from Secrets Manager
endpoint = f"https://huggingface.cloudsmith.io/{ns}/{repo}"
api = HfApi(token=token, endpoint=endpoint)
api.upload_folder(
    folder_path="/opt/ml/model",  # directory with model files
    repo_id="distilbert-base-uncased-finetuned",
    repo_type="model",
    token=token,
)

Download a snapshot:

from huggingface_hub import HfApi
import os
ns = os.environ['CLOUDSMITH_NAMESPACE']
repo = os.environ['CLOUDSMITH_REPO']
token = os.environ['CLOUDSMITH_API_KEY']
endpoint = f"https://huggingface.cloudsmith.io/{ns}/{repo}"
api = HfApi(token=token, endpoint=endpoint)
local_dir = "/opt/ml/model"
api.snapshot_download(
    repo_id="distilbert-base-uncased-finetuned",  # or base model name
    repo_type="model",
    revision="main",  # or a specific uploaded revision hash/tag
    local_dir=local_dir,
    token=token,
)

SageMaker must reach the private Cloudsmith registry (container and Hugging Face endpoints) over the public internet. When you use a private image, launch the training job or endpoint inside a VPC so that the underlying instance has:
- Private subnets (we create two) where the job runs.
- An egress path (NAT) so those subnets can reach docker.cloudsmith.io and huggingface.cloudsmith.io to pull the image and model files.
- A security group that allows outbound HTTPS (inbound is not needed for pulls).
If egress is missing (no NAT / route), image pull or model download will fail.
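A quick way to verify egress from inside the VPC (for example, from a debug container running in the private subnets) is to test outbound TCP on port 443 to both Cloudsmith endpoints. This is an illustrative check, not part of the repo's scripts:

```python
import socket

CLOUDSMITH_HOSTS = ["docker.cloudsmith.io", "huggingface.cloudsmith.io"]

def check_egress(hosts=CLOUDSMITH_HOSTS, port=443, timeout=5):
    """Return {host: True/False} for outbound TCP connectivity on the given port."""
    results = {}
    for host in hosts:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[host] = True
        except OSError:  # DNS failure, timeout, or refused connection
            results[host] = False
    return results
```

If either host comes back False, check the NAT route on the private subnets and the security group's outbound rules before re-running the job.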
AWS does not currently authenticate to private third‑party registries during SageMaker CreateModel for real‑time inference; only ECR (or public/no‑auth images) are supported.