Description
Data scientists train models using MLFlow but must manually package them as ModelKits for deployment. This breaks workflow continuity and creates opportunities for version mismatches between what was tracked in MLFlow and what gets deployed.
Proposed Solution
Add MLFlow as an import source with the URI syntax: kit import mlflow://[tracking_uri/]experiments/{exp_id}/runs/{run_id}
Architecture
Leverage the existing kit init pipeline (which already runs during import). The work breaks down into:
- MLFlow URI handler - parse URIs, extract tracking server + run identifiers (see the sketch after this list)
- Artifact downloader - pull run artifacts to a temp directory
- Metadata enrichment - inject MLFlow provenance into the generated Kitfile
- Error handling - deal with auth, large files, incomplete runs
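A minimal sketch of the URI handler in Go. All names here (MLFlowRef, ParseURI) are illustrative, not existing KitOps code, and defaulting a bare host to https:// is an assumption:

```go
// Hypothetical sketch of the mlflow:// URI handler.
package mlflow

import (
	"fmt"
	"strings"
)

type MLFlowRef struct {
	TrackingURI  string // empty means "use MLFLOW_TRACKING_URI"
	ExperimentID string // empty means "default experiment"
	RunID        string
}

// ParseURI handles the three documented forms:
//   mlflow://host/experiments/{exp}/runs/{run}
//   mlflow://experiments/{exp}/runs/{run}
//   mlflow://runs/{run}
func ParseURI(uri string) (*MLFlowRef, error) {
	rest, ok := strings.CutPrefix(uri, "mlflow://")
	if !ok {
		return nil, fmt.Errorf("not an mlflow URI: %s", uri)
	}
	parts := strings.Split(rest, "/")
	ref := &MLFlowRef{}
	// Anything before the first "experiments"/"runs" keyword is a
	// tracking host; https:// is an assumed default scheme.
	if parts[0] != "experiments" && parts[0] != "runs" {
		ref.TrackingURI = "https://" + parts[0]
		parts = parts[1:]
	}
	switch {
	case len(parts) == 4 && parts[0] == "experiments" && parts[2] == "runs":
		ref.ExperimentID, ref.RunID = parts[1], parts[3]
	case len(parts) == 2 && parts[0] == "runs":
		ref.RunID = parts[1]
	default:
		return nil, fmt.Errorf("unrecognized mlflow URI path: %s", rest)
	}
	return ref, nil
}
```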
Implementation Details
URI Format
```shell
# With explicit tracking URI
kit import mlflow://mlflow.company.com/experiments/42/runs/abc123 -t mymodel:v1

# Using MLFLOW_TRACKING_URI env var
export MLFLOW_TRACKING_URI=http://localhost:5000
kit import mlflow://experiments/42/runs/abc123 -t mymodel:v1

# Short form (uses default experiment)
kit import mlflow://runs/abc123 -t mymodel:v1
```
Data Flow
1. Parse mlflow:// URI → tracking_uri, experiment_id, run_id
2. MLFlow client: fetch run metadata + list artifacts
3. Download artifacts to temp dir (filtered by size/type)
4. Run kit init on temp dir → generates Kitfile
5. Augment Kitfile with MLFlow provenance metadata
6. Pack ModelKit using existing pipeline
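Steps 2 and 3 don't strictly require the MLFlow client library: the tracking server exposes REST endpoints (runs/get, artifacts/list) that are enough for a dependency-free Go implementation. A rough sketch, with hypothetical helper names and error handling trimmed:

```go
// Sketch of steps 2-3 against the MLFlow REST API.
package mlflow

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

type RunInfo struct {
	Status string `json:"status"` // RUNNING, FINISHED, FAILED, KILLED, SCHEDULED
}

type ArtifactFile struct {
	Path  string `json:"path"`
	IsDir bool   `json:"is_dir"`
	// MLFlow's proto-based JSON encodes int64 fields as strings.
	FileSize int64 `json:"file_size,string"`
}

// GetRun fetches run metadata via GET /api/2.0/mlflow/runs/get.
func GetRun(tracking, runID string) (*RunInfo, error) {
	resp, err := http.Get(fmt.Sprintf("%s/api/2.0/mlflow/runs/get?run_id=%s",
		tracking, url.QueryEscape(runID)))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var body struct {
		Run struct {
			Info RunInfo `json:"info"`
		} `json:"run"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, err
	}
	return &body.Run.Info, nil
}

// ListArtifacts lists run artifacts via GET /api/2.0/mlflow/artifacts/list.
func ListArtifacts(tracking, runID, path string) ([]ArtifactFile, error) {
	resp, err := http.Get(fmt.Sprintf("%s/api/2.0/mlflow/artifacts/list?run_id=%s&path=%s",
		tracking, url.QueryEscape(runID), url.QueryEscape(path)))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var body struct {
		Files []ArtifactFile `json:"files"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, err
	}
	return body.Files, nil
}
```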
Implementation Challenges
1. Authentication Hell
MLFlow supports multiple auth mechanisms with no standard:
- Basic auth (username/password)
- Token-based (custom headers)
- Cloud provider auth (AWS IAM, GCP service accounts) for artifact stores
- No auth (local/trusted network)
Approach:
- Support MLFLOW_TRACKING_URI and MLFLOW_TRACKING_TOKEN env vars
- Document that artifact store auth must be handled separately (AWS credentials, GCS keys, etc.)
- Fail fast with clear error messages when auth is missing
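A sketch of fail-fast credential resolution, assuming the standard MLFLOW_TRACKING_TOKEN / MLFLOW_TRACKING_USERNAME / MLFLOW_TRACKING_PASSWORD env vars used by the MLFlow client (bearer token takes precedence; authTransport and NewClient are hypothetical names):

```go
// Sketch of env-var credential resolution with fail-fast errors.
package mlflow

import (
	"errors"
	"net/http"
	"os"
)

// authTransport injects MLFlow credentials into every tracking-server request.
type authTransport struct {
	base  http.RoundTripper
	token string
	user  string
	pass  string
}

func (t *authTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if t.token != "" {
		req.Header.Set("Authorization", "Bearer "+t.token)
	} else if t.user != "" {
		req.SetBasicAuth(t.user, t.pass)
	}
	return t.base.RoundTrip(req)
}

// NewClient resolves the tracking URI and credentials, failing fast
// with a clear message when required configuration is missing.
func NewClient(trackingURI string) (*http.Client, string, error) {
	if trackingURI == "" {
		trackingURI = os.Getenv("MLFLOW_TRACKING_URI")
	}
	if trackingURI == "" {
		return nil, "", errors.New("no tracking server: pass one in the URI or set MLFLOW_TRACKING_URI")
	}
	return &http.Client{Transport: &authTransport{
		base:  http.DefaultTransport,
		token: os.Getenv("MLFLOW_TRACKING_TOKEN"),
		user:  os.Getenv("MLFLOW_TRACKING_USERNAME"),
		pass:  os.Getenv("MLFLOW_TRACKING_PASSWORD"),
	}}, trackingURI, nil
}
```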
2. Large Artifact Handling
A 50GB model checkpoint will time out or OOM with a naive download, so filter the artifact list and download only what actually needs to be packed.
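For example, the listing can be filtered before anything is downloaded. The size cap and exclude list below are illustrative defaults, not decided behavior (ArtifactFile comes from the earlier sketch):

```go
// Sketch of a size/type filter over the artifact listing.
package mlflow

import "strings"

// Hypothetical per-file cap; the real limit should be configurable.
const maxArtifactSize = 10 << 30 // 10 GiB

// File types that never belong in a ModelKit (illustrative list).
var excludedSuffixes = []string{".tmp", ".log", ".lock"}

// FilterArtifacts keeps only files worth downloading and packing.
// Directories are assumed to be expanded recursively elsewhere.
func FilterArtifacts(files []ArtifactFile) (keep []ArtifactFile) {
	for _, f := range files {
		if f.IsDir || f.FileSize > maxArtifactSize {
			continue
		}
		skip := false
		for _, suf := range excludedSuffixes {
			if strings.HasSuffix(f.Path, suf) {
				skip = true
				break
			}
		}
		if !skip {
			keep = append(keep, f)
		}
	}
	return keep
}
```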
3. Incomplete/Failed Runs
MLFlow runs can be RUNNING, FAILED, or KILLED, and their artifacts may be partial. Only import FINISHED runs.
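Continuing the earlier sketch, the guard is a single status check against the run metadata (FINISHED is one of MLFlow's documented run statuses):

```go
// Sketch of the status guard; GetRun is defined in the earlier sketch.
package mlflow

import "fmt"

// ensureFinished rejects any run whose status is not FINISHED.
func ensureFinished(tracking, runID string) error {
	info, err := GetRun(tracking, runID)
	if err != nil {
		return err
	}
	if info.Status != "FINISHED" {
		return fmt.Errorf("run %s has status %s; only FINISHED runs can be imported", runID, info.Status)
	}
	return nil
}
```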
4. Storage Backend Diversity
MLFlow artifact stores can be:
- Local filesystem (file:///)
- S3 (s3://)
- GCS (gs://)
- Azure (wasbs://)
- SFTP, NFS, etc.
We can either implement these backends in Kit or rely on the MLFlow client.
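If we rely on the MLFlow client, one low-effort option is shelling out to the mlflow CLI, which already speaks every backend above. A sketch assuming mlflow is installed and on PATH:

```go
// Sketch of the "rely on the MLFlow client" option: delegate artifact
// retrieval to `mlflow artifacts download`.
package mlflow

import (
	"fmt"
	"os"
	"os/exec"
)

// DownloadArtifacts downloads all artifacts for a run into dst.
func DownloadArtifacts(trackingURI, runID, dst string) error {
	cmd := exec.Command("mlflow", "artifacts", "download",
		"--run-id", runID, "--dst-path", dst)
	// The CLI reads the tracking server from the environment.
	cmd.Env = append(os.Environ(), "MLFLOW_TRACKING_URI="+trackingURI)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("mlflow artifacts download failed: %v: %s", err, out)
	}
	return nil
}
```

The trade-off: this adds a runtime dependency on a Python install, but avoids reimplementing and maintaining auth for S3, GCS, Azure, SFTP, and every future backend inside Kit.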