Commit a35f80f

feat: support for flux framework as hpc manager

Flux supports the majority of MPI flavors/variants, and can be used to bootstrap MPI as a plugin. It adds other features for scheduling and topology that can be used for simulations and ai/ml jobs. This changeset adds the plugin implementation, including the plugin module, tests, and an example with a small README to serve as documentation for the time being.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
1 parent 1fe3bd3 commit a35f80f

32 files changed: 1802 additions & 9 deletions

api/openapi-spec/swagger.json

Lines changed: 27 additions & 0 deletions
Some generated files are not rendered by default.

api/python_api/kubeflow_trainer_api/models/__init__.py

Lines changed: 1 addition & 0 deletions

api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_flux_ml_policy_source.py

Lines changed: 87 additions & 0 deletions

api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_hpcml_policy_source.py

Lines changed: 91 additions & 0 deletions

api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_ml_policy.py

Lines changed: 7 additions & 1 deletion

api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_ml_policy_source.py

Lines changed: 7 additions & 1 deletion

charts/kubeflow-trainer/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml

Lines changed: 10 additions & 0 deletions
@@ -47,6 +47,16 @@ spec:
                 description: mlPolicy provides the ML-specific parameters for the
                   model training.
                 properties:
+                  flux:
+                    description: flux defines the configuration for the Flux runtime.
+                    properties:
+                      numProcPerNode:
+                        default: 1
+                        description: numProcPerNode is the number of processes per
+                          node.
+                        format: int32
+                        type: integer
+                    type: object
                   mpi:
                     description: mpi defines the configuration for the MPI Runtime.
                     properties:
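
For orientation, the new `flux` block sits beside `mpi` under `mlPolicy`. The fragment below is a minimal sketch of how a runtime might set it. It assumes the `flux-runtime` name used by the example TrainJob in this commit, the value `2` is purely illustrative, and the required JobSet `template` is left out, so it is not deployable as written.

```yaml
# Hypothetical fragment for illustration; not part of this commit.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: flux-runtime        # assumed: matches runtimeRef in the example TrainJob
spec:
  mlPolicy:
    flux:
      numProcPerNode: 2     # illustrative; the schema default is 1
  # template: the JobSet template is omitted from this sketch
```

If `numProcPerNode` is left unset, the schema default of 1 applies.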

charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainingruntimes.yaml

Lines changed: 10 additions & 0 deletions
@@ -47,6 +47,16 @@ spec:
                 description: mlPolicy provides the ML-specific parameters for the
                   model training.
                 properties:
+                  flux:
+                    description: flux defines the configuration for the Flux runtime.
+                    properties:
+                      numProcPerNode:
+                        default: 1
+                        description: numProcPerNode is the number of processes per
+                          node.
+                        format: int32
+                        type: integer
+                    type: object
                   mpi:
                     description: mpi defines the configuration for the MPI Runtime.
                     properties:
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+# This example deploys the LAMMPS Molecular Dynamic Simulator
+# with MPI orchestrated by the Flux workload manager on 4 nodes.
+# The problem size is defined by the coordinates x,y,z, and the
+# parameter file reaxc.hns.
+# The image has the application, LAMMPS, installed (no Flux)
+# A Flux view will be added on the fly by the Kubeflow trainer
+# The 4 pods ideally map 1:1 to nodes, encompassing a cluster
+# The underlying abstraction is a JobSet with a headless service
+# Flux supports low-latency with Infiniband, EFA, etc., however
+# standard ethernet is used here.
+apiVersion: trainer.kubeflow.org/v1alpha1
+kind: TrainJob
+metadata:
+  name: lammps-flux-interactive
+spec:
+  # Reference the pre-defined runtime by name
+  runtimeRef:
+    name: flux-runtime
+  trainer:
+    numNodes: 4
+    image: ghcr.io/converged-computing/metric-lammps:latest
+    # You do not need to write "flux run, etc" here. It will be wrapped
+    command: ["lmp", "-v", "x", "2", "-v", "y", "2", "-v", "z", "2", "-in", "in.reaxc.hns", "-nocite"]
+    # Comment out the command above to make an interactive cluster! Then shell into the 0-0 pod:
+    # # Source environment
+    # . /mnt/flux/flux-view.sh
+    # # Connect to the running lead broker socket
+    # flux proxy $fluxsocket bash
+    # # See Flux resources!
+    # flux resource list
+    # Run lammps!
+    # flux run -N 4 -n 4 lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
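
The comments in the example describe an interactive mode: with `command` left out, the pods come up as a Flux cluster that can be entered by shell. The sketch below shows that variant. It uses only the fields already present in the example and adds no new ones.

```yaml
# Sketch of the interactive variant described in the example's comments.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: lammps-flux-interactive
spec:
  runtimeRef:
    name: flux-runtime
  trainer:
    numNodes: 4
    image: ghcr.io/converged-computing/metric-lammps:latest
    # No command: per the example, this leaves an interactive Flux cluster.
```

After shelling into the 0-0 pod, the commented steps from the example apply: source `/mnt/flux/flux-view.sh`, connect with `flux proxy $fluxsocket bash`, then use `flux resource list` and `flux run` as shown there.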

manifests/base/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml

Lines changed: 10 additions & 0 deletions
@@ -47,6 +47,16 @@ spec:
                 description: mlPolicy provides the ML-specific parameters for the
                   model training.
                 properties:
+                  flux:
+                    description: flux defines the configuration for the Flux runtime.
+                    properties:
+                      numProcPerNode:
+                        default: 1
+                        description: numProcPerNode is the number of processes per
+                          node.
+                        format: int32
+                        type: integer
+                    type: object
                   mpi:
                     description: mpi defines the configuration for the MPI Runtime.
                     properties:
