Add Nvidia MPS component for managing Nvidia GPU resources #647
mishaschwartz wants to merge 9 commits into master from nvidia-mps
Conversation
E2E Test Results
DACCS-iac Pipeline Results
Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/4052/
Result: ❌ FAILURE
BIRDHOUSE_DEPLOY_BRANCH: nvidia-mps
DACCS_IAC_BRANCH: master
DACCS_CONFIGS_BRANCH: master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH: master
PAVICS_SDI_BRANCH: master
DESTROY_INFRA_ON_EXIT: true
PAVICS_HOST: https://host-140-91.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/665/
NOTEBOOK TEST RESULTS
| "gpu_ids": ["0", "1", "2"], | ||
| "gpu_count": 3, | ||
| "gpu_device_mem_limit": "0=1G,1=5G,2=10G", | ||
| "gpu_active_thread_percentage": "10" |
Does the `<gpu-id>=<thread-count>,...` variant work for this also?
As far as I can tell, no... the Nvidia documentation is pretty horrendous, but all the documentation and examples I've seen only show that you can specify per-GPU limits for the memory limit, not for the active thread percentage.
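For reference, a small sketch of the two variable formats as used in this PR (based on the values quoted above and my reading of NVIDIA's MPS documentation; treat the exact accepted syntax as something to verify):

```sh
# Per-device pinned memory limit: comma-separated <gpu-id>=<limit> pairs are accepted.
export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT="0=1G,1=5G,2=10G"

# Active thread percentage: a single value applied to all visible devices;
# a <gpu-id>=<percentage> list does not appear to be documented for this variable.
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=10
```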
CHANGES.md
```
user when they configure the spawner in `JUPYTERHUB_CONFIG_OVERRIDE`.

This also introduces the `JUPYTERHUB_CONFIG_OVERRIDE_INTERNAL` variable which is identical to the
`JUPYTERHUB_CONFIG_OVERRIDE` variable except that it is intended to only be set by other components (not be the
```
typo: "not be the" should read "not by the".
Also, is there any safeguard to establish to ensure users do not employ it? Do we leave it as is without checks, letting users "break it at their own risk" as long as there are enough warnings in the docs about it?
Taking inspiration from the readonly used for MPS, maybe that could be set after loading all components' envs and just before loading the user env.local?
And finally, should something be indicated about components using JUPYTERHUB_CONFIG_OVERRIDE_INTERNAL needing to add COMPONENT_DEPENDENCIES with components/jupyterhub, or otherwise ensuring the nested inclusion of JUPYTERHUB_CONFIG_OVERRIDE_INTERNAL? The COMPONENT_DEPENDENCIES are not applied in the current MPS default.env. I think nothing actually ensures the resolution order besides pure sorted naming of components? If jupyterhub happened to be loaded toward the end because of another dependency, wouldn't its default override all others previously set?
> Also, is there any safeguard to establish to ensure users do not employ it?

We definitely could do that. There are lots of variables that are for internal use only that we don't technically stop users from clobbering. I'll open an issue about that (see #649); it can be part of the general discussion about how we treat variables that we started in #629.
> I think nothing actually ensures the resolution order besides pure sorted naming of components?

If there are no explicit dependencies, then they'll follow the order in which they're defined in BIRDHOUSE_EXTRA_CONF_DIRS.

> If jupyterhub happened to be loaded toward the end because of another dependency, wouldn't its default override all others previously set?

Yes, that's a good point. I'll correct for that.
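As a rough sketch of the kind of declaration being discussed (assuming, as for other birdhouse-deploy components, that COMPONENT_DEPENDENCIES is a whitespace-separated list of component directories set in a component's default.env; the exact syntax should be checked against existing components):

```sh
# Hypothetical addition to birdhouse/optional-components/nvidia-multi-process-service/default.env
# so that the jupyterhub component (and its JUPYTERHUB_CONFIG_OVERRIDE_INTERNAL default)
# is guaranteed to be loaded before this component appends to it.
COMPONENT_DEPENDENCIES="
  ./components/jupyterhub
"
```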
birdhouse/components/jupyterhub/jupyterhub_custom/jupyterhub_custom/custom_dockerspawner.py (review thread resolved)
| if [ "$(nvidia-smi --query-gpu=compute_mode --format=csv,noheader | grep -vc 'Exclusive_Process')" -ne 0 ]; then | ||
| log WARN "Nvidia GPUs with compute mode set to something other than EXCLUSIVE_PROCESS detected. We recommend you set the compute mode to EXCLUSIVE_PROCESS when enabling nvidia's Multi Process Service (MPS)." | ||
| fi |
Is that a hard-requirement for MPS to work, or some other efficiency reason?
It is not clear why it is recommended without context (I haven't played with that service).
If it is a hard-requirement, maybe it should not WARN, but ERROR (and exit?). Depends on the reason.
Activating this component without a GPU/Nvidia-SMI will cause a command error. That is fine since it won't work anyway, but maybe the error should be more gracefully handled?
It is not a hard requirement, which is why it's only a warning. Again, the documentation is awful, so I'm only 90% sure this is the reason:

- MPS runs a server process for each GPU that manages resource allocation on that GPU.
- By setting `EXCLUSIVE_PROCESS` we ensure that only that server process has direct access to the GPU, and every CUDA client process has to go through that server process.
- By setting it to the default value, another client process could access the GPU directly, sidestepping MPS entirely.

There are valid use-cases for running MPS on a GPU without `EXCLUSIVE_PROCESS` set, but those are pretty niche and I don't know enough to fully understand the implications right now.
> Activating this component without a GPU/Nvidia-SMI will cause a command error.

True, I'll add better error handling for this.
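For context, a minimal sketch of how an admin could check and set the compute mode with nvidia-smi (the query form is the same one used by the warning check above; setting the mode requires root, and GPU id 0 is just an example):

```sh
# Show the current compute mode of every GPU
nvidia-smi --query-gpu=compute_mode --format=csv,noheader

# Set GPU 0 to EXCLUSIVE_PROCESS so that only the MPS server process
# can access it directly (run as root)
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
```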
```yaml
- driver: nvidia
  count: all
  capabilities: [gpu]
```
Should this be configurable as well? For example, some GPUs reserved for Jupyter and others reserved for other operations (e.g. Weaver workers)?
Is it better to have all GPU-enabled operations connected to this MPS regardless of how they are used, with limited `gpu_ids` defined for Jupyter vs. others, or instead to have multiple sub-GPU partitions, where what this MPS "sees" corresponds to 100% of a given sub-partition and its jupyter-dockerspawner resources do not need `gpu_ids` (or they do, but relative to that sub-GPU partition)?
I'm going to say that it's better to make everything go through the MPS and then divide up the GPUs when they're assigned to containers (jupyterlab or weaver workers).
The only exception I can think of is if a user has a subset of GPUs that they want to use for birdhouse and another set that they want to use for something else entirely on the same machine. I guess I can make this configurable but if a user is doing something other than the default they have to really really know what they're doing so they don't break things.
Actually you know what... the problem here is actually how docker compose configures this. The `count` and `device_ids` keys are mutually exclusive, and you can't use variable substitution to add lists.
The only way to do this would be to create another optional component with a docker-compose-extra.yml file that contained something like:
```yaml
mps:
  deploy:
    resources:
      reservations:
        devices: !override
          - capabilities: [gpu]
            driver: nvidia
            device_ids: ["0", "1"]
```

If you want to only allow a subset of GPUs.
For now I'll document this with a comment but actually configuring this would require a whole other component.
OK to address in another PR with documentation for the time being.
Indeed, the `count`/`device_ids` conflict is an important consideration. In my case, I would prioritize weaver-workers over Jupyter for GPU use, probably giving a few lower-VRAM ones to Jupyter and leaving the VRAM-heavy computations to dedicated weaver-workers. Since weaver-workers launch dockers via docker-proxy, I am not sure that I could use `count`, since I would need to pre-partition the specific devices for each service.
I guess that also raises another question. How are other non-Jupyter/DockerSpawner services supposed to map with the `mps` service to employ the GPUs it manages? Is there a specific set of options to configure (passing the ipc ID or whatnot)? The connection between `mps` and `jupyterhub` is somewhat hard to interpret because of the intermediate DockerSpawner layer.
> How are other non-Jupyter/DockerSpawner services supposed to map with the `mps` service to employ the GPUs it manages?

Yeah, that's going to be the subject of a future PR. But to summarize here, for all containers that access GPUs you'd need to:

- ensure that the container is using the ipc from the `mps` container
- mount the `nvidia_mps` tmpfs to `/tmp/nvidia-mps`
- enable the gpu devices on the container
There's a reference to this in the PR but I find that this example docker compose project outlines the setup nicely:
https://gitlab.com/nvidia/container-images/samples/-/blob/master/mps/docker-compose.yml
(note that the syntax on that file is slightly outdated but it gives the right idea)
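To make that concrete, here is a rough compose-style sketch of those three requirements for a hypothetical GPU-using service (the service name `my-gpu-worker` and its image are placeholders; whether the shared volume is declared here or by the MPS component, and the exact `ipc` value, should be checked against the linked sample and the component's own compose file):

```yaml
services:
  my-gpu-worker:                       # hypothetical client container
    image: my-gpu-image:latest         # placeholder image running a CUDA application
    ipc: "service:mps"                 # join the IPC namespace of the mps service
    volumes:
      - nvidia_mps:/tmp/nvidia-mps     # share the MPS pipe/log directory
    deploy:
      resources:
        reservations:
          devices:                     # expose the GPU devices to the container
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  nvidia_mps:                          # in the real setup this is defined by the MPS component
```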
birdhouse/optional-components/nvidia-multi-process-service/02-readonly-cuda-vars.sh (review thread resolved)
birdhouse/optional-components/nvidia-multi-process-service/default.env (review thread resolved, outdated)
E2E Test Results
DACCS-iac Pipeline Results
Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/4054/
Result: ✅ SUCCESS
BIRDHOUSE_DEPLOY_BRANCH: nvidia-mps
DACCS_IAC_BRANCH: master
DACCS_CONFIGS_BRANCH: master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH: master
PAVICS_SDI_BRANCH: master
DESTROY_INFRA_ON_EXIT: true
PAVICS_HOST: https://host-140-91.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/667/
NOTEBOOK TEST RESULTS
Overview
This creates a container running Nvidia's Multi Process Service (MPS) which helps manage multi-user GPU access.
It runs an alternative CUDA interface which manages resource allocation when multiple processes are running simultaneously on the same GPU.
It also allows the node admin to set additional per-user limits through the `JUPYTERHUB_RESOURCE_LIMITS` variable which configures Jupyterlab containers:

- `"gpu_device_mem_limit"`: sets the `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT` environment variable
- `"gpu_active_thread_percentage"`: sets the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable

For example, the following will give all users in the group named `"users"` access to three GPUs in their Jupyterlab container. On the first one (id = 0) only 1GB of memory is available, on the second (id = 1) only 5GB, and on the third (id = 2) only 10GB. Additionally, the container will be able to use 10% of available threads on the GPUs.

Note that leaving any of these limits unset will default to allowing the user full access to the given resource.
Update `CustomDockerSpawner` to make pre spawn hooks and resource limits more configurable

Introduce `pre_spawn_hooks` and `resource_limit_callbacks` attributes to the `CustomDockerSpawner` class which can be used to further customize the `CustomDockerSpawner` from optional components. This gives us a way to add additional functionality without having to directly modify existing functions which may be overwritten by the user when they configure the spawner in `JUPYTERHUB_CONFIG_OVERRIDE`.

This also introduces the `JUPYTERHUB_CONFIG_OVERRIDE_INTERNAL` variable which is identical to the `JUPYTERHUB_CONFIG_OVERRIDE` variable except that it is intended to only be set by other components (not by the user in the local environment file). This allows components to customize Jupyterhub deployments without interfering with custom settings created by the user. Note that `JUPYTERHUB_CONFIG_OVERRIDE` has precedence over `JUPYTERHUB_CONFIG_OVERRIDE_INTERNAL`.
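To illustrate the general pattern (purely a sketch with stand-in names; the real `CustomDockerSpawner` lives in birdhouse/components/jupyterhub/jupyterhub_custom/ and its actual attribute semantics may differ):

```python
# Illustrative only: a stand-in spawner showing how hook/callback attributes let
# optional components add behaviour without overriding existing methods.
class SketchSpawner:
    pre_spawn_hooks = []            # assumed: callables run before spawning a container
    resource_limit_callbacks = []   # assumed: callables that can adjust resource limits

    def start(self, limits):
        for hook in self.pre_spawn_hooks:
            hook(self)
        for callback in self.resource_limit_callbacks:
            limits = callback(self, limits)
        return limits

# An optional component could register its customization from
# JUPYTERHUB_CONFIG_OVERRIDE_INTERNAL instead of redefining start():
def add_mps_env(spawner):
    print("would add CUDA MPS environment variables here")

def cap_gpu_threads(spawner, limits):
    limits.setdefault("gpu_active_thread_percentage", "10")
    return limits

SketchSpawner.pre_spawn_hooks.append(add_mps_env)
SketchSpawner.resource_limit_callbacks.append(cap_gpu_threads)
print(SketchSpawner().start({"gpu_count": 3}))
```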
Fixes some examples that showed that `gpu_ids` could be given as integers if they were meant to be indexes. However, due to a limitation of docker they must be strings. This modifies the examples so that it is clear that strings must be used, and also updates the code to ensure that only string values are ever passed to docker when spawning a new jupyterlab server.

Non-breaking changes
Breaking changes
Related Issue / Discussion
Additional Information
CI Operations
birdhouse_daccs_configs_branch: master
birdhouse_skip_ci: false