(Following up on an email thread)
It looks like the page Running your AI training jobs on Satori using Slurm
contains some incorrect info on GPUs and exclusivity. I'm guessing this might be left over from a time when GPUs were exposed to jobs differently? E.g.:
getting-started/satori-workload-manager-using-slurm.rst
Lines 33 to 40 in 940cdd6
| exclusive. That means that unless you ask otherwise, the GPUs on the node(s)
| you are assigned may already be in use by another user. That means if you
| request a node with 2GPU's the 2 other GPUs on that node may be engaged by
| another job. This allows us to more efficently allocate all of the GPU
| resources. This may require some additional checking to make sure you can
| uniquely use all of the GPU's on a machine. If you're in doubt, you can request
| the node to be 'exclusive' . See below on how to request exclusive access in
| an interactive and batch situation.
I don't think any additional checking is required, nor is it necessary to request exclusive use of the node. My understanding of the current behavior (per @adamdrucker) is that a job gets exclusive use of any GPUs it requests via the --gres flag, and in my experience any additional, unallocated GPUs are simply not exposed to the job at all.
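For what it's worth, this is easy to check from an interactive job. The commands below are just a sketch assuming a 4-GPU AC922 node as in the docs; the exact GPU indices and the fact that Slurm sets CUDA_VISIBLE_DEVICES depend on the site's GRES/cgroup configuration:

```bash
# Sketch: request 2 of a node's 4 GPUs and inspect what the job actually sees.
srun --gres=gpu:2 -N 1 --time 0:30:00 --pty /bin/bash

# Inside the allocation, only the granted GPUs should be visible:
nvidia-smi -L                 # should list 2 GPUs, not 4
echo $CUDA_VISIBLE_DEVICES    # e.g. "0,1" -- the unallocated GPUs are hidden from the job
```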
getting-started/satori-workload-manager-using-slurm.rst
Lines 65 to 78 in 940cdd6
| srun --gres=gpu:4 -N 1 --mem=1T --time 1:00:00 -I --pty /bin/bash
| This will request an AC922 node with 4x GPUs from the Satori (normal
| queue) for 1 hour.
| If you need to make sure no one else can allocate the unused GPU's on the machine you can use
| .. code:: bash
| srun --gres=gpu:4 -N 1 --exclusive --mem=1T --time 1:00:00 -I --pty /bin/bash
| this will request exclusive use of an interactive node with 4GPU's
I believe the first command above is sufficient to ensure that nobody else can allocate the four GPUs on the node, right?
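A minimal batch equivalent without --exclusive might look like the sketch below; only the #SBATCH lines mirror the docs' example, and the workload line is a placeholder:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH --gres=gpu:4        # all 4 GPUs on the node -- the allocation itself reserves them
#SBATCH --mem=1T
#SBATCH --time=1:00:00
# No --exclusive needed just to keep other jobs off these GPUs;
# once they are allocated here, no other job can be granted them.

python train.py             # placeholder workload
```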
getting-started/satori-workload-manager-using-slurm.rst
Lines 178 to 179 in 940cdd6
| - line 13: ``--exclusive`` means that you want full use of the GPUS on the nodes you are reserving. Leaving this out allows
| the GPU resources you're not using on the node to be shared.
Again, my understanding is that this isn't necessary and may even be detrimental: requesting all GPUs on a node is sufficient to give the job exclusive use of them, and omitting the --exclusive flag unless it's really needed (e.g. you need all resources available on a node, not just all of its GPUs) gives the scheduler more flexibility to combine GPU-heavy, CPU-light jobs with jobs that need only CPU cores.
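To illustrate the scheduling benefit (a hypothetical pairing, not something from the docs; script names and core/memory counts are made up): a job that takes all four GPUs but only a few cores can coexist on the same node with a CPU-only job, as long as neither asks for --exclusive:

```bash
# GPU-heavy, CPU-light job: all 4 GPUs, a handful of cores, no --exclusive
sbatch --gres=gpu:4 -N 1 -c 8 --time=1:00:00 train_gpu.sh

# CPU-only job that the scheduler could place on the same node's remaining cores
sbatch -N 1 -n 1 -c 32 --mem=128G --time=1:00:00 preprocess_cpu.sh
```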
Don't have the bandwidth to open a PR at the moment, but hope the above helps! (And please let me know if I misunderstood any of this...)