
Add create_all_gather_group configuration option #15253

Merged
chtruong814 merged 6 commits into NVIDIA-NeMo:main from jeffnvidia:all_gather_param
Feb 5, 2026

Conversation

@jeffnvidia
Contributor

@jeffnvidia jeffnvidia commented Jan 5, 2026

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

PR related

This PR is directly related to the Megatron-LM PR NVIDIA/Megatron-LM#2663.

What does this PR do ?

Adds create_all_gather_group configuration option to enable overlapping reduce-scatter and all-gather.

Collection: Core / Distributed Training

Changelog

  • Add create_all_gather_group parameter to ParallelismConfig class in megatron_strategy.py
  • Add create_all_gather_group parameter to MegatronStrategy class initialization and setup
  • Add create_all_gather_group property and setter to AppState class in app_state.py
  • Propagate create_all_gather_group configuration through init_parallel_ranks() and initialize_model_parallel_for_nemo() functions
  • Add documentation explaining the purpose of the new parameter (enables separate process group for all-gather operations)
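The changelog above describes a flag that is declared on the config, stored on the shared application state through a property and setter, and copied across during rank initialization. The sketch below illustrates that flow in plain Python. All class and function names ending in `Sketch` or `_sketch` are hypothetical stand-ins, not the actual NeMo classes, and the `False` default is an assumption rather than something stated in this PR.

```python
from dataclasses import dataclass


@dataclass
class ParallelismConfigSketch:
    """Hypothetical, trimmed-down stand-in for ParallelismConfig."""

    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    create_all_gather_group: bool = False  # assumed default: feature is opt-in


class AppStateSketch:
    """Hypothetical stand-in for AppState that exposes the flag as a property."""

    def __init__(self) -> None:
        self._create_all_gather_group = False

    @property
    def create_all_gather_group(self) -> bool:
        return self._create_all_gather_group

    @create_all_gather_group.setter
    def create_all_gather_group(self, value: bool) -> None:
        # Normalize to bool so later checks do not depend on the input type.
        self._create_all_gather_group = bool(value)


def init_parallel_ranks_sketch(
    config: ParallelismConfigSketch, app_state: AppStateSketch
) -> AppStateSketch:
    """Copy the flag from the config onto the shared state object."""
    app_state.create_all_gather_group = config.create_all_gather_group
    return app_state


# Usage: the flag set on the config becomes visible on the state object.
state = init_parallel_ranks_sketch(
    ParallelismConfigSketch(create_all_gather_group=True), AppStateSketch()
)
print(state.create_all_gather_group)
```

Keeping the flag behind a property means downstream code reads it from one place, regardless of whether it was set through the config or directly on the strategy.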

Usage

Users can now enable overlapping of reduce-scatter and all-gather operations by setting the configuration:

from nemo.lightning.pytorch.strategies import MegatronStrategy, ParallelismConfig

# Configure parallelism with all-gather group creation
parallel_config = ParallelismConfig(
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=2,
    create_all_gather_group=True,  # Enable separate all-gather process group
)

strategy = MegatronStrategy(
    parallel_config=parallel_config,
    create_all_gather_group=True,
)

GitHub Actions CI

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, P

Signed-off-by: jeffnvidia <jmahou@nvidia.com>
Signed-off-by: jeffnvidia <jmahou@nvidia.com>
@jeffnvidia
Contributor Author

The formatting check seems to be failing due to a Git history issue after force-pushing to add sign-offs, not actual formatting problems. I've verified locally that both isort and black checks pass. The workflows are awaiting maintainer approval to run properly.

@jeffnvidia
Contributor Author

Hi @nithinraok, could you review this PR or tell me who I should ask? Thanks!

@jeffnvidia
Contributor Author

@ericharper @yaoyu-33

Could you please add the "Run CICD" label to trigger the CI tests when you're ready to review?

Note: The reformat_with_isort_and_black check is currently failing due to a GitHub Actions issue (it can't find an old commit SHA after I rebased to add DCO sign-offs). I've verified locally that all formatting checks pass with python setup.py style.

Thanks!

ericharper
ericharper previously approved these changes Jan 22, 2026
@ericharper ericharper enabled auto-merge (squash) January 22, 2026 22:42
auto-merge was automatically disabled January 29, 2026 14:09

Head branch was pushed to by a user without write access

@jeffnvidia
Contributor Author

@ericharper I changed the manifest, so I think CI needs to be re-run now (there is an error because the commit changed after a force push).

@jeffnvidia
Contributor Author

jeffnvidia commented Feb 2, 2026

Hey @ericharper @ko3n1g ,

I reverted the changes to the manifest and added backward compatibility. Let me know if there is anything else I need to do.
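For readers wondering what backward compatibility can look like for a newly added keyword argument, one common pattern is to forward the argument only when the downstream function's signature accepts it. The sketch below illustrates that general pattern. It is an assumption about the approach, not a copy of the code in this PR, and `call_with_supported_kwargs`, `old_initialize`, and `new_initialize` are hypothetical names.

```python
import inspect


def call_with_supported_kwargs(func, **kwargs):
    """Call func, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(func).parameters
    accepts_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_var_kwargs:
        # func takes **kwargs, so everything can be forwarded unchanged.
        return func(**kwargs)
    supported = {k: v for k, v in kwargs.items() if k in params}
    return func(**supported)


def old_initialize(tensor_model_parallel_size=1):
    """Hypothetical older downstream API without the new flag."""
    return {"tp": tensor_model_parallel_size}


def new_initialize(tensor_model_parallel_size=1, create_all_gather_group=False):
    """Hypothetical newer downstream API that accepts the new flag."""
    return {"tp": tensor_model_parallel_size, "ag": create_all_gather_group}


# The same call site works against both versions of the downstream function.
args = {"tensor_model_parallel_size": 2, "create_all_gather_group": True}
print(call_with_supported_kwargs(old_initialize, **args))
print(call_with_supported_kwargs(new_initialize, **args))
```

The trade-off is that an unsupported flag is dropped silently, so a real implementation might also log a warning when the requested option cannot be honored.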

Signed-off-by: jeffnvidia <jmahou@nvidia.com>
chtruong814
chtruong814 previously approved these changes Feb 3, 2026
chtruong814
chtruong814 previously approved these changes Feb 4, 2026
@chtruong814 chtruong814 enabled auto-merge (squash) February 4, 2026 14:12
Signed-off-by: jeffnvidia <jmahou@nvidia.com>
auto-merge was automatically disabled February 5, 2026 09:42

Head branch was pushed to by a user without write access

@chtruong814 chtruong814 merged commit 5e9d7f4 into NVIDIA-NeMo:main Feb 5, 2026
150 of 179 checks passed

3 participants