Add create_all_gather_group configuration option by jeffnvidia · Pull Request #15253 · NVIDIA-NeMo/NeMo

jeffnvidia · 2026-01-05T11:48:15Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

PR related

This PR is directly related to this Megatron-LM PR: NVIDIA/Megatron-LM#2663

What does this PR do?

Adds create_all_gather_group configuration option to enable overlapping reduce-scatter and all-gather.

Collection: Core / Distributed Training

Changelog

Add create_all_gather_group parameter to ParallelismConfig class in megatron_strategy.py
Add create_all_gather_group parameter to MegatronStrategy class initialization and setup
Add create_all_gather_group property and setter to AppState class in app_state.py
Propagate create_all_gather_group configuration through init_parallel_ranks() and initialize_model_parallel_for_nemo() functions
Add documentation explaining the purpose of the new parameter (enables separate process group for all-gather operations)
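
The changelog above describes a flag that travels from a config object into shared application state. A minimal sketch of that pattern follows. It is not the NeMo implementation: `SketchParallelismConfig`, `SketchAppState`, and `init_parallel_ranks_sketch` are hypothetical stand-ins for `ParallelismConfig`, `AppState`, and `init_parallel_ranks()`, and the `False` default is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SketchParallelismConfig:
    """Hypothetical stand-in for ParallelismConfig; only a few fields shown."""

    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    create_all_gather_group: bool = False  # assumed default: feature off


class SketchAppState:
    """Hypothetical stand-in for AppState, exposing the flag as a property."""

    def __init__(self) -> None:
        self._create_all_gather_group = False

    @property
    def create_all_gather_group(self) -> bool:
        return self._create_all_gather_group

    @create_all_gather_group.setter
    def create_all_gather_group(self, value: bool) -> None:
        # Coerce to bool so downstream code can rely on the type.
        self._create_all_gather_group = bool(value)


def init_parallel_ranks_sketch(
    config: SketchParallelismConfig, app_state: SketchAppState
) -> SketchAppState:
    """Copy the flag from the config into the shared state (illustrative only)."""
    app_state.create_all_gather_group = config.create_all_gather_group
    return app_state
```

In this sketch the state object is the single place later setup code would read the flag from, which is one plausible reason for adding a property and setter rather than passing the value through every call.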

Usage

Users can now enable overlapping of reduce-scatter and all-gather operations by setting the configuration:

from nemo.lightning.pytorch.strategies import MegatronStrategy, ParallelismConfig

# Configure parallelism with all-gather group creation
parallel_config = ParallelismConfig(
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=2,
    create_all_gather_group=True,  # Enable separate all-gather process group
)

strategy = MegatronStrategy(
    parallel_config=parallel_config,
    create_all_gather_group=True,
)

GitHub Actions CI

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, P

Signed-off-by: jeffnvidia <jmahou@nvidia.com>

jeffnvidia · 2026-01-12T13:44:20Z

The formatting check seems to be failing due to a Git history issue after force-pushing to add sign-offs, not actual formatting problems. I've verified locally that both isort and black checks pass. The workflows are awaiting maintainer approval to run properly.

jeffnvidia · 2026-01-14T09:56:55Z

Hi @nithinraok, could you review this PR or tell me who I should ask? Thanks!

jeffnvidia · 2026-01-15T13:11:04Z

@ericharper @yaoyu-33

Could you please add the "Run CICD" label to trigger the CI tests when you're ready to review?

Note: The reformat_with_isort_and_black check is currently failing due to a GitHub Actions issue (it can't find an old commit SHA after I rebased to add DCO sign-offs). I've verified locally that all formatting checks pass with python setup.py style.

Thanks!

jeffnvidia · 2026-01-29T14:15:31Z

@ericharper I changed the manifest. I think it needs to be re-run now (there is an error because a force push changed a commit).

jeffnvidia · 2026-02-02T13:49:01Z

Hey @ericharper @ko3n1g,

I reverted the changes to the manifest and added backward compatibility. Let me know if I need to do anything else.
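
For readers wondering what "backward compatibility" can look like when a downstream library may or may not accept a new keyword, here is one common approach, sketched under assumptions. It is not taken from this PR's diff: `call_with_supported_kwargs`, `old_initialize`, and `new_initialize` are hypothetical names used only for illustration.

```python
import inspect


def call_with_supported_kwargs(func, **kwargs):
    """Call func, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter means the callee accepts any keyword.
    accepts_any = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_any:
        return func(**kwargs)
    supported = {name: value for name, value in kwargs.items() if name in params}
    return func(**supported)


def old_initialize(tensor_model_parallel_size=1):
    """Hypothetical older API without the new flag."""
    return {"tp": tensor_model_parallel_size}


def new_initialize(tensor_model_parallel_size=1, create_all_gather_group=False):
    """Hypothetical newer API that understands the new flag."""
    return {
        "tp": tensor_model_parallel_size,
        "create_all_gather_group": create_all_gather_group,
    }
```

Calling `call_with_supported_kwargs(old_initialize, tensor_model_parallel_size=2, create_all_gather_group=True)` silently omits the unsupported flag. A real integration might prefer to log a warning in that case so users know the option had no effect.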

Signed-off-by: jeffnvidia <jmahou@nvidia.com>

Add create_all_gather_group configuration option

c81db79

Signed-off-by: jeffnvidia <jmahou@nvidia.com>

jeffnvidia force-pushed the all_gather_param branch from 6155f07 to 549f330 Compare January 8, 2026 14:24

add unit tests

39678ae

Signed-off-by: jeffnvidia <jmahou@nvidia.com>

jeffnvidia force-pushed the all_gather_param branch from 549f330 to 39678ae Compare January 12, 2026 13:34

nithinraok requested review from ericharper and yaoyu-33 January 14, 2026 20:43

jeffnvidia force-pushed the all_gather_param branch from c536399 to 39678ae Compare January 15, 2026 13:05

ericharper previously approved these changes Jan 22, 2026

ericharper added the Run CICD label Jan 22, 2026

ericharper enabled auto-merge (squash) January 22, 2026 22:42

Merge branch 'main' into all_gather_param

66e1261

chtruong814 added Run CICD and removed Run CICD labels Jan 22, 2026

ericharper added Run CICD and removed Run CICD labels Jan 22, 2026

chtruong814 temporarily deployed to test January 22, 2026 22:45 — with GitHub Actions Inactive

ericharper temporarily deployed to test January 25, 2026 16:52 — with GitHub Actions Inactive

auto-merge was automatically disabled January 29, 2026 14:09
Head branch was pushed to by a user without write access

jeffnvidia dismissed ericharper’s stale review via a18bcd1 January 29, 2026 14:09

chtruong814 added Run CICD and removed Run CICD labels Jan 29, 2026

chtruong814 temporarily deployed to test January 29, 2026 14:13 — with GitHub Actions Inactive

jeffnvidia force-pushed the all_gather_param branch from a18bcd1 to 3273f29 Compare January 29, 2026 14:13

chtruong814 added Run CICD and removed Run CICD labels Jan 29, 2026

chtruong814 removed the Run CICD label Jan 29, 2026

ericharper temporarily deployed to test January 29, 2026 18:30 — with GitHub Actions Inactive

Merge remote-tracking branch 'origin/main' into all_gather_param

3b6ef6f

jeffnvidia force-pushed the all_gather_param branch from 6cdec35 to 3b6ef6f Compare February 2, 2026 13:19

chtruong814 added Run CICD and removed Run CICD labels Feb 2, 2026

add backward compatibility to megatron-lm

2d3a540

Signed-off-by: jeffnvidia <jmahou@nvidia.com>

jeffnvidia force-pushed the all_gather_param branch from a70e0cd to 2d3a540 Compare February 2, 2026 15:36

chtruong814 added Run CICD and removed Run CICD labels Feb 2, 2026

chtruong814 previously approved these changes Feb 3, 2026

chtruong814 temporarily deployed to test February 3, 2026 20:52 — with GitHub Actions Inactive

jeffnvidia dismissed chtruong814’s stale review via bc9551a February 4, 2026 13:20

chtruong814 added Run CICD and removed Run CICD labels Feb 4, 2026

chtruong814 temporarily deployed to test February 4, 2026 14:10 — with GitHub Actions Inactive

chtruong814 previously approved these changes Feb 4, 2026

chtruong814 enabled auto-merge (squash) February 4, 2026 14:12

remove test linked to megatron-lm version

8b1a4c4

Signed-off-by: jeffnvidia <jmahou@nvidia.com>

auto-merge was automatically disabled February 5, 2026 09:42
Head branch was pushed to by a user without write access

jeffnvidia dismissed chtruong814’s stale review via 8b1a4c4 February 5, 2026 09:42

jeffnvidia force-pushed the all_gather_param branch from bc9551a to 8b1a4c4 Compare February 5, 2026 09:42

chtruong814 added Run CICD and removed Run CICD labels Feb 5, 2026

chtruong814 approved these changes Feb 5, 2026

chtruong814 temporarily deployed to test February 5, 2026 11:43 — with GitHub Actions Inactive

chtruong814 merged commit 5e9d7f4 into NVIDIA-NeMo:main Feb 5, 2026
150 of 179 checks passed

chtruong814 merged 6 commits into NVIDIA-NeMo:main from jeffnvidia:all_gather_param
