Add create_all_gather_group configuration option #15253
chtruong814 merged 6 commits into NVIDIA-NeMo:main
Signed-off-by: jeffnvidia <jmahou@nvidia.com>
6155f07 to 549f330
Signed-off-by: jeffnvidia <jmahou@nvidia.com>
549f330 to 39678ae
The formatting check seems to be failing due to a Git history issue after force-pushing to add sign-offs, not actual formatting problems. I've verified locally that both isort and black checks pass. The workflows are awaiting maintainer approval to run properly.
Hi @nithinraok, could you review this PR or tell me who I could refer to? Thanks!
c536399 to 39678ae
Could you please add the "Run CICD" label to trigger the CI tests when you're ready to review? Note: The Thanks!
Head branch was pushed to by a user without write access
a18bcd1 to 3273f29
@ericharper I changed the manifest, I think it needs to be re-run now (there is an error because of a commit change due to a force push)
6cdec35 to 3b6ef6f
Hey @ericharper @ko3n1g, I reverted the changes to the manifest and I added backward compatibility. Let me know if I need to do anything more now.
Signed-off-by: jeffnvidia <jmahou@nvidia.com>
a70e0cd to 2d3a540
Signed-off-by: jeffnvidia <jmahou@nvidia.com>
Head branch was pushed to by a user without write access
bc9551a to 8b1a4c4
Important
The Update branch button must only be pressed on very rare occasions. An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.
PR related
This PR is directly related to this Megatron-LM PR: NVIDIA/Megatron-LM#2663
What does this PR do ?
Adds create_all_gather_group configuration option to enable overlapping reduce-scatter and all-gather.
Collection: Core / Distributed Training
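For the option to take effect, it has to travel from the user-facing config to whatever code sets up the process groups. The following is only a minimal sketch of that kind of plumbing, not NeMo's actual code: `SketchParallelismConfig`, `SketchAppState`, and `sketch_init_parallel_ranks` are hypothetical stand-ins, and the `getattr` default is just one assumed way to stay backward compatible with configs that predate the flag.

```python
from dataclasses import dataclass


@dataclass
class SketchParallelismConfig:
    """Hypothetical stand-in for a parallelism config (not NeMo's class)."""

    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    # Defaulting to False keeps the previous behaviour unless the user opts in.
    create_all_gather_group: bool = False


class SketchAppState:
    """Hypothetical stand-in for a global app-state object (not NeMo's class)."""

    def __init__(self) -> None:
        self._create_all_gather_group = False

    @property
    def create_all_gather_group(self) -> bool:
        return self._create_all_gather_group

    @create_all_gather_group.setter
    def create_all_gather_group(self, value: bool) -> None:
        self._create_all_gather_group = bool(value)


def sketch_init_parallel_ranks(config, app_state):
    """Copy the flag from the config into the app state (hypothetical helper)."""
    # getattr with a default tolerates older config objects that lack the field.
    app_state.create_all_gather_group = getattr(config, "create_all_gather_group", False)
    return app_state


state = sketch_init_parallel_ranks(
    SketchParallelismConfig(tensor_model_parallel_size=2, create_all_gather_group=True),
    SketchAppState(),
)
print(state.create_all_gather_group)
```

In this sketch the flag defaults to off at every layer, so code that never sets it behaves as before.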
Changelog
- create_all_gather_group parameter to ParallelismConfig class in megatron_strategy.py
- create_all_gather_group parameter to MegatronStrategy class initialization and setup
- create_all_gather_group property and setter to AppState class in app_state.py
- create_all_gather_group configuration through init_parallel_ranks() and initialize_model_parallel_for_nemo() functions

Usage
Users can now enable overlapping of reduce-scatter and all-gather operations by setting the configuration:

```python
from nemo.lightning.pytorch.strategies import MegatronStrategy, ParallelismConfig

# Configure parallelism with all-gather group creation
parallel_config = ParallelismConfig(
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=2,
    create_all_gather_group=True,  # Enable separate all-gather process group
)

strategy = MegatronStrategy(
    parallel_config=parallel_config,
    create_all_gather_group=True,
)
```

# GitHub Actions CI
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks: