TorchDistPackage provides easy-to-use modules and tools for distributed training in PyTorch.
It is under construction; you are welcome to use it and to contribute.
- Install

```bash
git clone https://github.com/KimmiShi/TorchDistPackage.git
cd TorchDistPackage
pip install -e .  # or: pip install . --user
```

- Simple example
```python
import torch
import torch.distributed as dist

from torchdistpackage import setup_distributed, tpc

# init torch.distributed
setup_distributed()

# init process groups: data parallel x pipeline parallel x tensor parallel
pp_size = 2
tp_size = 2
world_size = dist.get_world_size()
dist_config = [('data', world_size // (tp_size * pp_size)),
               ('pipe', pp_size),
               ('tensor', tp_size)]
tpc.setup_process_groups(dist_config)

# test communication in groups
tmp = torch.rand([100, 1024]).cuda()

# collective
dist.broadcast(tmp, tpc.get_ranks_in_group('model')[0], tpc.get_group('model'))

# p2p
if tpc.is_first_in_pipeline_group():
    dist.send(tmp, tpc.get_next_global_rank('pipe'))
if tpc.is_last_in_pipeline_group():
    dist.recv(tmp, tpc.get_prev_global_rank('pipe'))
```

- NaiveDDP (example: TestNaiveDdp)
Highlights:
- Pure-Python implementation; easy to understand and debug.
- Overlaps gradient reduction with compute, like TorchDDP (see the sketch after this list).
- For pipeline parallelism, gradients are reduced only on the last micro-batch while communication still overlaps with compute, which improves on the ColossalAI implementation.

Drawbacks/TODO:
- The all-reduce launch appears to take more time than TorchDDP on some models.
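To make the overlap idea concrete, here is a minimal sketch of launching gradient all-reduces during the backward pass using per-parameter hooks. It is not the TestNaiveDdp implementation; the helper `attach_grad_allreduce` is a hypothetical name, and the sketch assumes PyTorch >= 2.1 for `register_post_accumulate_grad_hook`.

```python
import torch
import torch.distributed as dist

def attach_grad_allreduce(model, group=None):
    """Hypothetical helper: launch an async all-reduce for each parameter's
    gradient as soon as it has been accumulated, so communication overlaps
    with the remainder of the backward pass."""
    handles = []  # pending async all-reduce work handles

    def hook(param):
        # Fires right after this parameter's gradient is fully accumulated.
        param.grad.div_(dist.get_world_size(group))
        handles.append(dist.all_reduce(param.grad, group=group, async_op=True))

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)

    def finalize():
        # Call after loss.backward() and before optimizer.step().
        for h in handles:
            h.wait()
        handles.clear()

    return finalize
```

With such hooks, the reduction of a layer's gradients starts while backward is still running for earlier layers, which is the same kind of overlap TorchDDP obtains from gradient bucketing.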
See the main features introduction for details.
- 1F1B pipeline scheduler with customizable fwd_fn and bwd_fn (a generic ordering sketch follows this list)
- Pipeline-parallel model partitioning
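For orientation, this is the 1F1B (one-forward-one-backward) ordering that such a scheduler follows on each pipeline stage. It is a generic illustration, not the package's scheduler; `fwd_fn` and `bwd_fn` stand for the user-provided per-micro-batch callables, and communication with neighboring stages is omitted.

```python
def one_f_one_b(stage, num_stages, num_microbatches, fwd_fn, bwd_fn):
    """Generic 1F1B ordering for one pipeline stage (communication omitted)."""
    # Earlier stages run more warm-up forwards so that later stages can start.
    num_warmup = min(num_stages - stage - 1, num_microbatches)
    num_steady = num_microbatches - num_warmup

    fwd_mb, bwd_mb = 0, 0
    for _ in range(num_warmup):          # warm-up: forwards only
        fwd_fn(fwd_mb); fwd_mb += 1
    for _ in range(num_steady):          # steady state: one forward, one backward
        fwd_fn(fwd_mb); fwd_mb += 1
        bwd_fn(bwd_mb); bwd_mb += 1
    while bwd_mb < num_microbatches:     # cool-down: drain remaining backwards
        bwd_fn(bwd_mb); bwd_mb += 1
```

Limiting the number of in-flight forwards to `num_stages - stage` caps how many activations each stage keeps alive, which is the main memory advantage of 1F1B over a GPipe-style schedule.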
On top of expert parallelism (Expert Parallel), MoE data parallelism is supported: some experts are replicated, and replicas of the same expert run data parallelism among themselves (initial parameters are broadcast, gradients are averaged), while distinct experts run expert parallelism.
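A rough sketch of what that replica synchronization amounts to, written against plain `torch.distributed` rather than this package's API: `replica_group` is assumed to contain exactly the ranks holding copies of the same expert, and `src_rank` is the global rank holding the reference copy.

```python
import torch.distributed as dist

def sync_expert_replicas(expert, replica_group, src_rank):
    """Broadcast the expert's initial parameters so all replicas of this
    expert start from identical weights."""
    for p in expert.parameters():
        dist.broadcast(p.data, src=src_rank, group=replica_group)

def allreduce_expert_grads(expert, replica_group):
    """Average gradients across replicas of the same expert (MoE data
    parallelism); gradients of different experts are never mixed."""
    world = dist.get_world_size(replica_group)
    for p in expert.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, group=replica_group)
            p.grad.div_(world)
```

Non-expert (dense) parameters would still be reduced over the full data-parallel group as usual.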
A simple TP (tensor parallelism) implementation.
See the main features introduction for details.
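As a reminder of what a minimal tensor-parallel layer looks like, here is a Megatron-style column-parallel plus row-parallel MLP sketch built on plain `torch.distributed`. It is not this package's implementation; `ParallelMLP` and the helper autograd functions are illustrative names.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class _CopyToTP(torch.autograd.Function):
    """Identity in forward; all-reduce the gradient across TP ranks in backward."""
    @staticmethod
    def forward(ctx, x, group):
        ctx.group = group
        return x
    @staticmethod
    def backward(ctx, grad):
        grad = grad.contiguous().clone()
        dist.all_reduce(grad, group=ctx.group)
        return grad, None

class _ReduceFromTP(torch.autograd.Function):
    """All-reduce partial activations in forward; identity in backward."""
    @staticmethod
    def forward(ctx, x, group):
        x = x.contiguous().clone()
        dist.all_reduce(x, group=group)
        return x
    @staticmethod
    def backward(ctx, grad):
        return grad, None

class ParallelMLP(nn.Module):
    """fc1 is column-parallel (output features sharded), fc2 is row-parallel
    (input features sharded); one all-reduce per direction reassembles the result."""
    def __init__(self, hidden, ffn, tp_group):
        super().__init__()
        tp = dist.get_world_size(tp_group)
        assert ffn % tp == 0
        self.group = tp_group
        self.fc1 = nn.Linear(hidden, ffn // tp)
        self.fc2 = nn.Linear(ffn // tp, hidden, bias=False)  # bias-after-reduce omitted

    def forward(self, x):
        x = _CopyToTP.apply(x, self.group)
        y = torch.relu(self.fc1(x))
        y = self.fc2(y)
        return _ReduceFromTP.apply(y, self.group)
```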
Reduces the GPU memory consumed by EMA (exponential moving average) weights; see the sharded EMA example.
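The underlying idea can be sketched as follows: each rank keeps the EMA of only its own slice of the flattened parameters, so the extra memory is divided by the group size. This is a generic illustration under those assumptions, not the code behind the sharded EMA example; the class name `ShardedEMA` below is illustrative.

```python
import torch
import torch.distributed as dist

class ShardedEMA:
    """Illustrative sharded EMA: each rank stores the moving average of only
    a 1/world_size slice of the flattened parameters."""
    def __init__(self, params, decay=0.999, group=None):
        self.decay = decay
        self.group = group
        self.params = [p for p in params if p.requires_grad]
        rank = dist.get_rank(group)
        world = dist.get_world_size(group)
        flat = self._flatten()
        shard = (flat.numel() + world - 1) // world
        self.start, self.end = rank * shard, min((rank + 1) * shard, flat.numel())
        self.shard = flat[self.start:self.end].clone()  # this rank's EMA slice

    def _flatten(self):
        return torch.cat([p.detach().reshape(-1) for p in self.params])

    @torch.no_grad()
    def update(self):
        cur = self._flatten()[self.start:self.end]
        self.shard.mul_(self.decay).add_(cur, alpha=1 - self.decay)

    @torch.no_grad()
    def gather(self):
        """Reassemble the full EMA vector (e.g. for evaluation or checkpointing)."""
        full = torch.zeros_like(self._flatten())
        full[self.start:self.end] = self.shard
        dist.all_reduce(full, group=self.group)  # shards are disjoint, so sum == concat
        return full
```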
Time and GPU memory profiling; see the reference.
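For example, the wall time and peak GPU memory of a single training step can be measured with PyTorch's built-in counters. This is a generic sketch, not this package's profiling utility; `step_fn` is any callable that runs one step.

```python
import time
import torch

def profile_step(step_fn):
    """Run one training step and report its wall time and peak GPU memory."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()          # make sure prior GPU work is finished
    start = time.time()
    step_fn()
    torch.cuda.synchronize()          # wait for the step's kernels to complete
    elapsed_ms = (time.time() - start) * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"step time: {elapsed_ms:.1f} ms, peak GPU memory: {peak_gb:.2f} GB")
```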