TorchDistPackage provides easy-to-use modules and tools for distributed training in PyTorch.
It is under construction; you are welcome to use it and to contribute.
- Install

```bash
git clone https://github.com/KimmiShi/TorchDistPackage.git
cd TorchDistPackage
pip install -e .  # or: pip install . --user
```

- Simple example
```python
import torch
import torch.distributed as dist

from torchdistpackage import setup_distributed, tpc

# init torch.distributed
setup_distributed()

# init process groups: data parallel x pipeline parallel x tensor parallel
pp_size = 2
tp_size = 2
world_size = dist.get_world_size()
dist_config = [('data', world_size // (tp_size * pp_size)),
               ('pipe', pp_size),
               ('tensor', tp_size)]
tpc.setup_process_groups(dist_config)

# test communication in groups
tmp = torch.rand([100, 1024]).cuda()

# collective
dist.broadcast(tmp, tpc.get_ranks_in_group('model')[0], tpc.get_group('model'))

# p2p
if tpc.is_first_in_pipeline_group():
    dist.send(tmp, tpc.get_next_global_rank('pipe'))
if tpc.is_last_in_pipeline_group():
    dist.recv(tmp, tpc.get_prev_global_rank('pipe'))
```

- NaiveDDP (example: TestNaiveDdp)
Highlights:
- Pure-Python implementation; easy to understand and debug.
- Overlaps gradient reduction with compute, like TorchDDP (see the sketch after this list).
- For pipeline parallelism, gradients are reduced only on the last micro-batch while communication still overlaps with compute, which improves on the ColossalAI implementation.

Drawbacks/TODO:
- The all-reduce launch appears to take more time than TorchDDP on some models.
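To make the overlap idea concrete, here is a minimal sketch of launching gradient all-reduces during the backward pass using per-parameter hooks. It is not the TestNaiveDdp implementation; the helper `attach_grad_allreduce` is a hypothetical name, and the sketch assumes PyTorch >= 2.1 for `register_post_accumulate_grad_hook`.

```python
import torch
import torch.distributed as dist

def attach_grad_allreduce(model, group=None):
    """Hypothetical helper: launch an async all-reduce for each parameter's
    gradient as soon as it has been accumulated, so communication overlaps
    with the remainder of the backward pass."""
    handles = []  # pending async all-reduce work handles

    def hook(param):
        # Fires right after this parameter's gradient is fully accumulated.
        param.grad.div_(dist.get_world_size(group))
        handles.append(dist.all_reduce(param.grad, group=group, async_op=True))

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)

    def finalize():
        # Call after loss.backward() and before optimizer.step().
        for h in handles:
            h.wait()
        handles.clear()

    return finalize
```

With such hooks, the reduction of a layer's gradients starts while backward is still running for earlier layers, which is the same kind of overlap TorchDDP obtains from gradient bucketing.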
See the main features introduction for details.
- 1F1B pipeline scheduler with customizable fwd_fn and bwd_fn (a generic ordering sketch follows this list)
- Pipeline-parallel model partitioning
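For orientation, this is the 1F1B (one-forward-one-backward) ordering that such a scheduler follows on each pipeline stage. It is a generic illustration, not the package's scheduler; `fwd_fn` and `bwd_fn` stand for the user-provided per-micro-batch callables, and communication with neighboring stages is omitted.

```python
def one_f_one_b(stage, num_stages, num_microbatches, fwd_fn, bwd_fn):
    """Generic 1F1B ordering for one pipeline stage (communication omitted)."""
    # Earlier stages run more warm-up forwards so that later stages can start.
    num_warmup = min(num_stages - stage - 1, num_microbatches)
    num_steady = num_microbatches - num_warmup

    fwd_mb, bwd_mb = 0, 0
    for _ in range(num_warmup):          # warm-up: forwards only
        fwd_fn(fwd_mb); fwd_mb += 1
    for _ in range(num_steady):          # steady state: one forward, one backward
        fwd_fn(fwd_mb); fwd_mb += 1
        bwd_fn(bwd_mb); bwd_mb += 1
    while bwd_mb < num_microbatches:     # cool-down: drain remaining backwards
        bwd_fn(bwd_mb); bwd_mb += 1
```

Limiting the number of in-flight forwards to `num_stages - stage` caps how many activations each stage keeps alive, which is the main memory advantage of 1F1B over a GPipe-style schedule.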
On top of expert parallelism (Expert Parallel), MoE data parallelism is supported: some experts are replicated, and replicas of the same expert run data parallelism among themselves (initial parameters are broadcast, gradients are averaged), while distinct experts run expert parallelism.
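A rough sketch of what that replica synchronization amounts to, written against plain `torch.distributed` rather than this package's API: `replica_group` is assumed to contain exactly the ranks holding copies of the same expert, and `src_rank` is the global rank holding the reference copy.

```python
import torch.distributed as dist

def sync_expert_replicas(expert, replica_group, src_rank):
    """Broadcast the expert's initial parameters so all replicas of this
    expert start from identical weights."""
    for p in expert.parameters():
        dist.broadcast(p.data, src=src_rank, group=replica_group)

def allreduce_expert_grads(expert, replica_group):
    """Average gradients across replicas of the same expert (MoE data
    parallelism); gradients of different experts are never mixed."""
    world = dist.get_world_size(replica_group)
    for p in expert.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, group=replica_group)
            p.grad.div_(world)
```

Non-expert (dense) parameters would still be reduced over the full data-parallel group as usual.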
A simple TP (tensor parallelism) implementation.
See the main features introduction for details.
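As a reminder of what a minimal tensor-parallel layer looks like, here is a Megatron-style column-parallel plus row-parallel MLP sketch built on plain `torch.distributed`. It is not this package's implementation; `ParallelMLP` and the helper autograd functions are illustrative names.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class _CopyToTP(torch.autograd.Function):
    """Identity in forward; all-reduce the gradient across TP ranks in backward."""
    @staticmethod
    def forward(ctx, x, group):
        ctx.group = group
        return x
    @staticmethod
    def backward(ctx, grad):
        grad = grad.contiguous().clone()
        dist.all_reduce(grad, group=ctx.group)
        return grad, None

class _ReduceFromTP(torch.autograd.Function):
    """All-reduce partial activations in forward; identity in backward."""
    @staticmethod
    def forward(ctx, x, group):
        x = x.contiguous().clone()
        dist.all_reduce(x, group=group)
        return x
    @staticmethod
    def backward(ctx, grad):
        return grad, None

class ParallelMLP(nn.Module):
    """fc1 is column-parallel (output features sharded), fc2 is row-parallel
    (input features sharded); one all-reduce per direction reassembles the result."""
    def __init__(self, hidden, ffn, tp_group):
        super().__init__()
        tp = dist.get_world_size(tp_group)
        assert ffn % tp == 0
        self.group = tp_group
        self.fc1 = nn.Linear(hidden, ffn // tp)
        self.fc2 = nn.Linear(ffn // tp, hidden, bias=False)  # bias-after-reduce omitted

    def forward(self, x):
        x = _CopyToTP.apply(x, self.group)
        y = torch.relu(self.fc1(x))
        y = self.fc2(y)
        return _ReduceFromTP.apply(y, self.group)
```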
Reduces the GPU memory consumed by EMA (exponential moving average) weights; see the sharded EMA example.
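The underlying idea can be sketched as follows: each rank keeps the EMA of only its own slice of the flattened parameters, so the extra memory is divided by the group size. This is a generic illustration under those assumptions, not the code behind the sharded EMA example; the class name `ShardedEMA` below is illustrative.

```python
import torch
import torch.distributed as dist

class ShardedEMA:
    """Illustrative sharded EMA: each rank stores the moving average of only
    a 1/world_size slice of the flattened parameters."""
    def __init__(self, params, decay=0.999, group=None):
        self.decay = decay
        self.group = group
        self.params = [p for p in params if p.requires_grad]
        rank = dist.get_rank(group)
        world = dist.get_world_size(group)
        flat = self._flatten()
        shard = (flat.numel() + world - 1) // world
        self.start, self.end = rank * shard, min((rank + 1) * shard, flat.numel())
        self.shard = flat[self.start:self.end].clone()  # this rank's EMA slice

    def _flatten(self):
        return torch.cat([p.detach().reshape(-1) for p in self.params])

    @torch.no_grad()
    def update(self):
        cur = self._flatten()[self.start:self.end]
        self.shard.mul_(self.decay).add_(cur, alpha=1 - self.decay)

    @torch.no_grad()
    def gather(self):
        """Reassemble the full EMA vector (e.g. for evaluation or checkpointing)."""
        full = torch.zeros_like(self._flatten())
        full[self.start:self.end] = self.shard
        dist.all_reduce(full, group=self.group)  # shards are disjoint, so sum == concat
        return full
```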
Time and GPU memory profiling; see the reference.
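For example, the wall time and peak GPU memory of a single training step can be measured with PyTorch's built-in counters. This is a generic sketch, not this package's profiling utility; `step_fn` is any callable that runs one step.

```python
import time
import torch

def profile_step(step_fn):
    """Run one training step and report its wall time and peak GPU memory."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()          # make sure prior GPU work is finished
    start = time.time()
    step_fn()
    torch.cuda.synchronize()          # wait for the step's kernels to complete
    elapsed_ms = (time.time() - start) * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"step time: {elapsed_ms:.1f} ms, peak GPU memory: {peak_gb:.2f} GB")
```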