Description
sh train.sh
ubun:2375340:2375340 [0] NCCL INFO Bootstrap : Using [0]enp4s0:10.214.24.190<0>
ubun:2375340:2375340 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ubun:2375340:2375340 [0] NCCL INFO NET/IB : No device found.
ubun:2375340:2375340 [0] NCCL INFO NET/Socket : Using [0]enp4s0:10.214.24.190<0>
ubun:2375340:2375340 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
ubun:2375341:2375341 [0] NCCL INFO Bootstrap : Using [0]enp4s0:10.214.24.190<0>
ubun:2375341:2375341 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ubun:2375341:2375341 [0] NCCL INFO NET/IB : No device found.
ubun:2375341:2375341 [0] NCCL INFO NET/Socket : Using [0]enp4s0:10.214.24.190<0>
ubun:2375341:2375341 [0] NCCL INFO Using network Socket
ubun:2375341:2375387 [0] init.cc:573 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000
ubun:2375341:2375387 [0] NCCL INFO init.cc:840 -> 5
ubun:2375341:2375387 [0] NCCL INFO group.cc:73 -> 5 [Async thread]
ubun:2375340:2375386 [0] init.cc:573 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1000
ubun:2375340:2375386 [0] NCCL INFO init.cc:840 -> 5
ubun:2375340:2375386 [0] NCCL INFO group.cc:73 -> 5 [Async thread]
Traceback (most recent call last):
File "runner.py", line 70, in <module>
Traceback (most recent call last):
File "runner.py", line 70, in <module>
main()
File "runner.py", line 50, in main
main()
File "runner.py", line 50, in main
init_distributed_mode()
File "runner.py", line 31, in init_distributed_mode
init_distributed_mode()
File "runner.py", line 31, in init_distributed_mode
torch.distributed.init_process_group(backend='nccl')
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
torch.distributed.init_process_group(backend='nccl')
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
barrier()
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
barrier()
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
work = _default_pg.barrier()
work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/anaconda3/envs/lrgt/bin/python', '-u', 'runner.py', '--local_rank=1']' returned non-zero exit status 1.
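
The "Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1000" warning suggests that both worker processes ended up on the same physical GPU before `init_process_group` ran its barrier. This usually happens when the worker never selects a device from `--local_rank`, or when fewer GPUs are visible than processes were launched. The following is only a sketch of how `init_distributed_mode` in `runner.py` might bind each rank to its own device; the real function body is not shown in the log, so `parse_local_rank` and `pick_device` are hypothetical helpers, and the code assumes the script is started through `torch.distributed.launch`, which passes `--local_rank` as seen in the last log line.

```python
import argparse


def parse_local_rank(argv=None):
    """Read the --local_rank flag that torch.distributed.launch passes to each worker."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    # Ignore any other flags the training script may define.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


def pick_device(local_rank, device_count):
    """Return the CUDA device index for this worker, failing early if none is free."""
    if device_count < 1:
        raise RuntimeError("no CUDA devices are visible to this process")
    if local_rank >= device_count:
        # Two ranks sharing one GPU is what NCCL reports as "Duplicate GPU detected".
        raise RuntimeError(
            "local_rank %d has no GPU of its own: only %d device(s) visible"
            % (local_rank, device_count)
        )
    return local_rank


def init_distributed_mode(argv=None):
    """Hypothetical stand-in for the function of the same name in runner.py."""
    import torch
    import torch.distributed as dist

    local_rank = parse_local_rank(argv)
    device = pick_device(local_rank, torch.cuda.device_count())
    # Select the GPU before creating the process group so each rank uses its own device.
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl")
    return device
```

If `pick_device` raises, the machine (or the `CUDA_VISIBLE_DEVICES` setting) exposes fewer GPUs than the number of processes requested in `train.sh`, and the process count would need to be lowered to match.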