
What goes wrong when executing "sh train.sh"? #13

@lijiek

sh train.sh
ubun:2375340:2375340 [0] NCCL INFO Bootstrap : Using [0]enp4s0:10.214.24.190<0>
ubun:2375340:2375340 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ubun:2375340:2375340 [0] NCCL INFO NET/IB : No device found.
ubun:2375340:2375340 [0] NCCL INFO NET/Socket : Using [0]enp4s0:10.214.24.190<0>
ubun:2375340:2375340 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
ubun:2375341:2375341 [0] NCCL INFO Bootstrap : Using [0]enp4s0:10.214.24.190<0>
ubun:2375341:2375341 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ubun:2375341:2375341 [0] NCCL INFO NET/IB : No device found.
ubun:2375341:2375341 [0] NCCL INFO NET/Socket : Using [0]enp4s0:10.214.24.190<0>
ubun:2375341:2375341 [0] NCCL INFO Using network Socket

ubun:2375341:2375387 [0] init.cc:573 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000
ubun:2375341:2375387 [0] NCCL INFO init.cc:840 -> 5
ubun:2375341:2375387 [0] NCCL INFO group.cc:73 -> 5 [Async thread]

ubun:2375340:2375386 [0] init.cc:573 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1000
ubun:2375340:2375386 [0] NCCL INFO init.cc:840 -> 5
ubun:2375340:2375386 [0] NCCL INFO group.cc:73 -> 5 [Async thread]
Traceback (most recent call last):
File "runner.py", line 70, in <module>
Traceback (most recent call last):
File "runner.py", line 70, in <module>
main()
File "runner.py", line 50, in main
main()
File "runner.py", line 50, in main
init_distributed_mode()
File "runner.py", line 31, in init_distributed_mode
init_distributed_mode()
File "runner.py", line 31, in init_distributed_mode
torch.distributed.init_process_group(backend='nccl')
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
torch.distributed.init_process_group(backend='nccl')
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
barrier()
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
barrier()
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
work = _default_pg.barrier()
work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/anaconda3/envs/lrgt/bin/python', '-u', 'runner.py', '--local_rank=1']' returned non-zero exit status 1.
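
For anyone reading the log above: the `Duplicate GPU detected` warning says that rank 0 and rank 1 both opened the same CUDA device, and the later `invalid usage` error at `barrier()` follows from that. One common cause (an assumption here, since `runner.py` is not shown) is that each worker never selects its own GPU before calling `init_process_group`. Another is that fewer GPUs are visible than processes launched. A minimal sketch of the usual per-process binding follows. The helper `parse_local_rank` and the body of `init_distributed_mode` are hypothetical and not taken from this repository.

```python
import argparse


def parse_local_rank(argv=None):
    """Read the --local_rank flag that torch.distributed.launch passes to each worker."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    # Ignore any other flags the real script may define.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


def init_distributed_mode(local_rank):
    """Bind this process to its own GPU, then join the NCCL process group.

    Hypothetical replacement for the function of the same name in runner.py.
    """
    # Imported here so the argument parsing above can be used without torch.
    import torch
    import torch.distributed as dist

    visible = torch.cuda.device_count()
    if local_rank >= visible:
        # More worker processes than visible GPUs would also put two ranks on one device.
        raise RuntimeError(
            "local_rank %d but only %d CUDA device(s) visible" % (local_rank, visible)
        )

    # Select the device before NCCL initializes, so each rank uses a different GPU.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
```

In this sketch the script would call `init_distributed_mode(parse_local_rank())` at the start of `main()`. If the machine has a single GPU, or `CUDA_VISIBLE_DEVICES` exposes only one, the number of processes requested in `train.sh` would likely need to be reduced to match.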
