Adding the XCCL DPU team, and DPU daemon #106

Open
janjust wants to merge 2 commits into openucx:master from janjust:xccl-dpu-team

janjust commented Jan 6, 2021

This PR adds the new DPU team as well as a contrib directory with the accompanying DPU daemon app.

This is a first but comprehensive attempt; it successfully runs the PyTorch param-comms benchmark.
Tested on 32 BlueField-enabled nodes.

There are several configuration options to keep in mind when running.

new config options:
--with-dpu=yes

client/host side:
two new flags, plus an additional dpu value for the TLS parameter:
-x TORCH_UCC_TLS=dpu
-x XCCL_TEAM_DPU_ENABLE=1
-x XCCL_TEAM_DPU_HOST_DPU_LIST=

The host_dpu_list file is a one-to-one host mapping file that the DPU team uses to identify the IP address of each host's DPU.
e.g.:
host1 dpu1
host2 dpu2
etc.
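
To illustrate the one-to-one mapping described above, here is a minimal Python sketch that parses such a file. It is not the XCCL implementation (which lives in C inside the DPU team); the function name `parse_host_dpu_list`, the whitespace-separated two-column layout, and the handling of blank lines and duplicates are assumptions made for illustration.

```python
def parse_host_dpu_list(text):
    """Parse 'host dpu' pairs (one per line) into a host -> DPU dict.

    Hypothetical helper: the real parser is part of the XCCL DPU team.
    """
    host_to_dpu = {}
    seen_dpus = set()
    for lineno, raw in enumerate(text.splitlines(), start=1):
        fields = raw.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 'host dpu', got {raw!r}")
        host, dpu = fields
        # Enforce the one-to-one property: no host or DPU may repeat.
        if host in host_to_dpu or dpu in seen_dpus:
            raise ValueError(f"line {lineno}: duplicate entry in {raw!r}")
        host_to_dpu[host] = dpu
        seen_dpus.add(dpu)
    return host_to_dpu


if __name__ == "__main__":
    mapping = parse_host_dpu_list("host1 dpu1\nhost2 dpu2\n")
    print(mapping["host1"])
```

Rejecting repeated hosts or DPUs is a design choice in this sketch; it simply makes a malformed mapping fail early instead of silently pairing two hosts with the same DPU.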

dpu side:
-x DPU_DATA_BUFFER_SIZE=$((16 * 1024 * 1024))
An environment variable that sets the buffer size available on the DPU.
If not provided, the default is 16MB.
./dpu_server <threads (int)>
By default it uses a single thread.
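
The defaults described above (16MB buffer, single thread) can be sketched as follows. This is only an illustration of the stated behavior, not the daemon's actual C code; the function names and the treatment of empty or non-positive values are assumptions.

```python
import os

# 16 * 1024 * 1024 bytes, matching the default stated in the PR description.
DEFAULT_BUFFER_SIZE = 16 * 1024 * 1024


def dpu_buffer_size(env=None):
    """Return the buffer size from DPU_DATA_BUFFER_SIZE, or the default."""
    env = os.environ if env is None else env
    raw = env.get("DPU_DATA_BUFFER_SIZE")
    if not raw:
        return DEFAULT_BUFFER_SIZE  # unset (or empty, by assumption)
    size = int(raw)
    if size <= 0:
        raise ValueError("DPU_DATA_BUFFER_SIZE must be positive")
    return size


def thread_count(argv):
    """Return the thread count from argv[1], defaulting to one thread."""
    if len(argv) < 2:
        return 1
    threads = int(argv[1])
    if threads < 1:
        raise ValueError("thread count must be at least 1")
    return threads


if __name__ == "__main__":
    print(dpu_buffer_size({}), thread_count(["./dpu_server", "4"]))
```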

e.g.:
mpirun -np 4 --map-by ppr:1:node -x UCX_NET_DEVICES=mlx5_0:1 -x XCCL_TEST_TLS=ucx --bind-to none --report-bindings --tag-output -hostfile file.dpus -x LD_LIBRARY_PATH  ./dpu_server 4

Signed-off-by: Tomislavj Janjusic tomislavj@nvidia.com

Co-authored-by: Artem Polyakov artpol84@gmail.com
Sergey Lebedev sergeyle@nvidia.com

janjust commented Jan 6, 2021

@manjugv @Sergei-Lebedev @vspetrov
Hey guys - this is the PR that adds the DPU team, developed during the hackathon by @artpol84, @Sergei-Lebedev, and me.

It's the first attempt that runs successfully, but it obviously needs thorough vetting.
We did preliminary data checks with the xccl allreduce tests, which seem to pass, and it successfully runs the pytorch param/comms bench.

artpol84 commented Jan 7, 2021

@janjust please change the commit message as follows:

Co-authored-by: Artem Polyakov <artpol84@gmail.com>
Co-authored-by: Sergey Lebedev <sergeyle@nvidia.com>

Per https://docs.github.com/en/free-pro-team@latest/github/committing-changes-to-your-project/creating-a-commit-with-multiple-authors
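
As a small illustration of the trailer layout requested above (one `Co-authored-by:` line per person, with the email in angle brackets, placed after a blank line at the end of the message), here is a hypothetical Python helper. The name `with_coauthors` is made up for this sketch and is not part of git or GitHub tooling.

```python
def with_coauthors(message, coauthors):
    """Append one 'Co-authored-by: Name <email>' trailer per co-author."""
    trailers = [f"Co-authored-by: {name} <{email}>" for name, email in coauthors]
    # Trailers go at the end of the message, separated by a blank line.
    return message.rstrip("\n") + "\n\n" + "\n".join(trailers) + "\n"


if __name__ == "__main__":
    print(with_coauthors(
        "Adding the XCCL DPU team, and DPU daemon",
        [("Artem Polyakov", "artpol84@gmail.com"),
         ("Sergey Lebedev", "sergeyle@nvidia.com")],
    ))
```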

artpol84 commented Jan 7, 2021

I tried it out of curiosity and it works as expected:
artpol84@91a6466

Signed-off-by: Tomislavj Janjusic <tomislavj@nvidia.com>

Co-authored-by: Artem Polyakov <artpol84@gmail.com>
Co-authored-by: Sergey Lebedev <sergeyle@nvidia.com>

janjust commented Jan 7, 2021

@janjust please change the commit message as follows:

Co-authored-by: Artem Polyakov <artpol84@gmail.com>
Co-authored-by: Sergey Lebedev <sergeyle@nvidia.com>

Per https://docs.github.com/en/free-pro-team@latest/github/committing-changes-to-your-project/creating-a-commit-with-multiple-authors

done

Signed-off-by: Tomislavj Janjusic <tommy.janjusic@gmail.com>

Co-authored-by: Artem Polyakov <artpol84@gmail.com>
Co-authored-by: Sergey Lebedev <sergeyle@nvidia.com>