[Question] How does the gate coordinate across ranks in expert parallelism? #278

@wangyaojlu

Hi, I'm trying to understand how the Gate module works in Tutel's MoE implementation.

Each rank only maintains a subset of the experts (num_experts_per_device), but the Gate output appears to be shaped over the total number of experts globally, so I'm curious how the gates work across different ranks.
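To make the shape question concrete, here is a minimal sketch of the situation as I understand it. It assumes experts are laid out contiguously across ranks (global expert `e` lives on rank `e // num_experts_per_device`); that layout, the helper name `expert_owner`, and the example scores are assumptions for illustration, not something confirmed from Tutel's source.

```python
def expert_owner(global_expert_id, num_experts_per_device):
    """Map a global expert id to (rank, local_index).

    Assumes a contiguous layout: rank r holds global experts
    [r * num_experts_per_device, (r + 1) * num_experts_per_device).
    """
    return divmod(global_expert_id, num_experts_per_device)


world_size = 4
num_experts_per_device = 2
num_global_experts = world_size * num_experts_per_device

# Hypothetical gate scores for one token: one entry per GLOBAL expert,
# even though the current rank only holds num_experts_per_device of them.
gate_scores = [0.10, 0.05, 0.30, 0.05, 0.20, 0.10, 0.15, 0.05]
assert len(gate_scores) == num_global_experts

# Top-1 routing: pick the global expert with the highest score.
top1 = max(range(num_global_experts), key=gate_scores.__getitem__)
rank, local_index = expert_owner(top1, num_experts_per_device)
print(f"token -> global expert {top1} = rank {rank}, local expert {local_index}")
```

Under this assumed layout, a token on any rank can select an expert that lives on a different rank, which is why I expect some communication step after the gate.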

Specifically:

1. Does each rank need to maintain the same gate output?
2. Is there any communication happening after the gate, such as All-to-All, to route tokens to the correct experts?

The reason I'm asking is that after training the model, I noticed that the Gate parameters are not identical across ranks. Is this behavior expected, or does it indicate a problem?
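For anyone who wants to reproduce the observation, below is a minimal sketch of one way such a comparison could be done. It assumes the flattened gate weights from every rank have already been collected into a list of plain Python lists (in a real run this could be done with something like `torch.distributed.all_gather`); the function name `max_abs_diff_across_ranks` is hypothetical and not part of Tutel.

```python
def max_abs_diff_across_ranks(per_rank_weights):
    """Return the largest absolute element-wise difference between
    rank 0's gate weights and every other rank's gate weights.

    per_rank_weights: list with one flat list of floats per rank.
    """
    reference = per_rank_weights[0]
    worst = 0.0
    for rank, weights in enumerate(per_rank_weights[1:], start=1):
        if len(weights) != len(reference):
            raise ValueError(f"rank {rank} has a different number of gate weights")
        for a, b in zip(reference, weights):
            worst = max(worst, abs(a - b))
    return worst


# Illustrative values only: rank 1 differs from rank 0 in one entry.
gathered = [
    [0.5, -1.0, 2.0],
    [0.5, -1.0, 2.25],
]
print(max_abs_diff_across_ranks(gathered))
```

A result of exactly 0.0 would mean the gates are identical on all ranks; any nonzero value corresponds to the mismatch described above.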
