[Question] How does the gate coordinate across ranks in expert parallelism? #278

Hi, I'm trying to understand how the Gate module works in Tutel's MoE implementation.

Each rank maintains only a subset of the experts (num_experts_per_device), yet the Gate output appears to be shaped over the total number of experts globally. I'd like to understand how the gates on different ranks relate to one another.
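To make the question concrete, here is a minimal sketch of my current mental model. It is an assumption on my part, not taken from Tutel's code: the helper names `owner_rank` and `route_top1` are hypothetical, and I am assuming top-1 routing with experts laid out in contiguous blocks across ranks.

```python
# Hypothetical sketch of my mental model; NOT Tutel's actual implementation.
# Assumption: global expert e lives on rank e // num_experts_per_device.

def owner_rank(expert_id: int, num_experts_per_device: int) -> int:
    """Rank assumed to host a given global expert (contiguous block layout)."""
    return expert_id // num_experts_per_device


def route_top1(gate_logits, num_experts_per_device):
    """For each token, pick the highest-scoring global expert and its assumed owner rank."""
    routes = []
    for logits in gate_logits:
        expert_id = max(range(len(logits)), key=lambda e: logits[e])
        routes.append((expert_id, owner_rank(expert_id, num_experts_per_device)))
    return routes


# 2 ranks x 2 local experts = 4 global experts; 3 tokens held by the local rank.
gate_logits = [
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.1, 0.1, 0.6],
    [0.1, 0.1, 0.5, 0.3],
]
routes = route_top1(gate_logits, num_experts_per_device=2)
for token_idx, (expert_id, rank) in enumerate(routes):
    print(f"token {token_idx} -> global expert {expert_id} on rank {rank}")
```

If this picture is roughly right, some tokens would be assigned to experts hosted on another rank, which is why I'm asking about communication after the gate.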

Specifically:

1. Does each rank need to maintain the same gate output?
2. Is there any communication happening after the gate, such as All-to-All, to route tokens to the correct experts?

The reason I'm asking is that, after training the model, I noticed that the Gate parameters are not identical across ranks. Is this behavior expected, or does it indicate a problem?
