Hi, I'm trying to understand how the Gate module works in Tutel's MoE implementation.
Each rank only maintains a subset of the experts (num_experts_per_device), but the Gate output seems to have one entry per expert across all ranks, i.e. it is sized by the global number of experts. So I'm curious how the gates work across different ranks.
Specifically:
Does each rank need to maintain the same gate output?
Is there any communication happening after the gate, such as All-to-All, to route tokens to the correct experts?
The reason I'm asking is that after training the model, I noticed that the Gate parameters are not identical across ranks. Is this behavior expected, or does it indicate a problem?
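
To make "not identical" concrete, here is a minimal sketch of the kind of comparison I mean. It is plain Python rather than Tutel code: `gate_fingerprint` and `ranks_agree` are hypothetical helpers, and the weights are toy values. In a real run the per-rank values would first have to be collected on one rank, for example with `torch.distributed.all_gather`.

```python
import hashlib
import struct
from typing import Dict, List


def gate_fingerprint(weights: List[float]) -> str:
    """Hash a flat list of gate weights so ranks can be compared cheaply."""
    # Pack as little-endian float64 so the hash depends on the exact bits.
    packed = struct.pack(f"<{len(weights)}d", *weights)
    return hashlib.sha256(packed).hexdigest()


def ranks_agree(weights_by_rank: Dict[int, List[float]]) -> bool:
    """Return True only if every rank holds bit-identical gate weights."""
    fingerprints = {gate_fingerprint(w) for w in weights_by_rank.values()}
    return len(fingerprints) <= 1


# Toy data standing in for per-rank gate weights (not real Tutel output).
identical = {0: [0.1, -0.2, 0.3], 1: [0.1, -0.2, 0.3]}
diverged = {0: [0.1, -0.2, 0.3], 1: [0.1, -0.2, 0.31]}

print(ranks_agree(identical))
print(ranks_agree(diverged))
```

In my case the result corresponds to the second call: the gate weights differ between ranks. Note that this check is bit-exact, so it does not distinguish small numerical drift from a real divergence.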