We know that for standard QKV (softmax) attention, the result of q @ k should be divided by sqrt(d). Does the same scaling apply to EfficientViT's ReLU linear attention? (See the sketch after these questions for the formulation I have in mind.)
Does ReLU-based linear attention need LayerNorm or a position embedding?
Does ReLU-based linear attention need multiple heads (multi-head attention)?
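For context, here is a minimal single-head sketch of the two formulations I have in mind: standard scaled dot-product attention versus ReLU linear attention as I understand it from the EfficientViT paper. This is my own toy code, not the official implementation, so please correct me if the ReLU branch does not match what EfficientViT actually does:

```python
# Minimal single-head sketch (not the official EfficientViT code).
# Shapes: q, k, v are (N, d) with N tokens and head dimension d.
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention: the 1/sqrt(d) factor keeps the
    # logits in a range where the softmax does not saturate.
    d = q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / d**0.5   # (N, N)
    return F.softmax(scores, dim=-1) @ v          # (N, d)

def relu_linear_attention(q, k, v, eps=1e-6):
    # ReLU linear attention as I understand it: similarity is ReLU(q) @ ReLU(k)^T,
    # and normalization comes from the denominator term instead of a softmax,
    # so no 1/sqrt(d) scaling is obviously required -- this is exactly what
    # I'd like to confirm.
    q, k = F.relu(q), F.relu(k)
    kv = k.transpose(-2, -1) @ v                  # (d, d), computed first for O(N) cost
    num = q @ kv                                  # (N, d)
    den = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # (N, 1)
    return num / (den + eps)

# Quick check that both run on the same toy input.
q, k, v = torch.randn(16, 32), torch.randn(16, 32), torch.randn(16, 32)
print(softmax_attention(q, k, v).shape, relu_linear_attention(q, k, v).shape)
```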