```python
unscaled_loss = torch.nn.functional.cross_entropy(
```
Thanks for your great work! I have a question about the loss weighting and reduction.
LLaDA and SDAR use a weighted loss: they multiply the per-token loss by 1/p_mask, whereas here you use a uniform loss across all masked positions.
What is the difference between the two, and which one is more stable in training? Looking forward to your reply!
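For concreteness, here is a minimal sketch of the two variants I mean (all names such as `logits`, `targets`, `mask`, and `p_mask` are placeholders for illustration, not taken from this repo):

```python
import torch
import torch.nn.functional as F

# Toy shapes, purely for illustration: batch B, sequence length L, vocab V.
B, L, V = 2, 8, 100
logits = torch.randn(B, L, V)
targets = torch.randint(0, V, (B, L))

p_mask = torch.rand(B, 1)            # sampled mask ratio per sequence (LLaDA's t)
mask = torch.rand(B, L) < p_mask     # each token masked with probability p_mask

# Per-token cross-entropy, no reduction yet.
token_loss = F.cross_entropy(
    logits.reshape(-1, V), targets.reshape(-1), reduction="none"
).reshape(B, L)

# Variant 1 (this repo, as I read it): uniform mean over masked positions.
uniform_loss = (token_loss * mask).sum() / mask.sum()

# Variant 2 (LLaDA/SDAR style): reweight each masked token by 1 / p_mask,
# then normalize by the full sequence length rather than the mask count.
weighted_loss = (token_loss * mask / p_mask).sum() / (B * L)
```

My understanding is that the 1/p_mask factor makes the objective an unbiased estimate of LLaDA's likelihood bound, but the weights blow up when p_mask is small, so I wonder whether the uniform version trades a little bias for lower gradient variance. Is that why you chose it?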