Description
The Hinton distillation paper states:
"The first objective function is the cross entropy with the soft targets and this cross entropy is computed using the same high temperature in the softmax of the distilled model as was used for generating the soft targets from the cumbersome model. The second objective function is the cross entropy with the correct labels."
In https://github.com/szagoruyko/attention-transfer/blob/master/utils.py#L13-L15, the first objective function is computed with kl_div, which differs from cross_entropy.
kl_div computes $-\sum_i t_i \log \frac{x_i}{t_i}$, while cross_entropy computes $-\sum_i t_i \log x_i$, where $t$ is the target distribution (the teacher's soft targets) and $x$ is the predicted distribution.
In general, cross_entropy is kl_div plus the entropy of the targets: $H(t, x) = D_{\mathrm{KL}}(t \,\|\, x) + H(t)$.
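Here is a minimal sketch (with made-up logits and an illustrative temperature, not code from this repo) checking that identity numerically with PyTorch:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a single example, for illustration only.
torch.manual_seed(0)
student_logits = torch.randn(1, 5)
teacher_logits = torch.randn(1, 5)
T = 4.0  # softmax temperature, an illustrative value

p_teacher = F.softmax(teacher_logits / T, dim=1)          # soft targets t
log_p_student = F.log_softmax(student_logits / T, dim=1)  # log(x)

# KL divergence: - sum_i t_i * log(x_i / t_i)
kl = F.kl_div(log_p_student, p_teacher, reduction='batchmean')

# Cross entropy with soft targets: - sum_i t_i * log(x_i)
ce = -(p_teacher * log_p_student).sum(dim=1).mean()

# Entropy of the soft targets: - sum_i t_i * log(t_i)
ent = -(p_teacher * p_teacher.log()).sum(dim=1).mean()

# cross_entropy == kl_div + entropy(t), up to floating-point error
print(torch.allclose(ce, kl + ent))  # True
```

Note that $H(t)$ depends only on the teacher's soft targets, so with a fixed teacher the two losses differ by a constant and have the same gradient with respect to the student's parameters.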
Did I misunderstand something, or did you use a slightly different loss in your implementation?