Hi!
I was wondering why the dot-product attention is scaled by what appear to be learnable weights (self.temperature, self.temperature2) before the softmax, instead of by 1/sqrt(d) as described in the paper?
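For context, here is a minimal sketch of the two variants I mean: the fixed 1/sqrt(d_k) scaling from "Attention Is All You Need" versus a learnable temperature. The class and parameter names below are illustrative only and not taken from this repository.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedScaleAttention(nn.Module):
    """Scaling as in the paper: scores divided by a constant sqrt(d_k)."""
    def __init__(self, d_k: int):
        super().__init__()
        self.scale = 1.0 / math.sqrt(d_k)  # constant, not learned

    def forward(self, q, k, v):
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        return torch.matmul(F.softmax(scores, dim=-1), v)

class LearnableTemperatureAttention(nn.Module):
    """Variant where the scale is a trainable parameter (hypothetical
    stand-in for what self.temperature might be doing here),
    initialized to sqrt(d_k) so training starts at the paper's scaling."""
    def __init__(self, d_k: int):
        super().__init__()
        self.temperature = nn.Parameter(torch.tensor(math.sqrt(d_k)))

    def forward(self, q, k, v):
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.temperature
        return torch.matmul(F.softmax(scores, dim=-1), v)
```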
Best Regards,
Måns