Multiply hidden_states by normalizer or divide by it? #3

George614 · 2024-09-30T01:32:27Z

Hi Umar,

I absolutely love your YT video explaining the PaliGemma model, and thanks for all the good work! I found this line, which seems to contradict what you said in the video (basically, that the scaling is there to control / reduce the variance so that it does not grow as the text / image embedding dimensions grow). Is this a bug or an intentional scaling of the hidden states?

Best,
George

KevinHooah · 2024-11-04T21:20:43Z

I think this comes from HF's Gemma implementation. But it is never mentioned in the Gemma/Gemma 2 technical reports, so I guess it is some magic lol.

MostHumble · 2024-11-11T15:40:39Z

@George614 @KevinHooah probably for reasons similar to why it's done in the attention mechanism: https://sifal.social/posts/Attention-scores,-Scaling-and-Softmax/
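
To make the two directions of scaling in this thread concrete, here is a minimal pure-Python sketch. It is not the repository's code: the helper names `scaled_embedding` and `attention_score`, the `hidden_size` of 256, and the unit-variance Gaussian inputs are all assumptions chosen for illustration. It only shows the arithmetic: multiplying a vector by sqrt(hidden_size) increases its variance by a factor of hidden_size, while dividing a dot product by sqrt(d) brings its variance from roughly d back to roughly 1. It does not settle why the multiplication is applied to the hidden states.

```python
import math
import random
import statistics


def scaled_embedding(embedding, hidden_size):
    """Multiply by sqrt(hidden_size), the operation this issue asks about."""
    normalizer = math.sqrt(hidden_size)
    return [x * normalizer for x in embedding]


def attention_score(q, k, scale=True):
    """Dot product, optionally divided by sqrt(d) as in scaled dot-product attention."""
    d = len(q)
    score = sum(a * b for a, b in zip(q, k))
    return score / math.sqrt(d) if scale else score


rng = random.Random(0)
hidden_size = 256  # illustrative value, not taken from any model config


def random_vector():
    # Hypothetical vector with zero-mean, unit-variance entries.
    return [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]


# Direction 1: multiplying by the normalizer scales the variance UP by hidden_size.
emb = random_vector()
scaled = scaled_embedding(emb, hidden_size)
var_ratio = statistics.pvariance(scaled) / statistics.pvariance(emb)

# Direction 2: dividing the dot product by sqrt(d) scales its variance DOWN to about 1.
pairs = [(random_vector(), random_vector()) for _ in range(2000)]
raw_var = statistics.pvariance([attention_score(q, k, scale=False) for q, k in pairs])
scaled_var = statistics.pvariance([attention_score(q, k, scale=True) for q, k in pairs])

print(f"embedding variance ratio after multiply: {var_ratio:.1f}")
print(f"dot-product variance, unscaled: {raw_var:.1f}")
print(f"dot-product variance, scaled:   {scaled_var:.2f}")
```

So the attention-style division and the multiplication asked about here push the variance in opposite directions. The multiplication only reduces to a well-behaved scale if the embedding entries start out with small variance, which is an assumption this sketch does not verify for PaliGemma or Gemma.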
