Multiply hidden_states by normalizer or dividing by it #3

Open
George614 opened this issue Sep 30, 2024 · 2 comments

Comments

@George614

Hi Umar,

I absolutely love your YT video explaining the PaliGemma model, and thanks for all the good work! I found this line, which seems to contradict what you said in the video (basically, that the scaling is there to control / reduce the variance so that it does not grow as the text / image embedding dimensions grow). Is this a bug or an intentional scaling of the hidden states?

Best,
George
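
To make the question concrete, here is a minimal sketch (not the repository's code) of what each choice does to the variance of an embedding vector. The value of `hidden_size`, the unit-variance Gaussian entries, and the variable names are assumptions chosen for illustration only.

```python
import math
import random
import statistics

random.seed(0)

hidden_size = 2048  # assumed embedding width, for illustration only
normalizer = math.sqrt(hidden_size)

# Hypothetical embedding vector with roughly unit-variance entries.
hidden_states = [random.gauss(0.0, 1.0) for _ in range(hidden_size)]

multiplied = [x * normalizer for x in hidden_states]
divided = [x / normalizer for x in hidden_states]

var_base = statistics.pvariance(hidden_states)
var_mul = statistics.pvariance(multiplied)
var_div = statistics.pvariance(divided)

# Scaling by a constant c scales the variance by c**2.
print(var_mul / var_base)  # close to hidden_size
print(var_div / var_base)  # close to 1 / hidden_size
```

Multiplying by the normalizer makes the variance grow by a factor of `hidden_size`, while dividing shrinks it by the same factor, which is why the direction of the scaling matters here.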

@KevinHooah

I think this comes from HF's Gemma implementation. It is never mentioned in the Gemma/Gemma2 technical reports though, so I guess it is some magic lol.

@MostHumble
Contributor

MostHumble commented Nov 11, 2024

@George614 @KevinHooah probably for reasons similar to why it's done in the attention mechanism: https://sifal.social/posts/Attention-scores,-Scaling-and-Softmax/
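
For reference, the attention-side scaling mentioned here can be sketched as follows. This is a rough numerical illustration under assumed conditions (independent unit-variance query and key entries, a made-up `d_k` and sample count), not code from the linked post or from the repository.

```python
import math
import random
import statistics

random.seed(0)

d_k = 64          # assumed head dimension, for illustration only
n_samples = 2000  # number of random query/key pairs to average over

raw_scores = []
for _ in range(n_samples):
    q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
    k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
    raw_scores.append(sum(qi * ki for qi, ki in zip(q, k)))

# Scaled dot-product attention divides the scores by sqrt(d_k).
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

var_raw = statistics.pvariance(raw_scores)
var_scaled = statistics.pvariance(scaled_scores)

print(var_raw)     # grows with d_k (roughly d_k here)
print(var_scaled)  # roughly 1, independent of d_k
```

Note that attention divides by the square root of the dimension, whereas the line in question multiplies by it, so the two operations move the variance in opposite directions.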
