Replacing the Hugging Face LlamaDecoderLayer Class With a New LongNet Encoder Layer #94
Replies: 1 comment
-
Thanks for posting, but questions about the Hugging Face transformers library are out of scope for this book. I think this question might be a better fit for the Hugging Face forums.
-
Hey, I hope you are doing great this weekend. I would like to ask you a technical question, please!
I am working on the CodeLlama model, which uses a decoder-only transformer architecture.
My main task is to replace the decoder-only blocks, which use masked self-attention and a KV cache, with my own encoder-only blocks that use the dilated attention from LongNet; my code is based on the LongNet implementation.
I planned to replace the `LlamaDecoderLayer` block (the original decoder-only block used in CodeLlama) with my own encoder-only block, inheriting from the Hugging Face base classes. Here is the process I followed to make the replacement:
Step 1: Inherit from `LlamaConfig` to add the new configuration parameters used by my encoder model, which uses dilated multi-head attention.
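A minimal sketch of such a config subclass (the parameter names `segment_lengths`, `dilation_rates`, and `use_flash_attention_2` are placeholders for whatever the dilated attention actually needs, not the exact names used):

```python
from transformers import LlamaConfig


class DilatedEncoderConfig(LlamaConfig):
    """LlamaConfig extended with LongNet-style dilated-attention parameters."""

    model_type = "dilated-encoder-llama"

    def __init__(
        self,
        segment_lengths=(2048, 4096, 8192),  # placeholder segment sizes
        dilation_rates=(1, 2, 4),            # matching dilation rates
        use_flash_attention_2=False,         # optional, depends on the GPU
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.segment_lengths = list(segment_lengths)
        self.dilation_rates = list(dilation_rates)
        self.use_flash_attention_2 = use_flash_attention_2
```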
Step 2: The only part I want to replace is `self_attn`, swapping it for my own multi-head dilated attention, which is a LongNet-based mechanism. In the dilated attention, `flash_attention_2` is optional, depending on whether the GPU architecture supports it (e.g., A100 versus T4). Here is the multi-head dilated attention.
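A simplified sketch of what a LongNet-style multi-head dilated attention could look like: it keeps every r-th token inside segments of length w, runs dense unmasked attention on each sparsified segment, and naively averages the branches. `F.scaled_dot_product_attention` dispatches to FlashAttention kernels on GPUs that support them (such as the A100) and falls back to the standard implementation otherwise (e.g., on a T4).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiheadDilatedAttention(nn.Module):
    """Simplified LongNet-style dilated attention (unmasked, encoder-only)."""

    def __init__(self, hidden_size, num_heads, segment_lengths, dilation_rates):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.segment_lengths = segment_lengths
        self.dilation_rates = dilation_rates
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, hidden_states):
        # assumes seq_len is divisible by each segment length and each
        # segment length is divisible by its dilation rate
        bsz, seq_len, hidden = hidden_states.shape
        q = self.q_proj(hidden_states)
        k = self.k_proj(hidden_states)
        v = self.v_proj(hidden_states)

        out = torch.zeros_like(q)
        for w, r in zip(self.segment_lengths, self.dilation_rates):
            # sparsify: keep every r-th token inside each segment of length w
            idx = torch.arange(seq_len, device=q.device)
            keep = idx[(idx % w) % r == 0]
            per_seg = w // r  # kept tokens per segment

            def split(t):  # (bsz, n_keep, hidden) -> (bsz, n_seg, heads, per_seg, head_dim)
                t = t[:, keep].view(bsz, -1, per_seg, self.num_heads, self.head_dim)
                return t.permute(0, 1, 3, 2, 4)

            # dense, unmasked attention inside each sparsified segment;
            # SDPA uses FlashAttention kernels when the GPU supports them
            attn = F.scaled_dot_product_attention(split(q), split(k), split(v), is_causal=False)
            attn = attn.permute(0, 1, 3, 2, 4).reshape(bsz, -1, hidden)
            out[:, keep] += attn  # scatter the segment outputs back
        return self.o_proj(out / len(self.segment_lengths))
```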
Step 3: To replace the layer itself, I inherited from the Hugging Face base class (`LlamaDecoderLayer`) and swapped in the new attention.
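A rough sketch of that subclass, assuming a transformers version where `LlamaDecoderLayer.__init__` takes `(config, layer_idx)` (in older versions it only takes `config`):

```python
from transformers.models.llama.modeling_llama import LlamaDecoderLayer


class DilatedEncoderLayer(LlamaDecoderLayer):
    """LlamaDecoderLayer with the causal self-attention swapped for the
    unmasked multi-head dilated attention sketched above."""

    def __init__(self, config, layer_idx):
        super().__init__(config, layer_idx)
        self.self_attn = MultiheadDilatedAttention(
            hidden_size=config.hidden_size,
            num_heads=config.num_attention_heads,
            segment_lengths=config.segment_lengths,
            dilation_rates=config.dilation_rates,
        )
        # Caveat: LlamaDecoderLayer.forward passes extra arguments
        # (attention_mask, position_ids, past_key_value, ...) to self.self_attn,
        # so the replacement module must accept (and can ignore) them, or
        # forward must be overridden with an encoder-only version that skips
        # the causal mask and the KV cache.
```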
Note: As long as `is_causal=None`, the attention mechanism is not masked, which leads to learning a full representation when producing the embedding space of token vectors. In other words, the encoder-only model learns the feature relationships between all tokens attended to during the dot-product similarity, instead of the masked attention used in the decoder-only model, which I am not interested in at this point.
Step 4: I reconstructed the model using the adjusted config class. I did the following.
Note: I adjusted `num_hidden_layers` only for the showcase (`config.num_hidden_layers = 2`); the original parameter is `num_hidden_layers=32`.
Note: I did not use rotary embeddings because the attention used is linear.
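A minimal sketch of that reconstruction, reusing the `DilatedEncoderConfig` and `DilatedEncoderLayer` sketched above (building a plain `LlamaModel` backbone and then swapping its layers is just one possible way to do it):

```python
from transformers import LlamaModel

# adjusted config: 2 layers for the showcase instead of CodeLlama's 32
config = DilatedEncoderConfig(num_hidden_layers=2)

# build the backbone, then swap every decoder layer for the encoder layer
encoder = LlamaModel(config)
for idx in range(config.num_hidden_layers):
    encoder.layers[idx] = DilatedEncoderLayer(config, idx)
```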
Q1: Please correct me if I need to keep the rotary embeddings in my encoder-only model.
Final step: Transfer the weights of the layers `["q_proj", "k_proj", "v_proj", "o_proj"]` from the decoder-only model to the encoder-only model.
Comparing the new encoder-only model with the decoder-only model (the original decoder-only used in CodeLlama versus the encoder-only with the adjustments I made), both have the same linear layers, `["q_proj", "k_proj", "v_proj", "o_proj"]`.
Here is the code I built to transfer the weights.
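A sketch of that weight transfer (the checkpoint name is just an example; it only copies the four projection weights, and only for as many layers as the smaller model has):

```python
import torch
from transformers import LlamaModel

# pretrained decoder-only backbone (checkpoint name is an example)
decoder = LlamaModel.from_pretrained("codellama/CodeLlama-7b-hf")

proj_names = ["q_proj", "k_proj", "v_proj", "o_proj"]

with torch.no_grad():
    # zip stops at the shorter model, so only the first encoder layers get weights
    for enc_layer, dec_layer in zip(encoder.layers, decoder.layers):
        for name in proj_names:
            enc_proj = getattr(enc_layer.self_attn, name)
            dec_proj = getattr(dec_layer.self_attn, name)
            enc_proj.weight.copy_(dec_proj.weight)  # same shape in both models
```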
Please correct me if I have misunderstood anything about transforming CodeLlama into an encoder-only model to learn the embeddings.
Thank you so much in advance!