In the original paper the authors suggest adding positional encodings to the speech and text representations before the transformer block. I noticed that in your code the positional encodings are commented out. Have you tried training the model with positional encodings and, if so, is there any difference in performance?
I changed the implementation slightly there. The authors use an encoder-only transformer, so they needed to add different positional embeddings for the text and the speech, while I use a full encoder-decoder model (which internally uses its own positional embeddings). Adding them anyway is fine, and the results depend on your training data and other factors. It's really a matter of preference, so I left the code as a comment for anyone who wants to use it.
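For reference, here is a minimal sketch (not this repository's actual code) of the encoder-only setup the paper describes: separate positional encodings are added to the speech and text embeddings before the two sequences are concatenated and fed to a single transformer encoder. The dimensions, module names, and the choice of sinusoidal encodings are assumptions made purely for illustration.

```python
import math
import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Standard fixed sinusoidal positional encoding (Vaswani et al., 2017)."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]


class EncoderOnlyFusion(nn.Module):
    """Hypothetical encoder-only variant: modality-specific positional encodings, then concat."""

    def __init__(self, d_model: int = 256, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.speech_pos = SinusoidalPositionalEncoding(d_model)
        self.text_pos = SinusoidalPositionalEncoding(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, speech_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Add positional information to each modality separately, then fuse the sequences.
        fused = torch.cat([self.speech_pos(speech_emb), self.text_pos(text_emb)], dim=1)
        return self.encoder(fused)


if __name__ == "__main__":
    model = EncoderOnlyFusion()
    speech = torch.randn(2, 120, 256)  # e.g. 120 speech frames
    text = torch.randn(2, 30, 256)     # e.g. 30 text tokens
    print(model(speech, text).shape)   # torch.Size([2, 150, 256])
```

In the encoder-decoder variant used here, the speech and text sequences live on different sides of the model, so each side already gets its own positional embeddings and the extra per-modality encodings above become optional.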