You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi again, sorry for the slow response in issue #26. I have some more clarifications and visualizations here.
I agree that the sine-cosine embeddings are not learnable. However it seems like they still need to be interpolated for the model to work well. I suspect that this is at least partially due to the fact that they are 1d, and thus the model has to learn the number of rows/columns. E.g. it cannot express "look one patch down" directly, but rather needs to express it as "look X patches forward". And X changes if we change resolution.
I have attached attention visualizations that show what happens if you run on higher res with or without interpolating the positional embedding. As you can see, the non-interpolated version looks much worse and has weird diagonal stripes.
This is not a major issue to me, but I wanted to let you (and anyone else that has the same problem) know about this. I think the best solution is what I mentioned before: to simply include the positional embeddings in the checkpoint even though they are not learnable parameters.
Original:
With interpolation:
Without interpolation:
The text was updated successfully, but these errors were encountered:
Hi again, sorry for the slow response in issue #26. I have some more clarifications and visualizations here.
I agree that the sine-cosine embeddings are not learnable. However it seems like they still need to be interpolated for the model to work well. I suspect that this is at least partially due to the fact that they are 1d, and thus the model has to learn the number of rows/columns. E.g. it cannot express "look one patch down" directly, but rather needs to express it as "look X patches forward". And X changes if we change resolution.
I have attached attention visualizations that show what happens if you run on higher res with or without interpolating the positional embedding. As you can see, the non-interpolated version looks much worse and has weird diagonal stripes.
This is not a major issue to me, but I wanted to let you (and anyone else that has the same problem) know about this. I think the best solution is what I mentioned before: to simply include the positional embeddings in the checkpoint even though they are not learnable parameters.
Original:
With interpolation:
Without interpolation:
The text was updated successfully, but these errors were encountered: