Image resolution at intermediate layers? #252
This is a vision transformer, hence the image resolution is the same throughout the whole network. There are no pooling layers like in a CNN. However, each token corresponds to a patch of size 8x8, hence the feature map resolution is 28x28.
Perhaps my question was ill-formulated. I meant the feature map, as you said. Could you tell me how you reached the number 28 in your calculation?
The input image has size 224x224, hence you divide each dimension by 8 to obtain feature maps of size 28x28. If you choose another patch size (different from 8x8), it may change. If you look at the embeddings given by the model for one image, you get a tensor of shape (785, 768). This is because 785 = 1 + 28*28 (there is a CLS token added in front of the 28x28 = 784 tokens of the feature map). 768 is the hidden dimension (at least with the vitb8 model). If you want to obtain the "image-like" feature maps, you can get rid of the CLS token and reshape the tensor, e.g.:
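A minimal sketch of that reshape, assuming a 224x224 input, patch size 8 (so 28x28 = 784 patch tokens plus one CLS token) and hidden dimension 768; the random tensor stands in for the model's actual output embeddings:

```python
import torch

# Stand-in for the embeddings returned by the ViT-B/8 model for one image:
# 1 CLS token + 784 patch tokens, each of hidden dimension 768.
tokens = torch.randn(785, 768)

patch_tokens = tokens[1:]                          # drop the CLS token -> (784, 768)
feature_map = patch_tokens.reshape(28, 28, 768)    # (H, W, C) grid of patch features
feature_map = feature_map.permute(2, 0, 1)         # (C, H, W), channels-first layout

print(feature_map.shape)  # torch.Size([768, 28, 28])
```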
The above snippet may change slightly if you deal with batched images (add a dimension for the batch then), or with another patch size or hidden dimension depending on the model. (Please note that I am not a creator of this GitHub repo; I only share what I understood of the architecture, because I am currently also digging into DINOv2.)
As is already clear, the images are resized to 224x224 before being fed into DINO. I am currently working with features from the intermediate layers, specifically layer 9. What is the image resolution at that layer, or at any other layer for that matter?
@mathildecaron31 Any help would be appreciated. :)