Image resolution at intermediate layers? #252
This is a vision transformer, hence the image resolution is the same throughout the whole network. There are no pooling layers like in a CNN. However, each token corresponds to a patch of size 8x8, hence the feature map resolution is 28x28.
Perhaps my question was ill-formulated. I meant the feature map, as you said. Could you tell me how you reached the number 28 in your calculation?
The input image has size 224x224, hence you divide each dimension by 8 to obtain feature maps of size 28x28. If you choose another patch size (different from 8x8), it may change. If you look at the embeddings given by the model for one image, you get a tensor of shape (785, 768). This is because 785 = 1 + 28*28 (there is a CLS token added in front of the 28x28 = 784 tokens of the feature map). 768 is the hidden dimension (at least with the vitb8 model). If you want to obtain the "image-like" feature maps, you can get rid of the CLS token and reshape the tensor, e.g.:
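A minimal sketch of that reshape, assuming a 224x224 input, patch size 8 (so 28x28 = 784 patch tokens plus one CLS token) and hidden dimension 768; the random tensor stands in for the model's actual output embeddings:

```python
import torch

# Stand-in for the embeddings returned by the ViT-B/8 model for one image:
# 1 CLS token + 784 patch tokens, each of hidden dimension 768.
tokens = torch.randn(785, 768)

patch_tokens = tokens[1:]                          # drop the CLS token -> (784, 768)
feature_map = patch_tokens.reshape(28, 28, 768)    # (H, W, C) grid of patch features
feature_map = feature_map.permute(2, 0, 1)         # (C, H, W), channels-first layout

print(feature_map.shape)  # torch.Size([768, 28, 28])
```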
The above snippet may change slightly if you deal with batched images (add a dimension for the batch then), or with another patch size or hidden dimension depending on the model. (Please note that I am not a creator of this GitHub repo; I only share what I understood of the architecture, because I am currently also digging into DINOv2.)
As is already clear, the images are resized to 224x224 before being fed into DINO. I am currently working with features from the intermediate layers, specifically layer 9. What is the image resolution at that layer, or at any other layer for that matter?
@mathildecaron31 Any help would be appreciated. :)