Why not using DW conv #2
Comments
Hi @bonlime, this paper does not aim to design token mixers that achieve SOTA performance. Instead, we want to demonstrate that the competence of transformer/MLP-like models primarily stems from the general architecture, MetaFormer, rather than from the specific token mixers they are equipped with. Thus, what we want is the simplest possible token mixer. AvgPool2d (pool size 3x3) is simpler than DW, so we chose it in this paper to support our claim. If you directly replace AvgPool2d with DW, it will also work very well.
So, what exactly is all you need? Patches, a specific network structure, or just numerous parameters...
Hi @Sumching, in this paper we claim "MetaFormer is actually what you need". We do not use the word "all".
Interesting! I replaced the attention in Uformer with pooling and achieved the same effect on low-level tasks.
Wow, interesting! Maybe you can write a tech report to show this finding.
Hi, thanks for the paper.
While your paper does show once again that any mixing in the spatial domain can work in CV, from a practical point of view there is a large issue with using AvgPool2d. At inference it is not faster than a depthwise conv, but it uses a fixed filter instead of a learned one, which leads to much lower network capacity. Have you tried using DW 3x3 instead of AvgPool?
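To make the "fixed filter instead of a learned one" point concrete, here is a minimal framework-free sketch, not the paper's implementation. It assumes a single channel, stride 1, zero padding of 1, and padding zeros counted in the divisor (frameworks expose this as an option, e.g. `count_include_pad` in PyTorch's `AvgPool2d`, so a real model may be configured differently). The helper names `depthwise_conv3x3`, `avg_pool3x3`, `UNIFORM_KERNEL`, and `max_abs_diff` are hypothetical and exist only for this illustration. Under those assumptions, 3x3 average pooling gives the same output as a depthwise 3x3 convolution whose nine weights are all frozen at 1/9.

```python
def depthwise_conv3x3(channel, kernel):
    """Apply one 3x3 kernel to one channel (stride 1, zero padding of 1)."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    # Positions outside the image contribute zero.
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di + 1][dj + 1] * channel[y][x]
            out[i][j] = acc
    return out


def avg_pool3x3(channel):
    """3x3 average pooling (stride 1, zero padding counted in the divisor)."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        total += channel[y][x]
            # Assumption: always divide by 9, even at the border.
            out[i][j] = total / 9.0
    return out


# The "kernel" that average pooling implicitly uses: fixed, nothing to learn.
UNIFORM_KERNEL = [[1.0 / 9.0] * 3 for _ in range(3)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 2D lists."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


if __name__ == "__main__":
    channel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    pooled = avg_pool3x3(channel)
    convolved = depthwise_conv3x3(channel, UNIFORM_KERNEL)
    print(max_abs_diff(pooled, convolved) < 1e-9)
```

A learned depthwise conv can represent this uniform kernel and any other 3x3 kernel per channel, which is the extra capacity the question refers to. Pooling has no weights at all, which is why the authors call it the simpler token mixer.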