POA: Pre-training Once for Models of All Sizes #13

Dongwoo-Im opened this issue Nov 10, 2024

arXiv: https://arxiv.org/abs/2408.01031
GitHub: https://github.com/alipay/POA


Motivation: the weight-sharing strategy from NAS (neural architecture search)

Similar studies

  • Cosub: Co-training 2L Submodels for Visual Recognition (CVPR 2023)
    • supervised learning
  • Weighted Ensemble Self-Supervised Learning (ICLR 2023)
    • each head contains an identical number of prototypes
    • and averages the cross-entropy losses, weighted by the predictive entropy of each head


Cross-view distillation: the teacher's output on one view supervises both the intact student and the elastic student on the other view.
Same-view distillation: the intact student's output supervises the elastic student on the same view.
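
A minimal sketch of how these two terms could be combined, assuming DINO-style softmax targets and temperatures; the tensors and names below (`t_b`, `s_int_a`, `s_ela_a`) are illustrative placeholders, not the released code.

```python
import torch
import torch.nn.functional as F

def distill(student_logits, teacher_logits, temp_s=0.1, temp_t=0.04):
    # Soft cross-entropy of the student against a detached target distribution
    # (temperatures follow common DINO defaults; the paper's values may differ).
    target = F.softmax(teacher_logits.detach() / temp_t, dim=-1)
    return -(target * F.log_softmax(student_logits / temp_s, dim=-1)).sum(-1).mean()

# Dummy head outputs for two augmented views (batch of 4, 1024 prototypes).
t_b     = torch.randn(4, 1024)  # teacher on view B
s_int_a = torch.randn(4, 1024)  # intact student on view A
s_ela_a = torch.randn(4, 1024)  # elastic student on view A

# Cross-view: the teacher's output on the other view supervises both students.
loss_cross = distill(s_int_a, t_b) + distill(s_ela_a, t_b)
# Same-view: the intact student's output supervises the elastic student on the same view.
loss_same = distill(s_ela_a, s_int_a)
loss = loss_cross + loss_same
```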



Depth = the number of blocks
Width = the number of channels

=> combining the elastic widths and depths (plus the intact ones), a total of (N+1) x (M+1) distinct sub-networks can be generated
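
Just to make the counting concrete, a tiny sketch with made-up elastic options (the actual per-architecture choices are listed in the Model section below):

```python
from itertools import product

# Hypothetical elastic options for a ViT-L/16 teacher (24 blocks, 1024 channels).
intact_depth, intact_width = 24, 1024
elastic_depths = [12, 16, 20]     # N = 3 reduced depths
elastic_widths = [384, 512, 768]  # M = 3 reduced widths

depths = elastic_depths + [intact_depth]  # N + 1 depth choices
widths = elastic_widths + [intact_width]  # M + 1 width choices

subnetworks = list(product(depths, widths))
print(len(subnetworks))  # (N + 1) x (M + 1) = 4 x 4 = 16 distinct sub-networks
```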

For training, multiple projection heads (MPH) are used.
For each head, the distillation loss L_Si is calculated for both the intact and the elastic student.
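
A rough sketch of the MPH idea, assuming K independent linear heads and a per-head soft cross-entropy standing in for L_Si; the class and function names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiProjectionHead(nn.Module):
    # K independent projection heads on top of a shared backbone feature (illustrative).
    def __init__(self, in_dim=768, out_dim=1024, num_heads=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(num_heads))

    def forward(self, feat):
        return [head(feat) for head in self.heads]  # one set of logits per head

def mph_loss(student_logits_per_head, target_logits_per_head, temp=0.1):
    # One distillation loss per head (L_Si), then averaged over heads.
    losses = []
    for logits, target in zip(student_logits_per_head, target_logits_per_head):
        p_target = F.softmax(target.detach() / temp, dim=-1)
        losses.append(-(p_target * F.log_softmax(logits / temp, dim=-1)).sum(-1).mean())
    return torch.stack(losses).mean()

# Usage with dummy features: the same loss is computed once for the intact student
# and once for the elastic student, each against the teacher's per-head outputs.
heads_s, heads_t = MultiProjectionHead(), MultiProjectionHead()
feat_student, feat_teacher = torch.randn(4, 768), torch.randn(4, 768)
loss = mph_loss(heads_s(feat_student), heads_t(feat_teacher))
```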


Model

  • ViT (11 x 13)
  • Swin (3 x 13)
  • ResNet (3 x 155)
    • Probabilistic Sampling for Elastic Student
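
A small sketch of what per-iteration sampling of one elastic configuration could look like; the weighting below (favouring smaller sub-networks) is purely illustrative and not the paper's actual sampling distribution.

```python
import random

# Hypothetical candidate (depth, width) pairs for the elastic student.
candidates = [(d, w) for d in (12, 16, 20, 24) for w in (384, 512, 768, 1024)]

# Illustrative weights: smaller sub-networks are drawn more often here;
# the paper's probabilistic sampling may use a different distribution.
weights = [1.0 / (d * w) for d, w in candidates]

def sample_elastic_config():
    # Draw exactly one elastic configuration for the current training iteration.
    return random.choices(candidates, weights=weights, k=1)[0]

depth, width = sample_elastic_config()
print(f"elastic student for this iteration: depth={depth}, width={width}")
```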


We derive the sub-networks ViT-S/16 and ViT-B/16 from the teacher ViT-L/16 without any additional pre-training.
-> In effect this amounts to distilling the Large model, but the important point seems to be that this training scheme still gives stable performance for the Small and Base models.
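
Conceptually, extracting ViT-S/B amounts to keeping the first d blocks and the leading w channels of each weight tensor in the pre-trained teacher; the snippet below is only a schematic of that slicing (real layers such as QKV and MLP projections need per-layer rules), not the released extraction script.

```python
import torch

TEACHER_EMBED_DIM = 1024  # ViT-L/16 channel width

def extract_subnet(teacher_state: dict, depth: int, width: int) -> dict:
    # Schematic: keep the first `depth` blocks and the leading `width` channels
    # of every dimension that matches the teacher's embedding size.
    sub = {}
    for name, tensor in teacher_state.items():
        if name.startswith("blocks."):
            block_idx = int(name.split(".")[1])
            if block_idx >= depth:
                continue  # drop blocks beyond the target depth
        sliced = tensor
        for dim, size in enumerate(tensor.shape):
            if size == TEACHER_EMBED_DIM:
                sliced = sliced.narrow(dim, 0, width)
        sub[name] = sliced.clone()
    return sub

# e.g. a ViT-S-like sub-network: 12 blocks, 384 channels
# vit_s_state = extract_subnet(teacher.state_dict(), depth=12, width=384)
```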


Detection transfer results are also good.


How Does Elastic Student Facilitate Pre-training?

  • It acts as a training regularizer that stabilizes the training process.
  • Unlike existing self-distillation methods, the teacher in POA SSL integrates a series of sub-networks through an EMA update (see the sketch after this list).
    -> Digging into the code, several SSL recipes are used together: multi-crop, iBOT-style masked patch token prediction, DINO, and so on.
    -> It seems that only one elastic student is sampled from the candidates at each iteration (updating every elastic student would likely be inefficient, even if the sub-networks are small).
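
A minimal sketch of the EMA step, assuming the teacher is updated from the intact student's weights (which, through weight sharing, also contain every elastic sub-network); module names are illustrative.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, intact_student: torch.nn.Module, momentum: float = 0.996):
    # teacher <- m * teacher + (1 - m) * student, parameter by parameter.
    # Because the elastic students share weights with the intact student, the
    # sampled sub-network's updates are folded into the teacher through this step.
    for p_t, p_s in zip(teacher.parameters(), intact_student.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)
```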

