arxiv : https://arxiv.org/abs/2408.01031
github : https://github.com/alipay/POA

Motivation = the weight-sharing strategy in NAS

Similar study

Cross-view distillation = teacher <-> intact student (elastic student)
Same-view distillation = intact student <-> elastic student
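A minimal sketch of how the two distillation directions could be wired together, assuming DINO-style soft cross-entropy between output distributions and a single global/other view pair for brevity; the function names, temperatures, and view handling are illustrative assumptions, not the repo's actual API:

```python
import torch.nn.functional as F

def soft_ce(student_logits, target_logits, t_s=0.1, t_t=0.04):
    # DINO-style soft cross-entropy: sharpened, detached targets vs. student log-probs.
    target = F.softmax(target_logits.detach() / t_t, dim=-1)
    return -(target * F.log_softmax(student_logits / t_s, dim=-1)).sum(-1).mean()

def poa_distill_losses(teacher, intact_student, elastic_student, global_view, other_view):
    t_out = teacher(global_view)            # teacher sees the global view only
    s_out = intact_student(other_view)      # intact (full-size) student
    e_out = elastic_student(other_view)     # sampled elastic sub-network

    # Cross-view distillation: teacher <-> intact student and teacher <-> elastic student.
    cross_view = soft_ce(s_out, t_out) + soft_ce(e_out, t_out)
    # Same-view distillation: intact student <-> elastic student on the same view.
    same_view = soft_ce(e_out, s_out)
    return cross_view, same_view
```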
Depth = the number of blocks
Width = the number of channels
=> Combining the elastic widths and depths, we can generate a total of (N+1) x (M+1) distinct sub-networks.
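To make the count concrete, enumerating a hypothetical width/depth grid in plain Python (the specific width and depth values below are placeholders; only the 11 x 13 grid size matches the ViT entry listed under Model below):

```python
from itertools import product

# Hypothetical elastic options: 11 widths and 13 depths (values are illustrative).
widths = [384, 448, 512, 576, 640, 704, 768, 832, 896, 960, 1024]
depths = list(range(12, 25))  # 12..24 blocks

sub_networks = list(product(widths, depths))
print(len(sub_networks))  # 11 x 13 = 143 distinct sub-networks
```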
For training, multiple projection heads (MPH) are used.
For each head, the distillation loss L_Si is calculated for both the intact and the elastic student.
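A rough sketch of the MPH idea, assuming each head is a small MLP projector and the per-head losses are averaged; the module layout, head count, and temperatures are assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiProjectionHeads(nn.Module):
    """Several projection heads on top of the backbone output; each head produces
    its own prediction, and a distillation loss L_Si is computed per head."""
    def __init__(self, in_dim, out_dim, num_heads=4, hidden_dim=2048):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(in_dim, hidden_dim),
                nn.GELU(),
                nn.Linear(hidden_dim, out_dim),
            )
            for _ in range(num_heads)
        )

    def forward(self, x):
        return [head(x) for head in self.heads]  # one output per head


def mph_loss(student_outs, teacher_outs, t_s=0.1, t_t=0.04):
    """Average the per-head soft cross-entropy losses (DINO-style targets)."""
    losses = []
    for s, t in zip(student_outs, teacher_outs):
        target = F.softmax(t.detach() / t_t, dim=-1)
        losses.append(-(target * F.log_softmax(s / t_s, dim=-1)).sum(-1).mean())
    return torch.stack(losses).mean()
```

In training, the same aggregation would be applied to both the intact and the elastic student's head outputs against the teacher's.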
Model (elastic sub-network grid, i.e., width/depth combinations):
ViT (11 x 13)
Swin (3 x 13)
ResNet (3 x 155)
Probabilistic Sampling for Elastic Student
We derive the sub-networks ViT-S/16 and ViT-B/16 from the teacher ViT-L/16 without any additional pre-training.
-> In effect this is just distilling the Large model, but the key point seems to be that even with this training scheme, the extracted Small and Base models still achieve stable performance.
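A minimal sketch of the sampling-plus-extraction idea, assuming a (width, depth) pair is drawn per iteration and sub-networks reuse a prefix of the shared weights; the value lists, uniform sampling, and slicing rules are illustrative assumptions rather than the repo's extraction logic:

```python
import random
import torch
import torch.nn as nn

# Hypothetical elastic options (illustrative values).
WIDTHS = [384, 768, 1024]
DEPTHS = [12, 18, 24]

def sample_elastic_config():
    # One (width, depth) pair is drawn per iteration, e.g. uniformly.
    return random.choice(WIDTHS), random.choice(DEPTHS)

def slice_linear(layer: nn.Linear, out_dim: int, in_dim: int) -> nn.Linear:
    # Build a narrower Linear that reuses the first out_dim x in_dim weights of the big one.
    sub = nn.Linear(in_dim, out_dim, bias=layer.bias is not None)
    with torch.no_grad():
        sub.weight.copy_(layer.weight[:out_dim, :in_dim])
        if layer.bias is not None:
            sub.bias.copy_(layer.bias[:out_dim])
    return sub

def slice_blocks(blocks: nn.ModuleList, depth: int) -> nn.ModuleList:
    # Keep only `depth` blocks (here simply the first ones; the real selection rule may differ).
    return nn.ModuleList(list(blocks)[:depth])
```

Extracting ViT-S/16 or ViT-B/16 from the pre-trained ViT-L/16 teacher then amounts to taking the matching width/depth slice, which is why no additional pre-training is needed.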
Downstream detection performance is also good.
How Does Elastic Student Facilitate Pre-training?
It acts as a form of training regularization that stabilizes the training process.
Unlike existing self-distillation methods, the teacher in POA SSL integrates a series of sub-networks through the EMA update.
-> Looking through the code, it stacks several SSL objectives on top of each other: multi-crop, iBOT (masked patch token prediction), DINO, and so on.
-> It looks like only one elastic student is sampled per iteration rather than all of them being updated. (Updating every elastic student would probably be inefficient, no matter how small the sub-networks are.)
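Tying the last two points together, a minimal sketch of one training step under the assumption (as guessed above) that a single elastic configuration is sampled per iteration, with the teacher then EMA-updated from the intact student; the momentum value, helper callables, and loop structure are illustrative, not the repo's actual training loop:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # Teacher parameters track the intact student via an exponential moving average;
    # since the elastic students share weights with the intact student, the teacher
    # effectively integrates the whole series of sub-networks.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)

def train_step(teacher, intact_student, optimizer, batch, sample_cfg, build_elastic, loss_fn):
    width, depth = sample_cfg()                                     # one elastic config per step (assumed)
    elastic_student = build_elastic(intact_student, width, depth)   # weight-sharing view, not a copy

    loss = loss_fn(teacher, intact_student, elastic_student, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    ema_update(teacher, intact_student)  # teacher receives no gradients, only EMA updates
    return loss.item()
```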