How to load a 32-expert Swin-Transformer-MoE on a 2-GPU machine #248
Comments
If I load the Swin-Transformer-MoE checkpoint directly, an error occurs. To be a bit more detailed, it happens in load_pretrained(config, model_without_ddp, logger).
The pretrained checkpoint may be an old one that was compatible with a legacy Tutel version. Can you provide the checkpoint link you use, and the Swin command you load it with, if possible?
Hi, I have loaded the checkpoint from https://github.com/SwinTransformer/storage/releases/download/v2.0.2/swin_moe_small_patch4_window12_192_32expert_32gpu_22k.zip. I used the command: torchrun --nproc_per_node=1 --nnode=1 --master_port 12347 main_moe.py --cfg configs/swinmoe/swin_moe_small_patch4_window12_192_32expert_32gpu_1k 128.yaml --data-path imagenet --batch-size 128 --pretrained swin_moe_small_patch4_window12_192_32expert_32gpu_22k/swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth
Thank you for your modifications! I upgraded the Tutel installation and followed the steps to merge the checkpoint. However, the issue still persists. Upon printing the merged checkpoint, I noticed a mismatch with the model. The error suggests that the dimensions being merged may be incorrect, and that not all of the expert parameters have been combined properly.
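To pin down exactly which parameters disagree, one option is to compare the merged checkpoint against the model's state_dict key by key. The sketch below is illustrative only: the checkpoint path is a placeholder, the weights are assumed to sit under a "model" key as in the released Swin checkpoints, and model is the network built by main_moe.py (the model_without_ddp object mentioned above).

```python
import torch

def report_mismatches(model, ckpt_path):
    # Load the merged checkpoint on CPU; released Swin checkpoints keep the
    # weights under a "model" key, so fall back to the top level otherwise.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)
    model_state = model.state_dict()

    missing = sorted(k for k in model_state if k not in state_dict)
    unexpected = sorted(k for k in state_dict if k not in model_state)
    mismatched = [
        (k, tuple(state_dict[k].shape), tuple(model_state[k].shape))
        for k in state_dict
        if k in model_state
        and hasattr(state_dict[k], "shape")
        and tuple(state_dict[k].shape) != tuple(model_state[k].shape)
    ]

    print("missing from checkpoint :", missing[:10])
    print("unexpected in checkpoint:", unexpected[:10])
    print("shape mismatches        :", mismatched[:10])
```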
I think you may not have followed the instructions correctly. The examples in the tutorial should work exactly for your zip file. You should get a new checkpoint folder, new_swin_moe_small_for_2_gpus/, containing 3 files in total. Please check whether your new checkpoint folder matches the file list below:
$ ls -ls new_swin_moe_small_for_2_gpus/
total 2121952
202872 -rw-r--r-- 1 root root 207739264 Oct 28 04:20 swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth.master
959540 -rw-r--r-- 1 root root 982561440 Oct 28 04:19 swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth.rank0
959540 -rw-r--r-- 1 root root 982561440 Oct 28 04:19 swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth.rank1
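For reference, a minimal sketch of how such a split folder could be consumed at run time, assuming the .master file holds the shared (non-expert) weights and each .rankN file holds the expert slices owned by GPU N after re-splitting for 2 GPUs. This is only an illustration of the layout, not the repo's actual load_pretrained logic.

```python
import torch

def load_split_checkpoint(prefix, rank):
    # Assumption for illustration: ".master" carries the shared weights and
    # ".rankN" carries the expert parameters owned by GPU N.
    master = torch.load(prefix + ".master", map_location="cpu")
    local = torch.load(prefix + ".rank%d" % rank, map_location="cpu")
    merged = dict(master.get("model", master))
    merged.update(local.get("model", local))
    return merged

# Hypothetical usage on a 2-GPU run (rank comes from the torchrun launcher):
# state_dict = load_split_checkpoint(
#     "new_swin_moe_small_for_2_gpus/"
#     "swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth",
#     rank)
# model_without_ddp.load_state_dict(state_dict, strict=False)
```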
Feel free to let us know if the issue is still unsolved.
Thank you very much. The issue has been solved. |
Hi,
I have downloaded the checkpoint for the 32-expert Swin-Transformer-MoE. However, the checkpoint is split into per-rank sub-checkpoints distributed across 32 ranks. I want to load these sub-checkpoints and fine-tune the model on a 2-GPU machine.
To do this, should I gather the sub-checkpoints into a single checkpoint? I attempted to use the gather.py script, but it did not work. Could you help me understand what went wrong?
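For what it's worth, the gather step conceptually does something like the sketch below: expert weights saved by each of the 32 ranks are concatenated back together along the expert dimension, while the replicated (non-expert) weights are taken from a single rank. The key pattern and the expert dimension here are assumptions for illustration, not Tutel's actual gather.py.

```python
import torch

def gather_expert_checkpoints(rank_paths):
    # Load every per-rank shard on CPU, unwrapping the "model" key if present.
    shards = [torch.load(p, map_location="cpu") for p in rank_paths]
    shards = [s.get("model", s) for s in shards]

    merged = {}
    for k, v in shards[0].items():
        if ".experts." in k:
            # Expert parameters are sharded across ranks: stitch them back
            # together along the (assumed) expert dimension 0.
            merged[k] = torch.cat([s[k] for s in shards], dim=0)
        else:
            # Non-expert weights (attention, norms, router) are replicated,
            # so any single rank's copy will do.
            merged[k] = v
    return merged
```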
Additionally, I checked the original code and found that the condition "if k.endswith('._num_global_experts')" is returning False. Is this due to the format of the Swin-Transformer-MoE checkpoint? I'm quite confused about this.
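One way to see why that condition never fires is to list the checkpoint keys directly; if nothing ends with '._num_global_experts', that would be consistent with the maintainer's note above that the checkpoint predates the current Tutel format and needs to be converted first. The path below is a placeholder.

```python
import torch

# Placeholder path: point this at the checkpoint you are inspecting.
ckpt = torch.load(
    "swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth",
    map_location="cpu")
state_dict = ckpt.get("model", ckpt)

moe_keys = [k for k in state_dict if k.endswith("._num_global_experts")]
print(moe_keys if moe_keys else "no '._num_global_experts' keys found")
```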
Thank you for your assistance!