
How to load 32-experts Swin-transformer-moe on a 2-GPU machine. #248

ywxsuperstar opened this issue Oct 27, 2024 · 8 comments

@ywxsuperstar

Hi,

I have downloaded the checkpoint for a 32-expert Swin-Transformer-MOE. However, the checkpoints are dependent sub-checkpoints distributed across different ranks (32 ranks). I want to load these sub-checkpoints and fine-tune the model on a 2-GPU machine.

To do this, should I gather the sub-checkpoints into a single checkpoint? I attempted to use the script from gather.py, but it did not work. Could you help me understand what went wrong?

Additionally, I checked the original code and found that the condition "if k.endswith('._num_global_experts')" is returning false. Is this due to the format of the Swin-Transformer-MOE checkpoint? I'm quite confused about this.
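The key check described above can be sketched as follows. This is a minimal, hypothetical illustration (not the repository's actual loading code): a plain dict stands in for the result of `torch.load(path)['model']`, and it shows how the `endswith('._num_global_experts')` condition would return False for every key if a legacy-format checkpoint simply does not contain such an entry.

```python
# Hypothetical sketch: inspect a sub-checkpoint's keys to see why the
# "._num_global_experts" check might never match. In a real run the dict
# would come from torch.load(path); here a small dict stands in for it.

def find_expert_count_keys(state_dict):
    """Return the keys that record the global expert count, if any."""
    return [k for k in state_dict if k.endswith('._num_global_experts')]

# A legacy-format checkpoint may not contain such a key at all,
# which makes the endswith() condition False for every entry.
legacy_ckpt = {
    'layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_w': (1, 1536, 384),
    'layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_w': (1, 1536, 384),
}
print(find_expert_count_keys(legacy_ckpt))  # -> []
```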

Thank you for your assistance!

@ywxsuperstar
Author

If I load the Swin-Transformer-MOE checkpoint directly, the following error occurs.

In more detail, the error from load_pretrained(config, model_without_ddp, logger) is:
[rank0]: File "/ai_home/data/private/ywx/Swin-Transformer/utils_moe.py", line 217, in load_pretrained
[rank0]: msg = model.load_state_dict(state_dict, strict=False)
[rank0]: File "/opt/conda/envs/tutel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
[rank0]: raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank0]: RuntimeError: Error(s) in loading state_dict for SwinTransformerMoE:
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 3072, 768]) from checkpoint, the shape in current model is torch.Size([32, 3072, 768]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 3072, 768]) from checkpoint, the shape in current model is torch.Size([32, 3072, 768]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 3072]) from checkpoint, the shape in current model is torch.Size([32, 3072]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 768]) from checkpoint, the shape in current model is torch.Size([32, 768]).
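The mismatches above all follow one pattern: each sub-checkpoint holds a `[1, ...]` slice of the expert weights, while the single-process model expects all 32 experts stacked as `[32, ...]`. The merge this implies can be sketched as below. This is an illustrative sketch only, not the repository's `gather.py`: plain lists stand in for tensors so the shape arithmetic is easy to follow (with torch, the equivalent step would be `torch.cat(slices, dim=0)`).

```python
# Illustrative sketch: stack 32 per-rank expert slices of shape [1, H, W]
# into one [32, H, W] parameter, as the single-process model expects.

def merge_expert_param(per_rank_slices):
    """Concatenate per-rank expert slices along the expert (first) dim."""
    merged = []
    for s in per_rank_slices:
        merged.extend(s)  # torch equivalent: torch.cat(per_rank_slices, dim=0)
    return merged

# 32 ranks, each contributing one expert slice of shape [1, 1536, 384]
slices = [[('expert-weights', 1536, 384)] for _ in range(32)]
merged = merge_expert_param(slices)
print(len(merged))  # -> 32, matching the model's expected [32, 1536, 384]
```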

@ghostplant
Contributor

ghostplant commented Oct 27, 2024

The pretrained checkpoint may be an old one that was compatible with a legacy Tutel version. Can you provide the checkpoint link you used, and the SWIN command to load it, if possible?

@ywxsuperstar
Author

> The pretrained checkpoint may be old which was compatible with a legacy Tutel version. Can you provide the checkpoint link you use, and SWIN command to load it if possible?

Hi, I have loaded the checkpoint from (https://github.com/SwinTransformer/storage/releases/download/v2.0.2/swin_moe_small_patch4_window12_192_32expert_32gpu_22k.zip).

I used the command: torchrun --nproc_per_node=1 --nnode=1 --master_port 12347 main_moe.py --cfg configs/swinmoe/swin_moe_small_patch4_window12_192_32expert_32gpu_1k 128.yaml --data-path imagenet --batch-size 128 --pretrained swin_moe_small_patch4_window12_192_32expert_32gpu_22k/swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth
(For "swin_moe_small_patch4_window12_192_32expert_32gpu_1k", I used ImageNet-1k for fine-tuning, and I only modified the dataset.)
If you have any suggestions or improvements, please let me know. Thank you!


@ghostplant
Contributor

I just merged a PR (#249) to ensure checkpoints are compatible with the legacy format.

Can you upgrade your Tutel installation, follow these new steps to convert the SWIN checkpoints, and see if it works?

@ywxsuperstar
Author

> I just merged a PR (#249) to ensure checkpoint compatible with legacy format.
>
> Can you upgrade tutel installation, and follow this New Steps to convert the SWIN checkpoints and see if it works?

Thank you for your modifications! I upgraded the Tutel installation and followed the steps to merge the checkpoint. However, the issue still persists. When I print the merged checkpoint, I see a mismatch with the model.

Error:
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 3072, 768]) from checkpoint, the shape in current model is torch.Size([32, 3072, 768]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 3072, 768]) from checkpoint, the shape in current model is torch.Size([32, 3072, 768]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 3072]) from checkpoint, the shape in current model is torch.Size([32, 3072]).

Therefore, it seems that the dimensions being merged may be incorrect, and not all of the experts' parameters have been combined properly.

@ghostplant
Contributor

ghostplant commented Oct 28, 2024

I think you may not have followed the instructions correctly. The examples in the tutorial should work exactly for your zip file. You should get a new checkpoint folder new_swin_moe_small_for_2_gpus/ containing 3 files in total. Please check whether your new checkpoint folder matches the file list below:

$ ls -ls new_swin_moe_small_for_2_gpus/
total 2121952
202872 -rw-r--r-- 1 root root 207739264 Oct 28 04:20 swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth.master
959540 -rw-r--r-- 1 root root 982561440 Oct 28 04:19 swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth.rank0
959540 -rw-r--r-- 1 root root 982561440 Oct 28 04:19 swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth.rank1
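A quick sanity check of that layout can be scripted as below. This is a hedged sketch: it only verifies that the expected `.master` file plus one `.rank{i}` file per target GPU exist, mirroring the listing above. The paths are illustrative, and a temporary directory stands in for new_swin_moe_small_for_2_gpus/.

```python
# Sketch: check that a converted checkpoint folder contains a .master file
# and one .rank{i} file per target GPU. A temp directory stands in for
# new_swin_moe_small_for_2_gpus/ so the example is self-contained.
import os
import tempfile

def check_converted_folder(folder, base, num_gpus):
    """True if base.master and base.rank0..rank{num_gpus-1} all exist."""
    expected = [f'{base}.master'] + [f'{base}.rank{i}' for i in range(num_gpus)]
    return all(os.path.exists(os.path.join(folder, f)) for f in expected)

with tempfile.TemporaryDirectory() as d:
    base = 'swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth'
    for suffix in ('.master', '.rank0', '.rank1'):
        open(os.path.join(d, base + suffix), 'w').close()  # create empty stand-ins
    ok = check_converted_folder(d, base, num_gpus=2)
print(ok)  # -> True
```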

@ghostplant
Contributor

Feel free to let us know if the issue is still unsolved.

@ywxsuperstar
Author

Thank you very much. The issue has been solved.
