
Using batch_first=True for torch native transformer blocks seems to result in non-complete summaries #300

Open
cwestergren opened this issue Feb 4, 2024 · 0 comments


cwestergren commented Feb 4, 2024

Describe the bug
Using batch_first=True for torch native transformer blocks seems to result in non-complete summaries. When setting batch_first=False, it looks good.

To Reproduce
Steps to reproduce the behavior:

import torch.nn as nn
from torchinfo import summary

encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072, activation='gelu', batch_first=True, norm_first=True) # For some weird reason, the transformer block with batch first fails.
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
summary(transformer_encoder, input_size=(1,197,768), col_names=['input_size', 'output_size', 'num_params', 'trainable'])

Toggling the batch_first flag in the snippet above gives two different summaries. With batch_first=False, the full per-layer breakdown is correctly printed:

============================================================================================================================================
Layer (type:depth-idx) Input Shape Output Shape Param # Trainable

TransformerEncoder [1, 197, 768] [1, 197, 768] -- True
├─ModuleList: 1-1 -- -- -- True
│ └─TransformerEncoderLayer: 2-1 [1, 197, 768] [1, 197, 768] -- True
│ │ └─LayerNorm: 3-1 [1, 197, 768] [1, 197, 768] 1,536 True
│ │ └─MultiheadAttention: 3-2 [1, 197, 768] [1, 197, 768] 2,362,368 True
│ │ └─Dropout: 3-3 [1, 197, 768] [1, 197, 768] -- --
│ │ └─LayerNorm: 3-4 [1, 197, 768] [1, 197, 768] 1,536 True
│ │ └─Linear: 3-5 [1, 197, 768] [1, 197, 3072] 2,362,368 True
│ │ └─Dropout: 3-6 [1, 197, 3072] [1, 197, 3072] -- --
│ │ └─Linear: 3-7 [1, 197, 3072] [1, 197, 768] 2,360,064 True
│ │ └─Dropout: 3-8 [1, 197, 768] [1, 197, 768] -- --

With batch_first=True, I get a different behaviour:

============================================================================================================================================
Layer (type:depth-idx) Input Shape Output Shape Param # Trainable

TransformerEncoder [1, 197, 768] [1, 197, 768] -- True
├─ModuleList: 1-1 -- -- -- True
│ └─TransformerEncoderLayer: 2-1 [1, 197, 768] [1, 197, 768] 7,087,872 True
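(Note that the collapsed row still reports the same total parameter count as the full table above: 1,536 + 2,362,368 + 1,536 + 2,362,368 + 2,360,064 = 7,087,872. Only the per-layer breakdown and the per-layer input/output shapes are lost.)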

Expected behavior
I would expect the two printouts to be the same.
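For completeness, a minimal self-contained sketch that runs both settings back to back (assuming the torchinfo summary API used above; shapes mirror the snippet in the reproduction steps):

import torch.nn as nn
from torchinfo import summary

# Build the same encoder twice, toggling only batch_first, and summarize both.
for batch_first in (False, True):
    encoder_layer = nn.TransformerEncoderLayer(
        d_model=768, nhead=12, dim_feedforward=3072, activation='gelu',
        batch_first=batch_first, norm_first=True)
    transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
    print(f"batch_first={batch_first}")
    summary(transformer_encoder, input_size=(1, 197, 768),
            col_names=['input_size', 'output_size', 'num_params', 'trainable'])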

Desktop (please complete the following information):

  • Google Colab on Edge.