
Sequence Parallel Forward Transformer #3338

Merged: 6 commits into NVIDIA:main on Nov 19, 2024

Conversation

@cowanmeg (Collaborator) commented on Nov 4, 2024:

Sequence parallel forward transformer layer and multi-headed attention tests.

  1. Cleans up sharding annotations in Forward fusion definitions. Only sharding changes and inputs are explicitly sharded.
  2. Updates the outputs of mha and mlp to be structs with named TVs to make the code more readable (see the sketch after this list).
  3. Dropout probability is temporarily set to 0. This will be fixed in a later PR to use philox seed and offset with validation.
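
A minimal sketch of the named-TV return from item 2, under stated assumptions: TensorView is forward-declared here as a stand-in for nvFuser's type, the struct name MhaResult and every member except output are hypothetical, and only the .output member is evidenced by the diff reviewed below. This is an illustration, not the PR's actual definition.

struct TensorView;  // stand-in for nvFuser's TensorView type

// Returning a struct of named TVs instead of a positional std::vector<TensorView*>
// lets call sites write mha(...).output rather than indexing, e.g. mha(...)[3].
struct MhaResult {
  TensorView* linear0 = nullptr;  // hypothetical: first linear projection
  TensorView* sdpa = nullptr;     // hypothetical: scaled dot-product attention output
  TensorView* linear1 = nullptr;  // hypothetical: output projection
  TensorView* output = nullptr;   // the TV call sites previously fetched via [3]
};

// Usage at a call site (mirrors the diff shown later in this conversation):
//   auto mha_out = mha(mha_in, mha_w0, mha_b0, mha_w1, mha_b1, mesh).output;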

@cowanmeg (Collaborator, Author):

!build

@wujingyue (Collaborator) left a comment:

Thanks! I'm still reviewing the MHA part...

(Several review threads on tests/cpp/test_multidevice_transformer.cpp; all resolved, one marked outdated.)
@@ -1074,11 +1305,11 @@ TEST_P(DistributedTransformerTest, Forward) {
   auto ln_input = castOp(DataType::Float, x);
   auto ln0 = layer_norm(ln_input, norm_shape, ln0_w, ln0_b, eps);
   auto mha_in = castOp(dtype, ln0.output);
-  auto mha_out = mha(mha_in, mha_w0, mha_b0, mha_w1, mha_b1, mesh)[3];
+  auto mha_out = mha(mha_in, mha_w0, mha_b0, mha_w1, mha_b1, mesh).output;
A collaborator replied:

👍

@cowanmeg (Collaborator, Author):

!build

@cowanmeg merged commit 6f0909e into NVIDIA:main on Nov 19, 2024
16 checks passed
@liqiangxl (Collaborator):

check DistributedTransformerTest.MultiheadAttention_SP/__half
!test

@liqiangxl (Collaborator):

!test

Priya2698 pushed a commit that referenced this pull request on Nov 20, 2024: Sequence parallel forward transformer layer and multi-headed attention tests.

jacobhinkle pushed a commit that referenced this pull request on Dec 3, 2024: Sequence parallel forward transformer layer and multi-headed attention tests.
