Adding the new feature of FPDT #441
base: main
Conversation
…on for supporting batch size larger than 1
Hi @YJHMITWEB, is FPDT referring to this paper? https://ui.adsabs.harvard.edu/abs/2023JARS...17b6510H/abstract
@YJHMITWEB Do we need changes in …
megatron/initialize.py (Outdated)
@@ -349,9 +349,12 @@ def _warmup_jit_function():
    dtype = torch.float32

    # Warmup fused bias+gelu
    seq_length = args.seq_length
    if args.ds_sequence_parallel_fpdt:
        seq_length = 8192
Can you define this as another variable like "FPDT_SEQ_LEN" and add a comment describing why we have this setting?
This is fixed by setting it to ds_sequence_parallel_fpdt_chunk_size when FPDT is enabled.
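For readers following along, here is a minimal sketch of the warmup-length selection being discussed. The flag and chunk-size names come from this thread; the helper function name and the surrounding structure are illustrative assumptions, not the exact code in megatron/initialize.py.

```python
def _select_warmup_seq_length(args):
    # Hypothetical helper: default is to warm up fused kernels at the
    # full training sequence length.
    seq_length = args.seq_length
    if getattr(args, "ds_sequence_parallel_fpdt", False):
        # With FPDT, each rank processes one chunk at a time, so warming
        # up at the configured chunk size (instead of a hard-coded 8192)
        # matches the shapes seen during training.
        seq_length = args.ds_sequence_parallel_fpdt_chunk_size
    return seq_length
```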
@@ -32,7 +35,9 @@ def forward(self, max_seq_len, offset=0):
    emb = torch.cat((freqs, freqs), dim=-1)
    # emb [seq_length, .., dim]
    from einops import rearrange
-   return rearrange(emb, 'n d -> n 1 1 d')
+   base = rearrange(emb, 'n d -> n 1 1 d')
Will this change the output when using --use-rotary-position-embeddings in a Llama-style model?
FYI https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples_deepspeed/pretrain_llama2_distributed.sh
We have tested both GPT and Llama models; this works well with both.
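A hedged sketch of why the change above is output-preserving for the non-FPDT rotary path: the rearranged embedding is simply bound to a name before being returned, so the default path yields the same tensor as the old `return rearrange(...)`. The function name, the `fpdt_offset` parameter, and the slicing in the FPDT branch are assumptions for illustration, not the PR's exact code.

```python
import torch
from einops import rearrange

def rotary_forward(freqs, fpdt_offset=None):
    # freqs: [seq_length, dim // 2] rotary frequencies, as in the diff above.
    emb = torch.cat((freqs, freqs), dim=-1)   # [seq_length, dim]
    base = rearrange(emb, 'n d -> n 1 1 d')   # [seq_length, 1, 1, dim]
    if fpdt_offset is None:
        # Non-FPDT path: identical tensor to the old `return rearrange(...)`,
        # which is why --use-rotary-position-embeddings output is unchanged.
        return base
    # FPDT path (assumed): keep the full table so each chunk can take
    # its own slice of positions starting at `fpdt_offset`.
    return base[fpdt_offset:]
```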
@delock, no, FPDT refers to this paper: https://arxiv.org/abs/2408.16978, aka Ulysses-Offload.
Thanks @samadejacobs for pointing this out.
@microsoft-github-policy-service agree
[FPDT](https://arxiv.org/abs/2408.16978) can only be used with [this version](microsoft/Megatron-DeepSpeed#441) of Megatron-DeepSpeed.

---------

Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: Sam Ade Jacobs <[email protected]>
Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
    key_layer = mixed_x_layer[:, :, self.projection_size:self.projection_size+self.kv_projection_size].reshape(seq_len, bs, -1, self.head_dim)
    value_layer = mixed_x_layer[:, :, self.projection_size+self.kv_projection_size:].reshape(seq_len, bs, -1, self.head_dim)
    if self.sequence_parallel or not self.enable_ds_sequence_parallel:
        seq_len, bs = mixed_x_layer.shape[0], mixed_x_layer.shape[1]
It might be worthwhile to keep the "split_tensor" implementation as the default, i.e. for the non-FPDT scenario.
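A minimal sketch of that suggestion, assuming the attribute names shown in the diff above (`projection_size`, `kv_projection_size`, `head_dim`): keep a last-dimension split as the default path and use explicit slicing only in the FPDT branch. The `split_qkv` helper, the use of `Tensor.split` in place of Megatron's split utility, and the query slice are illustrative assumptions, not the PR's exact code.

```python
def split_qkv(self, mixed_x_layer, use_fpdt):
    # mixed_x_layer: [seq_len, bs, projection_size + 2 * kv_projection_size]
    seq_len, bs = mixed_x_layer.shape[0], mixed_x_layer.shape[1]
    if not use_fpdt:
        # Default (non-FPDT) path: split along the last dimension, in the
        # spirit of the existing "split_tensor" implementation.
        query_layer, key_layer, value_layer = mixed_x_layer.split(
            [self.projection_size, self.kv_projection_size, self.kv_projection_size],
            dim=-1)
    else:
        # FPDT path: explicit slicing, mirroring the diff above.
        query_layer = mixed_x_layer[:, :, :self.projection_size]
        key_layer = mixed_x_layer[:, :, self.projection_size:
                                  self.projection_size + self.kv_projection_size]
        value_layer = mixed_x_layer[:, :, self.projection_size + self.kv_projection_size:]
    # Reshape each to [seq_len, bs, num_heads, head_dim].
    return (query_layer.reshape(seq_len, bs, -1, self.head_dim),
            key_layer.reshape(seq_len, bs, -1, self.head_dim),
            value_layer.reshape(seq_len, bs, -1, self.head_dim))
```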
FPDT can only work with this version of DeepSpeed.