Adding the new feature of FPDT #441
base: main
Conversation
…on for supporting batch size larger than 1
Hi @YJHMITWEB, is FPDT referring to this paper? https://ui.adsabs.harvard.edu/abs/2023JARS...17b6510H/abstract
@YJHMITWEB Do we need changes in …
megatron/initialize.py (Outdated)
@@ -349,9 +349,12 @@ def _warmup_jit_function():
    dtype = torch.float32

    # Warmup fused bias+gelu
    seq_length = args.seq_length
    if args.ds_sequence_parallel_fpdt:
        seq_length = 8192
Can you define this as another variable like "FPDT_SEQ_LEN" and add a comment describing why we have this setting?
This is fixed by setting it to ds_sequence_parallel_fpdt_chunk_size when FPDT is enabled.
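For readers following along, here is a minimal sketch of the warmup-length selection being discussed. The flag and chunk-size names come from this thread; the helper function name and the surrounding structure are illustrative assumptions, not the exact code in megatron/initialize.py.

```python
def _select_warmup_seq_length(args):
    # Hypothetical helper: default is to warm up fused kernels at the
    # full training sequence length.
    seq_length = args.seq_length
    if getattr(args, "ds_sequence_parallel_fpdt", False):
        # With FPDT, each rank processes one chunk at a time, so warming
        # up at the configured chunk size (instead of a hard-coded 8192)
        # matches the shapes seen during training.
        seq_length = args.ds_sequence_parallel_fpdt_chunk_size
    return seq_length
```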
@@ -32,7 +35,9 @@ def forward(self, max_seq_len, offset=0):
    emb = torch.cat((freqs, freqs), dim=-1)
    # emb [seq_length, .., dim]
    from einops import rearrange
-   return rearrange(emb, 'n d -> n 1 1 d')
+   base = rearrange(emb, 'n d -> n 1 1 d')
Will this change the output when using --use-rotary-position-embeddings in a Llama-style model?
FYI https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples_deepspeed/pretrain_llama2_distributed.sh
We have tested both GPT and Llama models; this works well with both.
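A hedged sketch of why the change above is output-preserving for the non-FPDT rotary path: the rearranged embedding is simply bound to a name before being returned, so the default path yields the same tensor as the old `return rearrange(...)`. The function name, the `fpdt_offset` parameter, and the slicing in the FPDT branch are assumptions for illustration, not the PR's exact code.

```python
import torch
from einops import rearrange

def rotary_forward(freqs, fpdt_offset=None):
    # freqs: [seq_length, dim // 2] rotary frequencies, as in the diff above.
    emb = torch.cat((freqs, freqs), dim=-1)   # [seq_length, dim]
    base = rearrange(emb, 'n d -> n 1 1 d')   # [seq_length, 1, 1, dim]
    if fpdt_offset is None:
        # Non-FPDT path: identical tensor to the old `return rearrange(...)`,
        # which is why --use-rotary-position-embeddings output is unchanged.
        return base
    # FPDT path (assumed): keep the full table so each chunk can take
    # its own slice of positions starting at `fpdt_offset`.
    return base[fpdt_offset:]
```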
@delock, no, FPDT refers to this paper: https://arxiv.org/abs/2408.16978, aka Ulysses-Offload.
Thanks @samadejacobs for pointing this out.
@microsoft-github-policy-service agree
[FPDT](https://arxiv.org/abs/2408.16978) can only be used with [this version](microsoft/Megatron-DeepSpeed#441) of Megatron-DeepSpeed.

---------

Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: Sam Ade Jacobs <[email protected]>
Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
    key_layer = mixed_x_layer[:, :, self.projection_size:self.projection_size+self.kv_projection_size].reshape(seq_len, bs, -1, self.head_dim)
    value_layer = mixed_x_layer[:, :, self.projection_size+self.kv_projection_size:].reshape(seq_len, bs, -1, self.head_dim)
    if self.sequence_parallel or not self.enable_ds_sequence_parallel:
        seq_len, bs = mixed_x_layer.shape[0], mixed_x_layer.shape[1]
It might be worthwhile to keep the "split_tensor" implementation as the default, i.e. for the non-FPDT scenario.
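A minimal sketch of that suggestion, assuming the attribute names shown in the diff above (`projection_size`, `kv_projection_size`, `head_dim`): keep a last-dimension split as the default path and use explicit slicing only in the FPDT branch. The `split_qkv` helper, the use of `Tensor.split` in place of Megatron's split utility, and the query slice are illustrative assumptions, not the PR's exact code.

```python
def split_qkv(self, mixed_x_layer, use_fpdt):
    # mixed_x_layer: [seq_len, bs, projection_size + 2 * kv_projection_size]
    seq_len, bs = mixed_x_layer.shape[0], mixed_x_layer.shape[1]
    if not use_fpdt:
        # Default (non-FPDT) path: split along the last dimension, in the
        # spirit of the existing "split_tensor" implementation.
        query_layer, key_layer, value_layer = mixed_x_layer.split(
            [self.projection_size, self.kv_projection_size, self.kv_projection_size],
            dim=-1)
    else:
        # FPDT path: explicit slicing, mirroring the diff above.
        query_layer = mixed_x_layer[:, :, :self.projection_size]
        key_layer = mixed_x_layer[:, :, self.projection_size:
                                  self.projection_size + self.kv_projection_size]
        value_layer = mixed_x_layer[:, :, self.projection_size + self.kv_projection_size:]
    # Reshape each to [seq_len, bs, num_heads, head_dim].
    return (query_layer.reshape(seq_len, bs, -1, self.head_dim),
            key_layer.reshape(seq_len, bs, -1, self.head_dim),
            value_layer.reshape(seq_len, bs, -1, self.head_dim))
```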
FPDT can only work with this version of DeepSpeed.