
inner cache implemented #858

Merged
merged 16 commits from ip3 into pytorch:pp_tp_optimization on Aug 11, 2023
Conversation

moonbucks
Contributor

Description

Instead of direct passing, use an internal cache.
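
A minimal sketch of the idea, using hypothetical names (the real cache lives in the pipeline stage code touched by this PR): the producer stores each chunk's output in an internal fwd_cache keyed by chunk id, and the consumer looks it up later instead of receiving it as a direct argument.

from typing import Any, Dict

class InnerCacheSketch:
    """Illustrative only: keep per-chunk outputs in an internal cache."""

    def __init__(self, submod):
        self.submod = submod
        self.fwd_cache: Dict[int, Any] = {}  # chunk id -> cached output

    def produce_chunk(self, chunk: int, inp: Any) -> None:
        # Store the result internally instead of passing it along directly.
        self.fwd_cache[chunk] = self.submod(inp)

    def consume_chunk(self, chunk: int) -> Any:
        # A later step retrieves (and clears) the cached output by chunk id.
        return self.fwd_cache.pop(chunk)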

Type of change

  • [v] New feature (non-breaking change which adds functionality)

Checklist:

  • [v] Have you added tests that prove your fix is effective or that this feature works?
  • [v] Has code been commented, particularly in hard-to-understand areas?
  • [v] Have you made corresponding changes to the documentation?

# Find my submodule
self.split_gm = self.pipe.split_gm
named_children = list(self.split_gm.named_children())

Contributor

is this still needed?

Contributor Author

@fduwjj the comment is left on a blank line. If it points to L91-L92 or L95, we will leave it for safety reasons until we modify all functions to use a 'list of submods' instead of a single submod.
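
For context, a hedged sketch of what using a 'list of submods' could look like (hypothetical; split_gm comes from the snippet above, is_owned is an illustrative placeholder, not part of this PR):

# Collect every submodule this rank owns instead of keeping a single one.
my_submods = [
    submod
    for name, submod in self.split_gm.named_children()
    if self.is_owned(name)  # hypothetical ownership check
]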

Contributor

I am indeed asking about L95.

@moonbucks
Contributor Author

code rebased

Comment on lines 817 to 838
if self.inner_depth > 1:
    # Forward pass of all chunks
    for chunk in range(self.chunks):
        s = self.streams[chunk % self.nstreams]
        with torch.cuda.stream(s):
            output, send_reqs = self.forward_one_chunk_ipipe(
                chunk, args_split, kwargs_split, fwd_cache
            )
        all_send_reqs += send_reqs
        # Prepare for final output merge or reduction
        output_chunks[chunk] = output
else:
    # Forward pass of all chunks
    for chunk in range(self.chunks):
        s = self.streams[chunk % self.nstreams]
        with torch.cuda.stream(s):
            output, send_reqs = self.forward_one_chunk(
                chunk, args_split, kwargs_split, fwd_cache
            )
        all_send_reqs += send_reqs
        # Prepare for final output merge or reduction
        output_chunks[chunk] = output

Contributor

The code has a lot of duplication here; can you kindly consolidate it a bit?

Contributor Author

Cleaned up in the subsequent commit.
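
For reference, one way the duplicated loop could be collapsed, as a hedged sketch (the actual cleanup is in the subsequent commit; names follow the snippet above):

# Pick the per-chunk forward function once, then run a single loop.
forward_fn = (
    self.forward_one_chunk_ipipe if self.inner_depth > 1 else self.forward_one_chunk
)
# Forward pass of all chunks
for chunk in range(self.chunks):
    s = self.streams[chunk % self.nstreams]
    with torch.cuda.stream(s):
        output, send_reqs = forward_fn(chunk, args_split, kwargs_split, fwd_cache)
    all_send_reqs += send_reqs
    # Prepare for final output merge or reduction
    output_chunks[chunk] = output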

    kwargs_split,
    fwd_cache: Dict[int, Any],
):
    if self.rank == self.nstages - 1:

Contributor
@fduwjj Aug 9, 2023

If I understand correctly, self.nstages is the number of global stages? If so, does this line of code ever get hit?

Contributor Author

self.nstages was already used in the original codebase, so I added a new variable self.global_depth to store the global number of stages. self.nstages is the same as pp_group_size (the number of ranks).
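
To summarize my reading of the naming (hedged, not verified against the full diff):

# self.nstages      -> pp_group_size, i.e. the number of pipeline ranks (unchanged meaning)
# self.inner_depth  -> stages handled within one rank; 1 means no inner pipelining
# self.global_depth -> the global number of stages, stored separately from self.nstages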


try:
    if self.rank == self.nstages - 1:  # last stage
        output = self.forward_maybe_with_nosync(

Contributor

Do you need an inner_depth == 0 check here?

Contributor Author

No, we don't. inner_depth == 1 is the default (no inner pipelining), and when inner_depth == 1 this function is not called; forward_one_chunk is called instead.

Contributor
@fduwjj left a comment

LGTM. Maybe we can consolidate and extract the common logic to make the code look cleaner?

@moonbucks merged commit b740566 into pytorch:pp_tp_optimization Aug 11, 2023
21 of 25 checks passed
@moonbucks deleted the ip3 branch August 11, 2023 01:43