PyTorch native 2D LLaMA inference #922
base: main
Conversation
examples/llama/2d_llama.py (outdated)
f"{layer_name}_mlp_down_proj": RowwiseParallel(), | ||
}) | ||
tp_mesh = mesh_2d["tp"] | ||
parallelize_module(stage.submod, tp_mesh, plan) |
What you need to do for a submod is:
self.n_local_heads = self.n_local_heads // tp_degree
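For context, a rough sketch of how that suggestion could be applied to each stage before tracing (stage.submod, tp_mesh, and the n_local_heads attribute are assumptions carried over from the snippets above):

# Hypothetical sketch: shrink the per-rank head count on every attention
# module of this stage's submodule, so the view into
# [bsz, seq, heads, head_dim] matches the column-sharded projection output.
tp_degree = tp_mesh.size()
for mod in stage.submod.modules():
    if hasattr(mod, "n_local_heads"):
        mod.n_local_heads = mod.n_local_heads // tp_degree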
Tried, but it doesn't seem to work in this case.
- Trial 1: modify num_heads before tracing
The tracer errors out with:
shape '[4, 4, 8, 128]' is invalid for input of size 65536
i.e. the tracer is smart enough to do a size check; here the input size is still the old size, so they mismatch.
- Trial 2: modify num_heads after tracing
This won't work because the old value of num_heads has been burned into the generated program during tracing.
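To illustrate why Trial 2 fails, here is a minimal standalone sketch (using torch.fx.symbolic_trace as a stand-in for the actual tracer): attribute values read during tracing become constants in the generated program.

import torch
from torch import fx, nn

class ToyAttnView(nn.Module):
    # Toy module that only models the reshape into [bsz, seq, num_heads, head_dim].
    def __init__(self, num_heads=32, head_dim=128):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = head_dim

    def forward(self, x):
        # self.num_heads is read as a plain Python int during tracing, so its
        # value is baked into the traced "view" node as a constant argument.
        return x.view(x.size(0), x.size(1), self.num_heads, self.head_dim)

m = ToyAttnView()
gm = fx.symbolic_trace(m)
print(gm.code)   # ...view(size, size_1, 32, 128) -- 32 is now a literal
m.num_heads = 8  # changing the attribute after tracing ...
print(gm.code)   # ... leaves the traced program unchanged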
What works:
Modify the view and reshape ops' arguments after tracing.
E.g. modified the "view" node's args from:
(l__self___model_layers_0_self_attn_q_proj, 4, 4, 32, 128) to
(l__self___model_layers_0_self_attn_q_proj, 4, 4, 8, 128)
See the added util modify_view.
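For reference, a rough sketch of what such a utility might look like as an FX graph pass (an illustration only, not the exact modify_view added in this PR):

import torch.fx as fx

def modify_view_sketch(gm: fx.GraphModule, old_heads: int, new_heads: int) -> fx.GraphModule:
    # Hypothetical pass: rewrite the head-count argument of every traced
    # view/reshape call so it matches the per-rank head count after TP.
    for node in gm.graph.nodes:
        if node.op == "call_method" and node.target in ("view", "reshape"):
            node.args = tuple(new_heads if a == old_heads else a for a in node.args)
    gm.graph.lint()
    gm.recompile()
    return gm

This also shows why the approach is fragile: any view that happens to use the same integer for an unrelated dimension would be rewritten as well.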
Please see the inlined comments. I don't think we should have this modify_view operation to compose TP and PP; it's very dangerous to users. I think we should probably not trace into the TransformerBlock and leave it for TP, or make TP happen first.
examples/llama/2d_llama.py (outdated)
# Utility
def modify_view(
Hmmmm, this is a very dangerous operation; we shouldn't compose PP + TP like this, IMO. In particular, this would modify any view operation in the traced graph, thereby making the graph very unsafe for the user. It is also not scalable to other models: a non-llama model might have view operations in non-attention layers, which would either trigger the assertion failure or be wrongly modified.
Thanks much @wanchaol for the comments. I agree with your thoughts. modify_view is more of a workaround than a general solution. For a better one, I wonder if the following direction is worth consideration.
Today, for TP to work well with attention, the view ops in user code need to be modified. This is true for both eager mode and graphed mode. I wonder if it would be possible for the output of ColwiseParallel to play well with unchanged view ops. This may require the output of ColwiseParallel to stay in some kind of DTensor form (rather than a regular tensor), and for this DTensor form to have a special rule for the view ops (i.e. recognizing that some dimension of the view op needs to be adjusted based on how the source operation is distributed).
Any thoughts here?
Cc @fduwjj
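For concreteness, a rough sketch of that direction using the existing use_local_output knob on ColwiseParallel (the module names and variables below are illustrative assumptions, not the PR's actual plan):

from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# Sketch: keep the q/k/v projection outputs as DTensors so that DTensor's
# sharding propagation can handle the subsequent view over the head
# dimension, instead of patching traced view nodes.
plan = {
    "self_attn.q_proj": ColwiseParallel(use_local_output=False),
    "self_attn.k_proj": ColwiseParallel(use_local_output=False),
    "self_attn.v_proj": ColwiseParallel(use_local_output=False),
    "self_attn.o_proj": RowwiseParallel(),
}
parallelize_module(transformer_block, tp_mesh, plan)  # transformer_block, tp_mesh assumed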
Removed modify_view() now
examples/llama/2d_llama.py (outdated)
mesh_2d = init_device_mesh("cuda", (pp_group_size, tp_group_size), mesh_dim_names=("pp", "tp"))
pp_group = mesh_2d["pp"].get_group()
llama.to(device).eval()
it would be great to start envisioning the deferred init for the next step.
Agree. We are working towards that direction. See for example PR #923 and the BERT example in it.
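As a rough illustration of the deferred-init idea (an assumed pattern based on the meta device, not necessarily what PR #923 does):

import torch
from torch import nn

# Build the model on the meta device so no real storage is allocated ...
with torch.device("meta"):
    llama = nn.Sequential(*[nn.Linear(4096, 4096, bias=False) for _ in range(8)])

# ... then materialize only this rank's pipeline stage (a toy split here)
# and load its weights from a checkpoint afterwards.
stage_mod = llama[0:2]
stage_mod = stage_mod.to_empty(device="cuda")  # allocates uninitialized storage on the GPU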
Functionality-wise it is working for me, and LGTM.
# We set this flag to true to allow operations on a mix of tensor and dtensor
# arguments. The mix is a result of `use_local_output=False`
DTensor._op_dispatcher._allow_implicit_replication = True
I don't think we should set this flag; it's a hack and is only supposed to be used by FSDP...
what does it do?
# HACK: we convert DTensor to regular tensor here for it to
# work with send ops. DTensor may show up in PP + TP cases.
out.to_local()
if isinstance(out, torch.distributed._tensor.DTensor)
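For context, a hypothetical helper showing what the complete expression presumably looks like (not the PR's actual code):

from torch.distributed._tensor import DTensor

def to_sendable(out):
    # Materialize the local shard so that point-to-point send ops receive a
    # plain torch.Tensor instead of a DTensor.
    return out.to_local() if isinstance(out, DTensor) else out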
I think we should understand why isend would get a DTensor; if the pipeline splits at each TransformerBlock, it should not get DTensors as inputs.
I still think the cleanest fix here is to make PP tracing +
If they are all tensors,
Documenting my discussion with @wanchaol wrt DTensor and @kwen2501:
Current status: Working
Previous issues:
TP self attention was hitting the following issue:
4 * 4 * 32 * 128 = 65536
65536 / 4 = 16384 (4 is my TP size)
so that explains it.
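A quick standalone repro of that arithmetic (a sketch using a stand-in tensor rather than the real q_proj output):

import torch

# Per-rank q_proj output after colwise sharding with TP degree 4:
# 4 * 4 * (4096 / 4) = 16384 elements.
local_q = torch.randn(4, 4, 4096 // 4)
try:
    local_q.view(4, 4, 32, 128)   # still asks for 4 * 4 * 32 * 128 = 65536 elements
except RuntimeError as e:
    print(e)                      # shape '[4, 4, 32, 128]' is invalid for input of size 16384
local_q.view(4, 4, 32 // 4, 128)  # works once the head count is divided by the TP degree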
User code:
Cc: @fduwjj @wanchaol @HamidShojanazeri @wconstab
Can you shed light here?
@fduwjj mentioned that we would need to modify self.n_local_heads to be 4 times smaller -- whether in the eager case or the traced case.
In the traced case, I can modify the view node to change its arg, for example, 32 -> 8. That's slightly better than asking the user to modify model code. But is there a better way?