Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suboptimal segmentation for RoPE #2599

Closed
wujingyue opened this issue Jul 15, 2024 · 2 comments
Closed

Suboptimal segmentation for RoPE #2599

wujingyue opened this issue Jul 15, 2024 · 2 comments
Assignees
Labels
Indexing Ops rope Segmentation Issues related to nvFuser Segmentation Triage

Comments

@wujingyue
Copy link
Collaborator

wujingyue commented Jul 15, 2024

TL;DR

the RoPE module in Llama-2 (and possibly other configs as well) is over-segmented into six pointwise kernels. Two should suffice. This issue blocks Lightning-AI/lightning-thunder#731, although we don't necessarily have to reach two to unblock.

To reproduce

$ pwd
/opt/pytorch/lightning-thunder
$ git checkout wjy/bookend
$ NVFUSER_DUMP=segmented_fusion pytest thunder/benchmarks/targets.py -k test_litgpt_qkv_split_rope[Llama-2-13b-hf-forward-bs2-thunder] -s | grep 'g{(pointwise' | wc -l
6

Problem

The following figure shows the current suboptimal segmentation

RoPE

The highlighted ops form the three pointwise kernels for Q. There's a similar pattern for K that's omitted for simplicity, so there are in total six pointwise kernels. The remaining, meta ops are expectedly segmented out as "no-op" regions.

Note: For now, I omitted the Pads before a Cat for simplicity. Soonish, #2373, pending, may move Pads upstream, affecting segmentation.

There are two main reasons we reached this state.

  1. The segmenter enforces a segment boundary before Slice and Pad (and thus before Cat).
  2. The Squeeze and the red ToFloat were merged too early. This made it impossible to keep merging the red and the green segment, because Squeeze is a transitive input to the Cat and the merge would lead to a cycle. Several sub-reasons contributed to that order:
    • The merging runs in topological order.
    • The segmenter merges one edge at a time. When attempting to merge into the green + its operands, the segmenter only merged the right operand instead of both.
    • A newly merged segment is pushed to the end of the worklist.
@wujingyue
Copy link
Collaborator Author

The attempted fix had a mixed effect on performance: #2815 (comment). So this is still open. I'm reassigning this to @jjsjann123, the POC for RoPE.

@wujingyue wujingyue assigned jjsjann123 and unassigned wujingyue Nov 1, 2024
@jjsjann123
Copy link
Collaborator

We can close this one for now. Since with the preseg passes we are no longer segment across cat any more.

Though suboptimal segmentation still remains an issue for rope, where @naoyam 's scheduler change should be able to resolve that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Ops rope Segmentation Issues related to nvFuser Segmentation Triage
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants