Merge up-cast, ops, down-cast sequences as minimal units of segments #3699
Conversation
!test
!test --diff
!test --pybench-full
!test
PR Reviewer Guide 🔍 (Review updated until commit 3b96de7) Here are some key observations to aid the review process:
@jjsjann123 confirmed there's no performance regression with the existing benchmarks. There are a small number of cases that are indeed improved by this PR.

@jjsjann123 Please review the PR. I just updated it with some minor cleanups. No change to the logic. I thought about what kinds of tests we could have, but I don't have a good idea.

As long as all the existing tests work functionally with no perf regression, it seems good enough to me.
@xwang233 regarding the review thing: the implementation change in SegmentCandidateFinder::findSegments is mistakenly identified as a signature change. I'm not sure if this is just a category given by the LLM, so I don't know if there's any actionable item on this one.
It's not a preset category. Perhaps you can help rewrite and find the best prompt given to the LLM here: Fuser/.github/workflows/pull.yml, lines 68 to 70 in 0d0402f.
csrc/fusion_segmenter.cpp
Outdated
  }

  return std::make_pair(
      primDataTypeSize(*inp_prim_type), primDataTypeSize(*out_prim_type));
I think you need a bail-out here when either inp_prim_type or out_prim_type is PrimDataType::Index, for which primDataTypeSize would throw an exception. I think welford would have that. I only know that because I hit it in the presegmentation passes. :)
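A minimal sketch of the suggested bail-out, assuming the surrounding helper returns a std::optional pair of byte sizes (the actual return type and early-exit path may differ):

```cpp
// PrimDataType::Index has no fixed size until the index type is resolved,
// and primDataTypeSize would throw for it, so skip the analysis in that case.
if (*inp_prim_type == PrimDataType::Index ||
    *out_prim_type == PrimDataType::Index) {
  return std::nullopt;
}

return std::make_pair(
    primDataTypeSize(*inp_prim_type), primDataTypeSize(*out_prim_type));
```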
I feel we could use a test for that. We don't have to do it in this PR though.
Thanks. Updated.
Added a test
That looks like a pretty reasonable prompt. So it is just the model not properly identifying what's a signature change in the diff. 😠
🚢
!test
Background
This PR attempts to improve fusion segmentation by using a common pattern we see in fusions given by Thunder. Specifically, it is quite common to see a pattern like below:
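As a concrete illustration, here is a minimal sketch of such a fusion, assuming the nvFuser C++ API (Fusion, FusionGuard, makeSymbolicTensor, castOp, add; includes omitted). The tensor names are made up for illustration:

```cpp
Fusion fusion;
FusionGuard fg(&fusion);

// Two half-precision inputs.
auto tv0 = makeSymbolicTensor(2, DataType::Half);
auto tv1 = makeSymbolicTensor(2, DataType::Half);
fusion.addInput(tv0);
fusion.addInput(tv1);

auto tv2 = castOp(DataType::Float, tv0); // up-cast to fp32
auto tv3 = castOp(DataType::Float, tv1); // up-cast to fp32
auto tv4 = add(tv2, tv3);                // arithmetic in fp32
auto tv5 = castOp(DataType::Half, tv4);  // down-cast back to fp16

fusion.addOutput(tv5);
```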
Here, the half-precision inputs are upcast to float for the arithmetic op, and then the result is cast back to half. This pattern of upcasting to float, doing some arithmetic ops, and then casting back to half is quite common. The above example is just a trivial fusion, but larger, more complex fusions can have many sequences of ops like this.
Problem
In some of the RoPE cases, I have observed inefficient segmentation results where a large fusion is segmented right after an up-cast op like the one shown above. Suppose the above pattern appears in a large fusion and there's no scheduler that can accept the whole fusion without segmentation. The fusion segmenter may segment the fusion like below:
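Continuing the hypothetical sketch from the Background section, the kind of split described here puts a segment boundary right after the up-casts, so the fp32 tensors become the intermediates passed between segments (the grouping below is illustrative, not the output of an actual segmenter run):

```cpp
// --- segment 1 ends right after the up-casts ---
auto tv2 = castOp(DataType::Float, tv0); // up-cast (fp16 -> fp32)
auto tv3 = castOp(DataType::Float, tv1); // up-cast (fp16 -> fp32)
// tv2 and tv3 are materialized as fp32 segment outputs.

// --- segment 2 picks up the arithmetic op and the down-cast ---
auto tv4 = add(tv2, tv3);                // arithmetic in fp32
auto tv5 = castOp(DataType::Half, tv4);  // down-cast back to fp16
```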
Ideally, since this is just a trivial arithmetic op, the up-cast, ops, down-cast sequence should be considered an atomic unit and should never be split apart. That is, it should either be kept entirely within a single segment together with its neighboring ops, or at least form a segment of its own.
The latter is not as good as the first option, as there are two intermediate tensors between the segments, but generally, it is likely we can get better segmentations by considering these patterns as non-decomposable atomic operations.
Proposal
This PR tries to exploit the above observation by adding another pre-merge step in the fusion segmenter. Specifically, MergeUpAndDownCast::run analyzes a given fusion, tries to detect simple patterns like the above, and sees if they can be merged into a segment. The result of this initial merging is then used by the existing merge process.
The pattern that the current analysis considers is basically an up-cast op followed by some other ops until a down-cast op. When merging these ops, we need to be careful not to violate the DAG property of the segmented fusion. To guarantee the property, the current analysis stops when an expr has a consumer tensor that is used by multiple ops. So, for example, given a fusion like the following:
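The sketch below shows the kind of fusion meant here, using the same assumed API as above; the down-casts and output handling are illustrative additions:

```cpp
auto T0_fp16 = makeSymbolicTensor(2, DataType::Half);
fusion.addInput(T0_fp16);

auto T1_fp32 = castOp(DataType::Float, T0_fp16); // up-cast
auto T2_fp32 = sin(T1_fp32);                     // first consumer of T1_fp32
auto T3_fp32 = cos(T1_fp32);                     // second consumer of T1_fp32
auto T4_fp16 = castOp(DataType::Half, T2_fp32);  // down-cast
auto T5_fp16 = castOp(DataType::Half, T3_fp32);  // down-cast

fusion.addOutput(T4_fp16);
fusion.addOutput(T5_fp16);
```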
For this fusion, we are not trying to do anything, as T1_fp32 is used by both the sin and cos ops. This is a limitation, but it is not urgent at this moment for RoPE, so I decided to make it as simple as possible.
When an expr has multiple producers, we traverse all of them, and as long as those producers are only used by the expr, we keep looking for an up-cast operation. If a producer has multiple uses, guaranteeing the DAG property becomes non-trivial, which I'm not addressing in this PR.
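A rough sketch of that producer check (a hypothetical helper, not the actual MergeUpAndDownCast code; it assumes the nvFuser IR accessors Expr::inputs(), Val::uses(), and ir_utils::filterByType):

```cpp
// Returns true if the backward traversal can continue past `expr` while
// searching for the up-cast: every producer tensor of `expr` must be
// consumed only by `expr`, otherwise merging across it could violate the
// DAG property between segments.
bool producersUsedOnlyByExpr(Expr* expr) {
  for (auto producer : ir_utils::filterByType<TensorView>(expr->inputs())) {
    if (producer->uses().size() != 1) {
      return false; // producer has other consumers; stop the traversal here
    }
  }
  return true;
}
```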
Results
Here are the per-segment times of the Mistral forward RoPE measured using ncu on an H100:

Here are the per-segment times with this PR:
The total times are roughly 130 us vs 80 us.