group norm segmented into pointwise + persistent + pointwise #2375
Comments
I wonder why
Due to
There's only one normalization in
Looks like thunder did some optimizations of the captured graph. Specifically, it does
Thanks for the suggestion @jjsjann123 and @kevinstephano. nvFuser gets two reshapes if
If we change these two reshapes to
So the proposed plan is a two-step approach:
Adding the original related issue: Lightning-AI/lightning-thunder#468
Unfortunately, the current alias analysis wasn't built for tracking aliases involving intermediates. When I wrote it, I didn't see a strong case for dealing with intermediates and expected intermediates to be fused into a kernel and become cheap index calculations. However, in this case, it appears that the two reshapes have caused the normalization scheduler to bail out... #2405 sounds like a reasonable extension -- add segment_set after input reshapes and before output reshapes. Caveat: there are some tricky patterns that we'll need to consider, e.g., if the original input is used by an operation other than the reshape, we probably shouldn't segment out the reshape. But we can leave these important implementation details to PR discussion. That being said, is the normalization scheduler supposed to handle the two reshapes? That feels like the right solution to me. However, if it would take a long time, I would totally understand the need for a quicker workaround.
Thanks for the write-up, @liqiangxl! The problem looks quite clear to me even though I was unsure about the solution.
It should be able to handle the first reshape (which only includes a split of an ID), but handling the second reshape (which merges an iteration ID and a reduction ID) needs a lot of work. So another option is to extend the current reduction/normalization scheduler to handle some kinds of reshapes, so the pre-segment optimization pass only needs to process specific types of reshape, e.g., a reshape that includes a merge of an iteration ID and a reduction ID.
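To make the distinction concrete, here is a rough plain-PyTorch sketch of the two reshapes (shapes and names below are illustrative, not taken from the actual trace):

```python
import torch

N, C, H, W, G = 2, 128, 16, 16, 32
x = torch.randn(N, C, H, W)

# First reshape: only *splits* C into (G, C/G) -- the split-only case the
# reduction scheduler could be taught to accept.
x5d = x.reshape(N, G, C // G, H, W)

# The normalization reduces over (C/G, H, W), so those IDs map to reduction IDs.
mean = x5d.mean(dim=(2, 3, 4), keepdim=True)

# Second reshape: *merges* the iteration ID G with C/G, which maps to a
# reduction ID of the normalization -- the case that needs more work.
y = (x5d - mean).reshape(N, C, H, W)
```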
@jjsjann123, @wujingyue, and @naoyam
Trying to catch up on what's going on here, but still confused about why these two are not fused:
Doesn't the second reshape belong to the third segment?
LGTM. I'll defer to others on whether we should fix the scheduler(s) to accept both types of reshapes. That sounds like the right fix to me even though it may take longer.
Rejected by `reductionInterferingView`. For example, a tv with [{i0}, {32}, {i2/32}, {i3}, {i4}], where bold marks the reduction dims.
I think we can safely skip these checks if the reshape only involves splits.
Yes.
I missed this part. I was expecting nvFuser, not Thunder, to do this. It sounds like a specific code motion to work around an nvFuser limitation. Doing this in nvFuser makes nvFuser standalone, not relying on a specific code pattern upstream.
Thanks for the quick feedback. I'll check with other stakeholders and may schedule a short sync meeting.
I see. I was only thinking about ... Have you thought about using IdModel to rewrite `reductionInterferingView`?
Agree that the proper approach is for nvFuser to handle graph-level optimization and re-order some trivial reshapes to allow better fusion. We can discuss the priority of that. Meanwhile, the last reshape in Thunder might be an easier thing to change with some slightly ugly code: https://github.com/Lightning-AI/lightning-thunder/blob/14e6c9b67eb038ab28a192cd381bc183b77e8f81/thunder/torch/__init__.py#L4406-L4420
… segmentation issue #2375 (#2405)
This PR is step-2 to solve #2375. It allows an output to alias an intermediate when: (1) it is called from a pre-segment pass, and (2) the `op` that produces `output` interferes with reduction; see `outputInterferingReduction`.
After this PR, the last reshape in group norm (which produces the output tensor) is changed to a no-op; see the newly added test `OutputAliasIntermediate`.
The util function `getRepresentativeReductionTv` is added for convenience and will be used to simplify reduction/normalization schedulers.
---------
Co-authored-by: Jingyue Wu <[email protected]>
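As a plain-PyTorch illustration of why this last reshape can become a no-op once the output is allowed to alias the intermediate (shapes below are hypothetical):

```python
import torch

# A reshape of a contiguous tensor is a metadata-only view: the result shares
# storage with its producer, so no extra kernel or copy is needed.
t = torch.randn(2, 32, 4, 16, 16)    # normalized intermediate, [N, G, C/G, H, W]
v = t.reshape(2, 128, 16, 16)        # "output" reshape back to [N, C, H, W]
assert v.data_ptr() == t.data_ptr()  # same storage -> output aliases the intermediate
```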
…ops has ID merges (#2583)
**Background** This PR is separated from #2437 by removing all commits that use IdModels. I separated it out so we can merge it into the main branch and move on to step-2 of the group norm issue. The work to rewrite `reductionInterferingView` using IdModels will be continued in #2437.
**---- PR description ----**
**Issue:** #2375
**Fix:** When the reshape transformations from root domains to logical domains consist only of splits, it is safe for the reduction scheduler to merge any two reduction domains, so we can skip the `reductionInterferingView` check.
**Results:** With this PR, the reshape of the input ([N,C,H,W] -> [N, G, C/G, H, W]) can be fused with the normalization into one kernel. Group norm splits into 2 kernels instead of 3.
**Following work:** (1) It also seems safe if the reshape transformations from root domains to logical domains contain a merge but all of the merged IDs are mapped to reduction dims or to iteration dims. Continued in #2437.
The remaining issue is the channels-last format: the current reduction/normalization scheduler doesn't consider allocation domains.
Group norm is calculated as:
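Roughly, in PyTorch terms (the shapes, names, and dtypes below are illustrative, not the exact apex code):

```python
import torch

def group_norm_ref(x, weight, bias, num_groups, eps=1e-5):
    # x: [N, C, H, W]; cast + reshape to [N, G, C/G, H, W]
    N, C, H, W = x.shape
    xr = x.float().reshape(N, num_groups, C // num_groups, H, W)
    # normalization: reduce over (C/G, H, W) within each group
    mean = xr.mean(dim=(2, 3, 4), keepdim=True)
    var = xr.var(dim=(2, 3, 4), unbiased=False, keepdim=True)
    xn = (xr - mean) / torch.sqrt(var + eps)
    # reshape back to [N, C, H, W], then apply per-channel scale & bias
    y = xn.reshape(N, C, H, W) * weight.view(1, C, 1, 1) + bias.view(1, C, 1, 1)
    return y.to(x.dtype)
```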
Due to the two reshapes, the normalization scheduler rejects the unsegmented fusion, and it is then segmented into three sub-fusions:
(1) pointwise: cast + reshape
(2) normalization
(3) pointwise: reshape + scale & bias
Reproduce (modified from the apex group norm implementation and the apex group norm test):
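A rough sketch of the kind of repro, assuming a plain `torch.nn.GroupNorm` run through `thunder.jit` (the shapes, dtype, and trace inspection below are assumptions, not the exact script):

```python
import torch
import thunder

# Hypothetical repro sketch: shapes, dtype, and module setup are assumptions.
N, C, H, W, G = 2, 128, 32, 32, 32
x = torch.randn(N, C, H, W, device="cuda", dtype=torch.bfloat16, requires_grad=True)
gn = torch.nn.GroupNorm(G, C).to(device="cuda", dtype=torch.bfloat16)

jitted = thunder.jit(gn)
out = jitted(x)

# Inspect the final execution trace to see how the nvFuser regions were segmented.
print(thunder.last_traces(jitted)[-1])
```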