Build Segments for User Schedule Segmentation #3334
!test

void FusionDefinition::finalizeSegmentation() {
  // Destroy SegmentedState
  segmentation_state_.reset();
}
A design question: what's the point of separating this into two steps? It doesn't look like anything else happens between `setupSegmentation()` and `finalizeSegmentation()`.
#3335 adds `buildSegmentation()` to translate the CPP segments to their corresponding python definitions. We still need to destroy the `segmentation_state_` in this PR.
* Create SegmentationState
* Move segmentation logic to a separate file
!test
LGTM.
My issues are addressed. Thx
## Overview:
- `buildSegment` creates the CPP Fusion for a given segment id, translates it to a python FusionDefinition, then returns a mapping from the segment fusion state indices to the original fusion state indices.
- `FusionDefinition.segment` calls `setupSegmentation`, `buildSegment`, and `finalizeSegmentation` to create python definitions for the sub-fusions and their index mappings.

## Changes in this PR
This PR implements the `buildSegment` function for user-scheduler segmentation. It is the second PR in a stack, preceded by #3334 and followed by #3025.
1. Implement the `buildSegment` function in `csrc/python_frontend/segmentation.cpp`.
2. Complete the `segment` function in `nvfuser/__init__.py`.

## Example:
### Original Fusion: A reduction + broadcast + pointwise fusion.
```python
def nvfuser_fusion_id1(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Float, is_cpu=False)
    T1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Float, is_cpu=False)
    T2 = fd.ops.sum(T0, dims=[1], keepdim=False, dtype=DataType.Float)
    T3 = fd.ops.broadcast(T2, is_broadcast_dim=[False, True])
    T4 = fd.ops.add(T1, T3)
    fd.add_output(T4)
```

**After Segmentation:**
The reduction scheduler does not support fusing any operations with an inner reduction, so the original fusion is divided into two segments.

## First Segment:
The first segment contains the reduction and broadcast operations, which correspond to [T0, T2, T3] in the original fusion. Only the segment's input and output appear in the map, so the segment index to original index map has two entries.

| Segment Index | Original Index | Description |
| ------------- | -------------- | ----------- |
| T0 | T0 | The first tensor argument for the original fusion. |
| T2 | T3 | The broadcasted reduction tensor, which is this segment's output. |

```python
def nvfuser_fusion_id2(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Float, is_cpu=False)
    T1 = fd.ops.sum(T0, dims=[1], keepdim=False, dtype=DataType.Float)
    T2 = fd.ops.broadcast(T1, is_broadcast_dim=[False, True])
    fd.add_output(T2)
```

## Second Segment:
The second segment is the pointwise addition with the broadcasted reduction. It corresponds to [T1, T3, T4] in the original fusion.

| Segment Index | Original Index | Description |
| ------------- | -------------- | ----------- |
| T0 | T1 | The second tensor argument for the original fusion. |
| T1 | T3 | The broadcasted reduction tensor, which is the output from the first segment. |
| T2 | T4 | The pointwise addition, which is the output for the original fusion. |

```python
def nvfuser_fusion_id3(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Float, is_cpu=False)
    T1 = fd.define_tensor(shape=[-1, 1], contiguity=[True, None], dtype=DataType.Float, is_cpu=False)
    T2 = fd.ops.add(T0, T1)
    fd.add_output(T2)
```
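The two index maps in the tables above can be reproduced with a small, self-contained sketch. It assumes that segment states are numbered in definition order and that only a segment's inputs and outputs appear in the map. `build_index_map` and its arguments are hypothetical names for illustration, not part of nvFuser, and this is not how `buildSegment` computes the mapping internally.

```python
def build_index_map(segment_states, boundary):
    """Map segment-local indices to original indices.

    segment_states: original state indices, listed in the order the segment
        defines them (a hypothetical representation for this sketch).
    boundary: original indices that are inputs or outputs of the segment.
    """
    return {
        segment_index: original_index
        for segment_index, original_index in enumerate(segment_states)
        if original_index in boundary
    }


# First segment: original [T0, T2, T3]; T2 (the sum) is internal to the segment.
first_map = build_index_map([0, 2, 3], boundary={0, 3})
# Second segment: original [T1, T3, T4]; every state is an input or an output.
second_map = build_index_map([1, 3, 4], boundary={1, 3, 4})

print(first_map)   # {0: 0, 2: 3}
print(second_map)  # {0: 1, 1: 3, 2: 4}
```

The original index 3 appears as a value in both maps, which is what lets the output of the first segment be handed to the second.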
## Overview:
The original `FusionDefinition` stores the sequence of sub-fusions and acts as an argument manager. It gathers the input arguments before running each sub-fusion and stores its results. To do this, it uses a map from the segment index space to the original index space. This mapping was generated while creating the python definition for each sub-fusion.

## Changes in this PR
This PR implements the `_execute_segments` function for user-scheduler segmentation. It is the third PR in a stack, preceded by #3334 and #3335.
1. Implement the `_execute_segments` function in `nvfuser/__init__.py` to orchestrate segments in the original fusion.
2. Add the `supports_segmentation` flag to `exec_nvfuser`, so segmentation testing is enabled by default for all python tests.

## Example:
### Original Fusion: A reduction + broadcast + pointwise fusion.
```python
def nvfuser_fusion_id1(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Float, is_cpu=False)
    T1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Float, is_cpu=False)
    T2 = fd.ops.sum(T0, dims=[1], keepdim=False, dtype=DataType.Float)
    T3 = fd.ops.broadcast(T2, is_broadcast_dim=[False, True])
    T4 = fd.ops.add(T1, T3)
    fd.add_output(T4)
```

## Step-by-Step execution of `_execute_segments`
### Step 1 before running any segments.
#### `map_original_fid_to_value` state: 6 entries

| Original Index | Description |
| -------------- | ----------- |
| 0 | The first tensor argument for the original fusion. |
| 1 | The second tensor argument for the original fusion. |
| -1 | Extent of axis 0 for first tensor argument. |
| -2 | Extent of axis 1 for first tensor argument. |
| -3 | Extent of axis 0 for second tensor argument. |
| -4 | Extent of axis 1 for second tensor argument. |

* The extents (-1 through -4) are omitted from the tables in later steps because these segments do not need them. They still count toward the number of entries.

### Step 2 after running the first segment.
#### `map_original_fid_to_value` state: 6 entries

| Original Index | Description |
| -------------- | ----------- |
| 1 | The second tensor argument for the original fusion. |
| 3 | The broadcasted reduction tensor, which is the output from the first segment. |

* Removed the entry for `T0` because the first tensor argument is not required for the second segment.
* Added the entry for `T3`, which is the output from the first segment.

### Step 3 after running the second segment.
#### `map_original_fid_to_value` state: 5 entries

| Original Index | Description |
| -------------- | ----------- |
| 4 | The pointwise addition, which is the output for the original fusion. |

* Removed the entries for `T1` and `T3` because they are no longer needed.
* Added the entry for `T4`, which is the output from the second segment.

### Step 4 after running all segments.
* Return `T4` from `map_original_fid_to_value` as the result for the original fusion.
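The step-by-step walkthrough above can be replayed with a pure-Python sketch. It is only an illustration of the bookkeeping, under several assumptions: tensors are nested lists, the segments are ordinary functions, and `execute_segments_sketch`, `SEGMENTS`, and the tuple layout are hypothetical names, not the actual `_execute_segments` implementation.

```python
def segment_1(t0):
    # Reduction over axis 1, then broadcast back to shape [rows, 1].
    return ([[sum(row)] for row in t0],)


def segment_2(t1, t3):
    # Pointwise add of a [rows, cols] tensor and a [rows, 1] tensor.
    return ([[x + t3[r][0] for x in row] for r, row in enumerate(t1)],)


# (callable, original input indices, original output indices)
SEGMENTS = [(segment_1, [0], [3]), (segment_2, [1, 3], [4])]


def execute_segments_sketch(original_inputs, segments, output_ids):
    values = dict(original_inputs)  # plays the role of map_original_fid_to_value
    key_history = [sorted(values)]
    for position, (run, input_ids, result_ids) in enumerate(segments):
        results = run(*(values[i] for i in input_ids))
        # Keep only what a later segment or the final result still needs.
        still_needed = set(output_ids)
        for _, later_inputs, _ in segments[position + 1:]:
            still_needed.update(later_inputs)
        for i in input_ids:
            if i not in still_needed:
                values.pop(i, None)
        values.update(zip(result_ids, results))
        key_history.append(sorted(values))
    return [values[i] for i in output_ids], key_history


t0 = [[1.0, 2.0], [3.0, 4.0]]
t1 = [[10.0, 20.0], [30.0, 40.0]]
# Negative keys stand in for the extents of the two input tensors.
inputs = {0: t0, 1: t1, -1: 2, -2: 2, -3: 2, -4: 2}
outputs, key_history = execute_segments_sketch(inputs, SEGMENTS, output_ids=[4])

print(outputs[0])                           # [[13.0, 23.0], [37.0, 47.0]]
print([len(keys) for keys in key_history])  # [6, 6, 5]
```

The entry counts 6, 6, and 5 match the three states listed in Steps 1 through 3, with the four extent entries kept throughout.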
General Overview of Segmentation:
Segmentation decomposes a fusion into a directed acyclic graph (DAG) of sub-fusions. After applying the segmentation algorithm, we can translate the sub-fusions into their corresponding python definitions. Then, given the fusion's input arguments, the segments are run in the correct order to produce the output results.
The original FusionDefinition stores the sequence of sub-fusions and acts as an argument manager. It gathers the input arguments before running the sub-fusion and stores its results. To perform this function, it requires a map from the segment index space to the original index space. This mapping is generated while creating the python definition for each sub-fusion.
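As a rough illustration of running a DAG of sub-fusions "in the correct order", the sketch below orders segment ids with Python's standard `graphlib` module. The dependency dictionary is a made-up example, and `execution_order` is a hypothetical helper; nvFuser performs this ordering in C++, so this only shows the idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_order(dependencies):
    """Return segment ids so that every segment follows its producers."""
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical DAG: segment id -> set of segment ids whose outputs it consumes.
dependencies = {0: set(), 1: {0}, 2: {0}, 3: {1, 2}}
order = execution_order(dependencies)
print(order)
```

Any order in which segment 0 comes first and segment 3 comes last is valid here; segments 1 and 2 are independent of each other.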
CPP functions:
- Step 1: `setupSegmentation` runs the segmentation algorithm on the CPP Fusion to create the `SegmentedFusion`. Then, sub-fusions are ordered according to their dependencies by the `prepareGroupOrder` function. It returns the number of segments in `SegmentedFusion`.
- Step 2: `buildSegment` creates the CPP `Fusion` for a given segment id, translates it to a python `FusionDefinition`, then returns a mapping from the segment fusion state indices to the original fusion state indices.
- Step 3: `finalizeSegmentation` destroys any state stored in `FusionDefinition`.

Python functions:
- `setupSegmentation`, `buildSegment`, and `finalizeSegmentation` are called together in `FusionDefinition.segment`.
- If the `FusionDefinition` has segments, `_execute_segments` is called in `FusionDefinition.execute`. The original `FusionDefinition` acts as argument manager, running the sub-fusions in topological order.

Example:
Original Fusion: A reduction + broadcast + pointwise fusion.
After Segmentation:
First Segment:
Second Segment:
Changes in this PR
This PR implements the `setupSegmentation` function for user-scheduler segmentation. It is the first PR in a stack, followed by #3335 and #3025.
1. Create the `SegmentationState` class, which contains all segmentation logic for python-frontend (`csrc/python_frontend/segmentation.h`).
2. `FusionDefinition` contains an instantiation of `SegmentationState` and exposes its logic in a public interface. This interface is added to the python bindings.
3. Add `test_segmentation_reduction_pointwise_epilogue` to test functionality.
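To make the setup / build / finalize lifecycle concrete, here is a minimal Python mock of a definition object that owns a temporary segmentation state. Every name in it (`SegmentationStateSketch`, `FusionDefinitionSketch`, the snake_case methods, and the precomputed `index_maps` argument) is an assumption made for illustration; the real logic lives in C++ and runs the segmentation algorithm rather than receiving maps as input.

```python
class SegmentationStateSketch:
    """Stand-in for SegmentationState; holds precomputed per-segment index maps."""

    def __init__(self, index_maps):
        self.index_maps = index_maps


class FusionDefinitionSketch:
    def __init__(self):
        self._segmentation_state = None

    def setup_segmentation(self, index_maps):
        # Create the state and report how many segments it holds.
        self._segmentation_state = SegmentationStateSketch(index_maps)
        return len(index_maps)

    def build_segment(self, segment_id):
        if self._segmentation_state is None:
            raise RuntimeError("call setup_segmentation first")
        return self._segmentation_state.index_maps[segment_id]

    def finalize_segmentation(self):
        # Mirrors segmentation_state_.reset(): drop the temporary state.
        self._segmentation_state = None

    def segment(self, index_maps):
        # The three steps are always called together, in this order.
        count = self.setup_segmentation(index_maps)
        maps = [self.build_segment(i) for i in range(count)]
        self.finalize_segmentation()
        return maps


fd = FusionDefinitionSketch()
maps = fd.segment([{0: 0, 2: 3}, {0: 1, 1: 3, 2: 4}])
print(maps)  # [{0: 0, 2: 3}, {0: 1, 1: 3, 2: 4}]
```

Keeping setup and finalize as separate steps leaves room for per-segment work between them, which is the role `buildSegment` takes in the follow-up PR.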