Initial resize scheduler #3556
base: main
Conversation
Force-pushed 5bde3d4 to 7e7db61
@@ -4096,64 +4108,85 @@ TEST_F(ResizeTest, PropagateSliceToInputs) {
   auto tv0 = makeConcreteTensor(shape);
   fusion.addInput(tv0);

-  auto tv1 = set(tv0);
+  // Don't use set here as it gets taken by the no-op scheduler
+  auto tv1 = sin(tv0);
The changes from `set` to `sin` or `cos` are just to avoid the preseg transformation from kicking in.
Nothing changed with the tests here (except replacing `set` with `sin`, and one disabled test); I just extended some of the existing tests to use the resize scheduler as well. Not all patterns are supported yet, so they just call `GTEST_SKIP` for now.
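For reference, a minimal gtest sketch of that skip pattern; the capability flag is a placeholder for illustration, not nvFuser code:

```cpp
#include <gtest/gtest.h>

// Minimal sketch of the skip pattern described above (not actual test
// code from this PR): when the resize scheduler cannot handle a pattern
// yet, the test bails out with GTEST_SKIP() rather than failing.
TEST(ResizeSchedulerSketch, UnsupportedPatternIsSkipped) {
  const bool pattern_supported = false;  // placeholder capability check
  if (!pattern_supported) {
    GTEST_SKIP() << "Pattern not yet supported by the resize scheduler";
  }
  // Scheduling and validation for supported patterns would run here.
}
```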
csrc/scheduler/tools/domain_map.h
Outdated
This is just moved from pointwise_utils.h
csrc/scheduler/tools/domain_map.cpp
Outdated
Just moved from pointwise_utils to domain_map
namespace nvfuser {

bool ResizeScheduler::canScheduleCompileTime(Fusion* fusion) {
In this initial version, I'm trying to make it very restrictive. Will have several follow-up PRs to schedule the whole RoPE module.
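As a rough illustration of what "very restrictive" means here, a self-contained toy sketch; the enum and helper are illustrative stand-ins, not nvFuser's IR or the actual check:

```cpp
#include <vector>

// Toy stand-ins, not nvFuser's IR: the real check inspects Fusion IR.
enum class OpKind { Pointwise, Slice, Pad, Other };

// Accept only fusions made of pointwise ops plus exactly one
// resize-based op (slice or pad), mirroring the scope stated above.
bool canScheduleResizeSketch(const std::vector<OpKind>& ops) {
  int num_resize_ops = 0;
  for (OpKind op : ops) {
    switch (op) {
      case OpKind::Pointwise:
        break;  // always acceptable
      case OpKind::Slice:
      case OpKind::Pad:
        ++num_resize_ops;
        break;
      default:
        return false;  // reductions etc. are rejected in this initial version
    }
  }
  return num_resize_ops == 1;
}
```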
#include <scheduler/utils.h>

namespace nvfuser {
namespace pointwise_utils {

// DomainMap uses the ComputeAtMap to find a reference TensorView
This part is moved to scheduler/tools/domain_map.h
@@ -29,37 +29,6 @@ namespace {
// Unused at the moment, commenting for clang tidy
constexpr int64_t kThreadX = 128;

class DomainMap : public pointwise_utils::DomainMap {
This part is moved to pointwise_utils.h so that it can also be used from the resize scheduler.
csrc/scheduler/pointwise_utils.h
Outdated
@@ -74,5 +30,44 @@ inline int64_t nRootDims(const TensorView* tv) {
  return tv_n_dims;
}

class DomainMap : public scheduler_tools::DomainMap {
This is moved from pointwise.cpp
@@ -432,19 +403,11 @@ std::unique_ptr<PointwiseParams> getPointwiseHeuristics(
  return params;
}

// Return reference tensor view.
Just moved to pointwise_utils
csrc/scheduler/pointwise_utils.h
Outdated
};

// Return reference tensor view.
inline TensorView* getReferenceTensor(Fusion* fusion) {
Moved from pointwise.cpp. Also shortened the name a bit (was `getReferenceTensorView`).
!test
tests/cpp/test_alias.cpp
Outdated
@@ -520,6 +520,9 @@ TEST_F(AliasTest, AliasOutputBeforeNonAliasOutput) {
  testValidate(
      executor_cache.fusion(), out_tensors, {in_tensor}, __LINE__, __FILE__);

// TODO: Fix the alias support
This is broken for now. I need to understand how it actually worked before this PR.
@@ -959,34 +962,6 @@ TEST_F(AliasTest, SourceIsBothInputAndOutput) {
  EXPECT_EQ(in_tensor.data_ptr(), out_tensors[1].data_ptr());
}

TEST_F(AliasTest, SegmentBoundary) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not relevant as this isn't segmented anymore
  const auto num_segments = kernel_runtime->fusionSegments()->groups().size();
- NVF_CHECK(num_segments == 3, "Expect 3 segments, got: ", num_segments);
+ EXPECT_EQ(num_segments, 2) << "Expect 2 segments, got: " << num_segments;
  for (const auto& exec : kernel_runtime->executors()) {
This is now just segmented into two kernels.
if (!exec->isA<KernelExecutor>()) {
  continue;
}
if (kernel_runtime->schedulerHeuristics()
The gmem requirement isn't relevant for the resize scheduler
Force-pushed 4ad2ff7 to e8cb381
csrc/scheduler/pointwise_utils.h
Outdated
TensorView* result = nullptr;
int64_t max_dims = -1;
for (auto output_tv :
     ir_utils::filterByType<TensorView>(fusion_->outputs())) {
  if (isValidReference(output_tv) &&
      hasMinimumSize(output_tv, minimum_num_axes) &&
      !output_tv->isFusionInput()) {
    int64_t n_dims = pointwise_utils::nRootDims(output_tv);
    if (n_dims > max_dims) {
      result = output_tv;
      max_dims = n_dims;
    }
  }
}
Nit: I'd put long implementations like this into a .cpp file for faster compilation. Otherwise, it has to be compiled in each compilation unit that includes this header, which seems unnecessary.
Yeah, agreed, the long compilation time is painful...
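A minimal sketch of that header/source split, with simplified, hypothetical names:

```cpp
// domain_map.h -- declaration only (sketch; names simplified)
#include <cstdint>

namespace scheduler_tools {

class TensorView;  // forward declaration is enough for the header

class DomainMap {
 public:
  // Defined in domain_map.cpp, so the loop over fusion outputs is
  // compiled once instead of in every translation unit that includes
  // this header.
  TensorView* findReferenceTensorView(int64_t minimum_num_axes = 0) const;
};

}  // namespace scheduler_tools
```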
tests/cpp/test_gpu3.cpp
Outdated
-      slice->in()->getMemoryType() == MemoryType::Global,
-      "slice input must be in global memory, get: ",
-      slice->in()->getMemoryType());
+  EXPECT_EQ(slice->in()->getMemoryType(), MemoryType::Global)
I'm not sure if this check still makes sense with this PR. Don't you want all slices picked up by the new resize scheduler?
I thought not all segments may be picked up by the resize scheduler yet, but let me check.
This fusion is now segmented as:
group details:
g{(resize)
group id: 0
inputs:
T1_g___half[iS2{i3}] __half
outputs:
T11_g_float[bS26{1}, iS61{4096}] float
T17_g___half[iS44{8}rf, iS45{( ceilDiv(i3, 8) )}rf] __half
T17_g___half[iS44{8}rf, iS45{( ceilDiv(i3, 8) )}rf] = view( T1_g___half[iS2{i3}] )
(21)
T8_g___half[iS56{8}, iS18{512}rf]
= slice( T17_g___half[iS44{8}rf, iS45{( ceilDiv(i3, 8) )}rf], { {0, 8, 1} {0, 512, 1} } )
(22)
T18_g___half[iS59{4096}rf] = view( T8_g___half[iS56{8}, iS18{512}rf] )
(25)
T10_g___half[bS24{1}, iS60{4096}]
= broadcast( T18_g___half[iS59{4096}rf] )
(26)
T11_g_float[bS26{1}, iS61{4096}]
= __half2float(T10_g___half[bS24{1}, iS60{4096}]);
(11)
i97 = ceilDiv(i3, 8);
(19)
}
g{(inner_persistent)
group id: 1
inputs:
T0_g___half[iS0{i0}, iS68{4096}] __half
T17_g___half[iS44{8}rf, iS45{( ceilDiv(i3, 8) )}rf] __half
outputs:
T16_g_float[iS40{i0}, iS41{4096}] float
T2_l_float[iS3{i0}, iS69{4096}]
= __half2float(T0_g___half[iS0{i0}, iS68{4096}]);
(0)
T3_g_float[iS5{i0}, iS70{4096}]
= expf(T2_l_float[iS3{i0}, iS69{4096}]);
(1)
T4_l_float[iS7{i0}, rS71{4096}]
= reduction( T3_g_float[iS5{i0}, iS70{4096}], op = add, initial value = float(0), allreduce = false )
(2)
T5_g_float[iS9{i0}, bS10{1}]
= broadcast( T4_l_float[iS7{i0}, rS71{4096}] )
(3)
T6_g_float[iS11{i0}, iS72{4096}]
= T3_g_float[iS5{i0}, iS70{4096}]
/ T5_g_float[iS9{i0}, bS10{1}];
(4)
T12_g___half[iS62{8}, iS30{512}rf]
= slice( T17_g___half[iS44{8}rf, iS45{( ceilDiv(i3, 8) )}rf], { {0, 8, 1} {1536, 2048, 1} } )
(23)
T19_l___half[iS65{4096}rf] = view( T12_g___half[iS62{8}, iS30{512}rf] )
(28)
T14_g___half[bS36{1}, iS66{4096}]
= broadcast( T19_l___half[iS65{4096}rf] )
(29)
T15_g_float[bS38{1}, iS67{4096}]
= __half2float(T14_g___half[bS36{1}, iS66{4096}]);
(17)
T16_g_float[iS40{i0}, iS41{4096}]
= T6_g_float[iS11{i0}, iS72{4096}]
* T15_g_float[bS38{1}, iS67{4096}];
(18)
}
The inner-persistent segment also has a slice, so the check still makes sense for that segment.
It also looks like there's some inefficient segmentation here. `T17` is the output of the first reshape, which is consumed by the two slices. The reshape is done by the first segment, and the reshape output is an output of the first segment. Instead, the second segment should just load `T1` again and do the same reshape. That would be more efficient as there would be no store of `T17`, and the data type of `T1` is half, whereas `T17` is float.

It's probably related to the uop recomputation. It seems we should consider reshape as a uop for the purpose of segmentation optimization. CC: @liqiangxl
The issue is no longer relevant if the loop generation is also done using IdModel (i.e., enabling the …).
LGTM
const auto outermost_pos = (int64_t)old2new.size();
ref_tv->flatten(outermost_pos);
ref_tv->split(outermost_pos, 128);
ref_tv->split(outermost_pos, 1 << 14);
This is a little off-topic but is there any particular reason to split by 16K here? In the pointwise scheduler I think the split is into 64K blocks. Would it be better to just make this a heuristic param so that it could be set to some multiple of the number of SMs?
Nothing particular. This is just a random simple scheduling I put here for now. These heuristics parameters would need to go through the tuning process once all the building blocks are in place (much like the matmul scheduler development).
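A sketch of what lifting those constants into a heuristic parameter struct might look like, reusing the `flatten`/`split` calls quoted above; the struct and function names are made up for illustration:

```cpp
// Hypothetical params struct; the defaults are the constants in this PR.
struct ResizeParams {
  int64_t threads_per_block = 128;       // inner split factor
  int64_t outer_split_factor = 1 << 14;  // could become a multiple of the SM count
};

// Same scheduling steps as quoted above, with the factors made tunable.
void scheduleReferenceSketch(
    TensorView* ref_tv,
    int64_t outermost_pos,
    const ResizeParams& params) {
  ref_tv->flatten(outermost_pos);
  ref_tv->split(outermost_pos, params.threads_per_block);
  ref_tv->split(outermost_pos, params.outer_split_factor);
}
```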
The multiple uses of the name …
Yeah, I thought so too, but each is also in its own namespace, like …
The mechanical changes look straightforward to me. Taking a look at the actual scheduler + tests.
inlineMost();

markAliases(fusion);
Tagging @wujingyue; I see this one in the pointwise / reduction / no-op schedulers. Shouldn't this be called after any schedule is done, in general?
I think redundancy is better than ambiguity, so I changed the name as you suggested, @jacobhinkle.
…bolicSizes) (#3578)

Stacked on #3585.

`StmtSort::getStmtsTo` may not grab all active iter domains if IDs are connected in an unconventional way. For example, we can set the loop domain of a tensor as a producer of its logical domain, but due to the nature of `IterVisitor`, such ID dependency patterns are not supported, meaning `StmtSort::getStmtsTo` would fail to grab all valid IDs and their exprs.

I just recently noticed this issue while working on #3556; specifically, it got exposed as an inconsistent replacement of extent vals. I've been experimenting with such patterns of domains, but I hadn't seen this before, likely because I was using just static-shape tensors for convenience.

To fix the issue, I added a variation of `StmtSort::getStmtsTo`, which traverses a fusion as usual but stops at TensorViews. For each TensorView, instead of using `IterVisitor`, it uses `TensorDomain::getAllStatements()`, which combines both `TensorDomain::allIDs()` and `TensorDomain::allExprs()`, and traverses the IDs and exprs in the returned order. It's a bit of a naive implementation, but I think it is good enough for now, and I don't have any other immediate idea to try.

I changed `ValReplacementMutator` to use the new interface. That's the only use for now.

Co-authored-by: Jacob Hinkle <[email protected]>
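A toy sketch of that traversal change in plain C++, with no nvFuser types; the structs and function are illustrative only:

```cpp
#include <string>
#include <vector>

// Toy sketch (not nvFuser code) of the fix described in the commit
// message above: when the traversal reaches a tensor, it splices in the
// tensor's own pre-collected, ordered list of iter-domain statements
// (analogous to TensorDomain::getAllStatements()) instead of visiting
// through them with a generic visitor that can miss unconventional
// dependencies, such as a loop domain producing a logical domain.
struct Statement {
  std::string name;
};

struct Tensor {
  Statement self;
  // The tensor knows every ID and ID-expr statement it owns, in a
  // deterministic order, even ones a dependency-based visitor would skip.
  std::vector<Statement> all_domain_statements;
};

std::vector<Statement> getStmtsToSketch(const std::vector<Tensor>& producers) {
  std::vector<Statement> sorted;
  for (const Tensor& tv : producers) {
    // Stop at the tensor: take its own statement list verbatim...
    sorted.insert(
        sorted.end(),
        tv.all_domain_statements.begin(),
        tv.all_domain_statements.end());
    // ...then emit the tensor itself.
    sorted.push_back(tv.self);
  }
  return sorted;
}
```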
!test
!test
This is a very preliminary version of a new scheduler, mainly targeted at RoPE. I will incrementally extend this scheduler to be more flexible and performant, but for now it only handles a fusion that has pointwise ops and a single `Resize`-based tensor op such as `SliceOp` or `PadOp`. The scheduling strategy is currently pretty naive too and is manually demonstrated at #3549 and #3555, but the main point is that resize-based tensor ops like `SliceOp` or `PadOp` no longer need to have their inputs as fusion inputs.

The new scheduler is currently placed after the reduction scheduler and before the transpose and pointwise schedulers (see the sketch below):
https://github.com/NVIDIA/Fuser/pull/3556/files#diff-c0d261d44c61935fa2d5398f0ac52bd6ea077c6892fb5629c03a425a55fc32f2R64-R74

There are several small changes to some of the existing tests, mainly those on segmentation and alias support, since this new scheduler may change how a fusion is segmented when resize is used. There's one thing I haven't addressed (#3556 (comment)), which I'm tracking with a separate issue.
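For orientation, a hypothetical sketch of that dispatch order; the enum and list are assumptions for illustration, not nvFuser's actual definitions:

```cpp
#include <vector>

// Hypothetical names: the real codebase has its own scheduler-type enum
// and priority list; this only illustrates the ordering described above.
enum class SchedulerType {
  NoOp,
  Reduction,
  Resize,     // the new scheduler, tried after reduction...
  Transpose,  // ...and before transpose and pointwise
  PointWise,
};

// Schedulers are tried in priority order; the first one whose
// compile-time/run-time checks pass claims the fusion (segment).
const std::vector<SchedulerType> kSchedulerPriorityOrder = {
    SchedulerType::NoOp,
    SchedulerType::Reduction,
    SchedulerType::Resize,
    SchedulerType::Transpose,
    SchedulerType::PointWise,
};
```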