Initial resize scheduler #3556

Open · wants to merge 26 commits into main from resize_scheduler_initial_version

Conversation

@naoyam (Collaborator, Author) commented Dec 10, 2024

This is a very preliminary version of a new scheduler, mainly targeted at RoPE. I will incrementally extend this scheduler to be more flexible and performant, but for now it only handles a fusion that has pointwise ops and a single resize-based tensor op such as SliceOp or PadOp. The scheduling strategy is currently pretty naive too and is manually demonstrated at #3549 and #3555, but the main point is that the inputs of resize-based tensor ops like SliceOp or PadOp no longer need to be fusion inputs.
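To illustrate the kind of fusion this initial version targets, here is a rough nvFuser C++ test-style sketch (the shapes, ops, and the exact slice() overload are my own illustrative choices, not code from this PR):

// Would need the usual nvFuser test headers (e.g., <fusion.h>, <ops/all_ops.h>).
Fusion fusion;
FusionGuard fg(&fusion);

auto tv0 = makeConcreteTensor({16, 128});
fusion.addInput(tv0);

// Pointwise op feeding the resize-based op. Note that tv1, the slice input,
// is an intermediate tensor, not a fusion input.
auto tv1 = sin(tv0);
// The single resize-based tensor op (SliceOp) handled by this initial version.
auto tv2 = slice(tv1, /*start=*/{0, 0}, /*stop=*/{16, 64});
auto tv3 = cos(tv2);
fusion.addOutput(tv3);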

The new scheduler is currently placed after the reduction scheduler and before the transpose and pointwise schedulers:

SchedulerType::ExprEval,
    SchedulerType::NoOp,
    SchedulerType::Matmul,
    SchedulerType::Reduction,
    SchedulerType::Resize, <-- New
    SchedulerType::Transpose,
    SchedulerType::PointWise,
    SchedulerType::InnerPersistent,
    SchedulerType::OuterPersistent,
    SchedulerType::InnerOuterPersistent};

https://github.com/NVIDIA/Fuser/pull/3556/files#diff-c0d261d44c61935fa2d5398f0ac52bd6ea077c6892fb5629c03a425a55fc32f2R64-R74

There are several small changes to some of the existing tests, mainly those on segmentation and alias support, since this new scheduler may change how a fusion is segmented when resize is used. There's one thing I haven't addressed (#3556 (comment)), which I'm tracking with a separate issue.

@naoyam force-pushed the resize_scheduler_initial_version branch 2 times, most recently from 5bde3d4 to 7e7db61 on December 10, 2024 20:05
@@ -4096,64 +4108,85 @@ TEST_F(ResizeTest, PropagateSliceToInputs) {
auto tv0 = makeConcreteTensor(shape);
fusion.addInput(tv0);

auto tv1 = set(tv0);
// Don't use set here as it gets taken by the no-op scheduler
auto tv1 = sin(tv0);
naoyam (Collaborator, Author):
The changes from set to sin or cos are just to avoid the preseg transformation from kicking in.

@naoyam (Collaborator, Author), Dec 10, 2024:

Nothing changed with the tests here (except replacing set with sin and one disabled test); I just extended some of the existing tests to also use the resize scheduler. Not all patterns are supported yet, so unsupported cases just call GTEST_SKIP for now, as sketched below.
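For context, a minimal sketch of how such an extended test might bail out on an unsupported pattern (the guard condition is hypothetical; only GTEST_SKIP itself is from the comment above):

// Hypothetical guard in an extended test; pattern_supported is illustrative.
if (!pattern_supported) {
  GTEST_SKIP() << "Not yet supported by the resize scheduler";
}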

naoyam (Collaborator, Author):
This is just moved from pointwise_utils.h

naoyam (Collaborator, Author):
Just moved from pointwise_utils to domain_map


namespace nvfuser {

bool ResizeScheduler::canScheduleCompileTime(Fusion* fusion) {
naoyam (Collaborator, Author):
In this initial version, I'm trying to make it very restrictive. Will have several follow-up PRs to schedule the whole RoPE module.
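As a rough illustration of what "very restrictive" could mean (this is not the PR's actual canScheduleCompileTime logic; the helper name is made up), a compile-time check might insist on exactly one resize-based tensor op in the fusion:

// Illustrative helper, not part of this PR: accept only fusions containing
// exactly one resize-based tensor op (SliceOp or PadOp).
bool hasSingleResizeOp(Fusion* fusion) {
  int64_t num_resize_ops = 0;
  for (Expr* expr : fusion->exprs()) {
    if (expr->isA<SliceOp>() || expr->isA<PadOp>()) {
      ++num_resize_ops;
    }
  }
  return num_resize_ops == 1;
}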

#include <scheduler/utils.h>

namespace nvfuser {
namespace pointwise_utils {

// DomainMap uses the ComputeAtMap to find a reference TensorView
naoyam (Collaborator, Author):
This part is moved to scheduler/tools/domain_map.h

@@ -29,37 +29,6 @@ namespace {
// Unused at the moment, commenting for clang tidy
constexpr int64_t kThreadX = 128;

class DomainMap : public pointwise_utils::DomainMap {
naoyam (Collaborator, Author):
This part is moved to pointwise_utils.h so that it can also be used from the resize scheduler.

@@ -74,5 +30,44 @@ inline int64_t nRootDims(const TensorView* tv) {
return tv_n_dims;
}

class DomainMap : public scheduler_tools::DomainMap {
naoyam (Collaborator, Author):
This is moved from pointwise.cpp

Base automatically changed from rotation_residual_support to main December 10, 2024 22:46
@@ -432,19 +403,11 @@ std::unique_ptr<PointwiseParams> getPointwiseHeuristics(
return params;
}

// Return reference tensor view.
naoyam (Collaborator, Author):
Just moved to pointwise_utils

};

// Return reference tensor view.
inline TensorView* getReferenceTensor(Fusion* fusion) {
naoyam (Collaborator, Author):
Moved from pointwise.cpp. Also shortened the name a bit (was getReferenceTensorView)

@naoyam (Collaborator, Author) commented Dec 11, 2024:

!test

@@ -520,6 +520,9 @@ TEST_F(AliasTest, AliasOutputBeforeNonAliasOutput) {
testValidate(
executor_cache.fusion(), out_tensors, {in_tensor}, __LINE__, __FILE__);

// TODO: Fix the alias support
naoyam (Collaborator, Author):
This is broken for now. Need to understand how it actually works before this PR.

@@ -959,34 +962,6 @@ TEST_F(AliasTest, SourceIsBothInputAndOutput) {
EXPECT_EQ(in_tensor.data_ptr(), out_tensors[1].data_ptr());
}

TEST_F(AliasTest, SegmentBoundary) {
naoyam (Collaborator, Author):
Probably not relevant as this isn't segmented anymore

const auto num_segments = kernel_runtime->fusionSegments()->groups().size();
NVF_CHECK(num_segments == 3, "Expect 3 segments, got: ", num_segments);
for (const auto& exec : kernel_runtime->executors()) {
EXPECT_EQ(num_segments, 2) << "Expect 2 segments, got: " << num_segments;
naoyam (Collaborator, Author):
This is now just segmented into two kernels

if (!exec->isA<KernelExecutor>()) {
continue;
}
if (kernel_runtime->schedulerHeuristics()
naoyam (Collaborator, Author):
The gmem requirement isn't relevant for the resize scheduler

@naoyam force-pushed the resize_scheduler_initial_version branch from 4ad2ff7 to e8cb381 on December 11, 2024 09:22
@naoyam changed the base branch from main to enable_id_model_for_resize on December 11, 2024 09:22
Comment on lines 40 to 53
TensorView* result = nullptr;
int64_t max_dims = -1;
for (auto output_tv :
ir_utils::filterByType<TensorView>(fusion_->outputs())) {
if (isValidReference(output_tv) &&
hasMinimumSize(output_tv, minimum_num_axes) &&
!output_tv->isFusionInput()) {
int64_t n_dims = pointwise_utils::nRootDims(output_tv);
if (n_dims > max_dims) {
result = output_tv;
max_dims = n_dims;
}
}
}
Collaborator:
Nit: I'd put long implementations like this in the .cpp for faster compilation. Otherwise, it would have to be compiled for each compilation unit that includes this header, which seems unnecessary.

naoyam (Collaborator, Author):
Yeah, agreed, the long compilation time is painful...

slice->in()->getMemoryType() == MemoryType::Global,
"slice input must be in global memory, get: ",
slice->in()->getMemoryType());
EXPECT_EQ(slice->in()->getMemoryType(), MemoryType::Global)
Collaborator:
I'm not sure if this check still makes sense with this PR. Don't you want all slices picked up by the new resize scheduler?

naoyam (Collaborator, Author):
I thought not all segments may be picked up by the resize scheduler yet, but let me check.

@naoyam (Collaborator, Author), Dec 11, 2024:

This fusion is now segmented as:

group details:
g{(resize)
group id: 0
inputs:
  T1_g___half[iS2{i3}] __half
outputs:
  T11_g_float[bS26{1}, iS61{4096}] float
  T17_g___half[iS44{8}rf, iS45{( ceilDiv(i3, 8) )}rf] __half


T17_g___half[iS44{8}rf, iS45{( ceilDiv(i3, 8) )}rf] = view( T1_g___half[iS2{i3}] )
(21)
T8_g___half[iS56{8}, iS18{512}rf]
   = slice( T17_g___half[iS44{8}rf, iS45{( ceilDiv(i3, 8) )}rf], { {0, 8, 1} {0, 512, 1} } )
(22)
T18_g___half[iS59{4096}rf] = view( T8_g___half[iS56{8}, iS18{512}rf] )
(25)
T10_g___half[bS24{1}, iS60{4096}]
   = broadcast( T18_g___half[iS59{4096}rf] )
(26)
T11_g_float[bS26{1}, iS61{4096}]
   = __half2float(T10_g___half[bS24{1}, iS60{4096}]);
(11)
i97 = ceilDiv(i3, 8);
(19)
}

g{(inner_persistent)
group id: 1
inputs:
  T0_g___half[iS0{i0}, iS68{4096}] __half
  T17_g___half[iS44{8}rf, iS45{( ceilDiv(i3, 8) )}rf] __half
outputs:
  T16_g_float[iS40{i0}, iS41{4096}] float


T2_l_float[iS3{i0}, iS69{4096}]
   = __half2float(T0_g___half[iS0{i0}, iS68{4096}]);
(0)
T3_g_float[iS5{i0}, iS70{4096}]
   = expf(T2_l_float[iS3{i0}, iS69{4096}]);
(1)
T4_l_float[iS7{i0}, rS71{4096}]
   = reduction( T3_g_float[iS5{i0}, iS70{4096}], op = add, initial value = float(0), allreduce = false )
(2)
T5_g_float[iS9{i0}, bS10{1}]
   = broadcast( T4_l_float[iS7{i0}, rS71{4096}] )
(3)
T6_g_float[iS11{i0}, iS72{4096}]
   = T3_g_float[iS5{i0}, iS70{4096}]
   / T5_g_float[iS9{i0}, bS10{1}];
(4)
T12_g___half[iS62{8}, iS30{512}rf]
   = slice( T17_g___half[iS44{8}rf, iS45{( ceilDiv(i3, 8) )}rf], { {0, 8, 1} {1536, 2048, 1} } )
(23)
T19_l___half[iS65{4096}rf] = view( T12_g___half[iS62{8}, iS30{512}rf] )
(28)
T14_g___half[bS36{1}, iS66{4096}]
   = broadcast( T19_l___half[iS65{4096}rf] )
(29)
T15_g_float[bS38{1}, iS67{4096}]
   = __half2float(T14_g___half[bS36{1}, iS66{4096}]);
(17)
T16_g_float[iS40{i0}, iS41{4096}]
   = T6_g_float[iS11{i0}, iS72{4096}]
   * T15_g_float[bS38{1}, iS67{4096}];
(18)
}

The inner-persistent segment also has a slice, so the check still makes sense for that segment.

And it looks like there's some inefficient segmentation here. T17 is the output of the first reshape, which is consumed by the two slices. The reshape is done by the first segment, and the reshape output is an output of the first segment. Instead, the second segment should just load T1 again and do the same reshape. That would be more efficient as there would be no store of T17, and the data type of T1 is half, whereas T17 is float.

It's probably related to the uop recomputation. It seems we should consider reshape as a uop for the purpose of segmentation optimization. CC: @liqiangxl

@naoyam (Collaborator, Author) commented Dec 11, 2024:

test_litgpt_variants_gpt_neox_like fails because expr sort can't understand the inlining validity of a loop domain scheduled by setLoopDomainsLike. That seems to be due to squeezed broadcast IDs.

The issue is no longer relevant if the loop generation is also done using IdModel (i.e., enabling the loop option of IdModel too). I'll do that as a follow-up PR, but for now I'm going to further restrict canScheduleCompileTime to avoid the failure.

@naoyam mentioned this pull request on Dec 11, 2024
@jacobhinkle (Collaborator) left a comment:

LGTM

const auto outermost_pos = (int64_t)old2new.size();
ref_tv->flatten(outermost_pos);
ref_tv->split(outermost_pos, 128);
ref_tv->split(outermost_pos, 1 << 14);
Collaborator:
This is a little off-topic but is there any particular reason to split by 16K here? In the pointwise scheduler I think the split is into 64K blocks. Would it be better to just make this a heuristic param so that it could be set to some multiple of the number of SMs?

naoyam (Collaborator, Author):
Nothing in particular. This is just simple, arbitrary scheduling I put here for now. These heuristic parameters would need to go through the tuning process once all the building blocks are in place (much like the matmul scheduler development). A sketch of what promoting them to parameters could look like is below.
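As a sketch of the reviewer's suggestion (the struct and field names below are hypothetical, not this PR's heuristics class), the hard-coded factors could be collected into a heuristic params object, with the block split tied to the SM count:

// Hypothetical parameter struct; names are illustrative only.
struct ResizeSchedulerParams {
  int64_t threads_per_block = 128;
  int64_t blocks_split = 1 << 14; // could instead be some multiple of the SM count
};

// The scheduling above would then read something like:
//   ref_tv->flatten(outermost_pos);
//   ref_tv->split(outermost_pos, params.threads_per_block);
//   ref_tv->split(outermost_pos, params.blocks_split);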

@jacobhinkle (Collaborator):
The multiple uses of the name DomainMap might be a little confusing even though they're in different namespaces/files. You could call one a PointwiseDomainMap or something.

@naoyam (Collaborator, Author) commented Dec 12, 2024:

> The multiple uses of the name DomainMap might be a little confusing even though they're in different namespaces/files. You could call one a PointwiseDomainMap or something.

Yeah, I thought so too, but each is also in its own namespace like pointwise_utils, so it would look like pointwise_utils::PointwiseDomainMap, which I thought might look redundant.

@jjsjann123 (Collaborator) left a comment:
Mechanical changes look straightforward to me. Taking a look at the actual scheduler + tests.


inlineMost();

markAliases(fusion);
Collaborator:
Tagging @wujingyue, I see this one in the pointwise / reduction / no-op schedulers.

Shouldn't this be called after any scheduling is done, in general?

@naoyam (Collaborator, Author) commented Dec 13, 2024:

> The multiple uses of the name DomainMap might be a little confusing even though they're in different namespaces/files. You could call one a PointwiseDomainMap or something.
>
> Yeah, I thought so too, but they are also in its own namespace like pointwise_utils, so it would look like pointwise_utils::PointwiseDomainMap, which I thought may look redundant.

I think redundancy is better than ambiguity, so I changed the name as you suggested, @jacobhinkle.
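For readers following the thread, the resulting layout is roughly as follows (a sketch based on the discussion above; member details omitted):

namespace nvfuser {

namespace scheduler_tools {
// Shared base, moved here from pointwise_utils (scheduler/tools/domain_map.h).
class DomainMap { /* ... */ };
} // namespace scheduler_tools

namespace pointwise_utils {
// Renamed from DomainMap to avoid ambiguity with scheduler_tools::DomainMap.
class PointwiseDomainMap : public scheduler_tools::DomainMap { /* ... */ };
} // namespace pointwise_utils

} // namespace nvfuser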

naoyam added a commit that referenced this pull request Dec 13, 2024
…bolicSizes) (#3578)

Stacked on #3585 

`StmtSort::getStmtsTo` may not grab all active iter domains if IDs are
connected in an unconventional way. For example, we can set the loop
domain of a tensor as a producer of its logical domain, but due to the
nature of `IterVisitor`, such ID dependency patterns are not supported,
meaning `StmtSort::getStmtsTo` would fail to grab all valid IDs and
their exprs.

I just recently noticed this issue while working on #3556; specifically, the
issue got exposed as an inconsistent replacement of extent vals. I've been
experimenting with such patterns of domains, but I hadn't seen this before,
likely because I was using just static-shape tensors for convenience.

To fix the issue, I added a variation of `StmtSort::getStmtsTo`, which
traverses a fusion as usual but stops at TensorViews. For each
TensorView, instead of using `IterVisitor`, it uses
`TensorDomain::getAllStatements()`, which combines both
`TensorDomain::allIDs()` and `TensorDomain::allExprs()`, and traverses
the IDs and exprs in the returned order.

It's a somewhat naive implementation, but I think it's good enough for now,
and I don't have any other immediate idea to try.

I changed `ValReplacementMutator` to use the new interface. That's the
only use for now.

---------

Co-authored-by: Jacob Hinkle <[email protected]>
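As a loose illustration of the commit's idea (this is not the actual new interface; the function name is made up), the per-TensorView part amounts to walking TensorDomain::getAllStatements() instead of deriving IDs and exprs via IterVisitor:

// Illustrative only: visit a TensorView's IDs and ID exprs via
// TensorDomain::getAllStatements(), which returns them in a valid order even
// when a loop domain is set up as a producer of the logical domain.
void visitTensorDomainStatements(TensorView* tv) {
  for (Statement* stmt : tv->domain()->getAllStatements()) {
    // Handle each IterDomain or ID expression here, in the returned order.
  }
}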
@naoyam (Collaborator, Author) commented Dec 13, 2024:

!test

@naoyam (Collaborator, Author) commented Dec 13, 2024:

!test
