Add initial_loop_domain_ to TensorDomain #2987

Merged
merged 1 commit into from
Sep 23, 2024
4 changes: 4 additions & 0 deletions csrc/ir/interface_nodes.h
@@ -190,6 +190,10 @@ class NVF_API TensorView : public Val {
return domain()->loop();
};

const std::vector<IterDomain*>& getInitialLoopDomain() const {
return domain()->initialLoop();
};

// If allocation domain exists in domain() return it, otherwise return
// logical domain
const std::vector<IterDomain*>& getMaybeAllocationDomain() const {
14 changes: 14 additions & 0 deletions csrc/ir/internal_base_nodes.h
@@ -568,11 +568,21 @@ class TensorDomain : public Val {
return loop_domain_;
}

const std::vector<IterDomain*>& initialLoop() const {
return initial_loop_domain_;
}

// Check if id is a loop ID.
bool isLoop(const IterDomain* id) const {
return std::find(loop().begin(), loop().end(), id) != loop().end();
}

// Check if id is an initial loop ID.
bool isInitialLoop(const IterDomain* id) const {
return std::find(initialLoop().begin(), initialLoop().end(), id) !=
initialLoop().end();
}

// Get all IDs that are on the shortest path between any of the domains
// (logical domain, root domain, loop domain, allocation domain) following
// definition and uses path. Return values are topologically ordered and
@@ -695,6 +705,10 @@ class TensorDomain : public Val {
const std::vector<IterDomain*> logical_domain_;
std::vector<IterDomain*> allocation_domain_;
std::vector<IterDomain*> loop_domain_;
// Initial loop domain. Loop domain is updated with transformations
// such as split, but the initial loop domain can only change with
// setLoopDomain
std::vector<IterDomain*> initial_loop_domain_;
std::vector<IterDomain*> additional_ids_;

std::vector<IterDomain*> no_bcast_domain_;
16 changes: 7 additions & 9 deletions csrc/ir/nodes.cpp
@@ -3044,6 +3044,7 @@ TensorDomain::TensorDomain(
logical_domain_(std::move(logical_domain)),
allocation_domain_(std::move(allocation_domain)),
loop_domain_(std::move(loop_domain)),
initial_loop_domain_(loop_domain_),
contiguity_(
contiguity.empty() ? getContiguityFilledWith(maybeAllocation(), false)
: std::move(contiguity)) {
@@ -3073,6 +3074,7 @@ TensorDomain::TensorDomain(IrBuilderPasskey passkey, const TensorDomain* src)
logical_domain_(src->logical_domain_),
allocation_domain_(src->allocation_domain_),
loop_domain_(src->loop_domain_),
initial_loop_domain_(src->initial_loop_domain_),
additional_ids_(src->additional_ids_),
no_bcast_domain_(src->no_bcast_domain_),
no_reduction_domain_(src->no_reduction_domain_),
@@ -3085,6 +3087,7 @@ TensorDomain::TensorDomain(const TensorDomain* src, IrCloner* ir_cloner)
logical_domain_(ir_cloner->clone(src->logical_domain_)),
allocation_domain_(ir_cloner->clone(src->allocation_domain_)),
loop_domain_(ir_cloner->clone(src->loop_domain_)),
initial_loop_domain_(ir_cloner->clone(src->initial_loop_domain_)),
additional_ids_(ir_cloner->clone(src->additional_ids_)),
no_bcast_domain_(ir_cloner->clone(src->no_bcast_domain_)),
no_reduction_domain_(ir_cloner->clone(src->no_reduction_domain_)),
@@ -3614,6 +3617,7 @@ void TensorDomain::setLoopDomain(std::vector<IterDomain*> new_loop_domain) {
". Logical: ",
toDelimitedString(logical_domain_));
loop_domain_ = std::move(new_loop_domain);
initial_loop_domain_ = loop_domain_;
Collaborator

Should we just append everything in loop_domain_ here to additional_ids_? In allIDs, we might want to do:

for (auto i : c10::irange(all_domains.size() - 1)) {
  for (auto j : c10::irange(all_domains.size() - 1)) {

instead of

for (auto i : c10::irange(all_domains.size() - 1)) {
  for (auto j : c10::irange(i + 1, all_domains.size())) {
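To make the difference between the two loop shapes concrete, the sketch below enumerates the index pairs each one visits. It uses plain `std::size_t` loops in place of `c10::irange`, and the helper names `suggestedPairs` and `existingPairs` are hypothetical, introduced only for this illustration. How each pair maps onto a traversal call inside `allIDs` is not shown in this diff, so this is a reading of the snippets as written, not of the surrounding implementation.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using IndexPairs = std::vector<std::pair<std::size_t, std::size_t>>;

// Hypothetical helper: pairs visited by the suggested loops, where both
// i and j range over [0, n - 1). Every ordered pair among the first
// n - 1 entries is visited, including (i, i), and index n - 1 is never used.
IndexPairs suggestedPairs(std::size_t n) {
  IndexPairs pairs;
  for (std::size_t i = 0; i + 1 < n; ++i) {
    for (std::size_t j = 0; j + 1 < n; ++j) {
      pairs.emplace_back(i, j);
    }
  }
  return pairs;
}

// Hypothetical helper: pairs visited by the existing loops, where i ranges
// over [0, n - 1) and j over [i + 1, n). Each unordered pair is visited
// exactly once, always with i < j.
IndexPairs existingPairs(std::size_t n) {
  IndexPairs pairs;
  for (std::size_t i = 0; i + 1 < n; ++i) {
    for (std::size_t j = i + 1; j < n; ++j) {
      pairs.emplace_back(i, j);
    }
  }
  return pairs;
}
```

With six domains, the suggested form visits 25 ordered pairs among the first five entries, while the existing form visits 15 pairs across all six. Visiting both (i, j) and (j, i) only matters if the traversal between two domains is not symmetric in its arguments.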

Collaborator Author

I did think about it, but since additional_ids_ are just "additional", it seems to me it could be useful in some cases. For example, we could cache all IDs between logical and initial_loop to speed up TensorDomain::allIDs.

Collaborator

additional_ids_ was the mechanism designed to allow the loop domain to have a new broadcast, and it exists to solve almost exactly the same problem that this PR solves. Moving forward, it looks like we will not continue in the "allow loop domain to have a new broadcast" direction, but I don't want to have two mechanisms for the same thing. We should pick one of them, and only one. It makes sense to speed up TensorDomain::allIDs, and it might also make sense to call the data structure used for the speedup additional_ids_, but it would be a new thing that is totally unrelated to today's additional_ids_ except that they happen to share a name.

Collaborator

@zasdfgbnm zasdfgbnm Sep 23, 2024

Which mechanism to pick I believe is mostly a question of interface design. Today, we have:

void TensorDomain::broadcast(int64_t axis, Val* extent) {
  axis = nvfuser::wrapDim(axis, nDims() + 1);
  IterDomain* id = IterDomainBuilder(fusion()->zeroVal(), extent)
                       .iter_type(IterType::Broadcast)
                       .build();
  loop_domain_.insert(loop_domain_.begin() + axis, id);
  additional_ids_.push_back(id);
}

That creates a new broadcast on the loop domain. But if we just do setLoopDomain(something with manually created new broadcasts), we will fail.

And in this PR, we are not using an interface like TensorDomain::broadcast, so it will not have the problem of the above mechanism. But if we just do a setLoopDomain(getLoopDomain()), we will fail.

In summary, I believe neither the additional_ids_ mechanism nor the one in this PR is safe. The additional_ids_ is more convenient for the TensorDomain::broadcast-like interface, that is, we have a method function that says "add this new ID to this axis". And the one in this PR is more convenient for setLoopDomain way of inserting new IDs.

I think my question for now is: do we want to pause a bit to think about interface design, or do we want to just quickly unblock for now and do some cleanup later?
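The `setLoopDomain(getLoopDomain())` concern can be illustrated with a toy model. Everything below is a stand-in written for this note, not nvFuser code: `ToyDomain`, `recordedIDs`, and `extraIdSurvivesRoundTrip` are hypothetical names, IDs are plain strings, and `recordedIDs` is simply the union of the stored domains, whereas the real `allIDs` also walks expressions between domains. The only behavior taken from this PR is that `setLoopDomain` resets the initial loop domain while `split` leaves it alone.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for TensorDomain. Not nvFuser code.
struct ToyDomain {
  std::vector<std::string> logical;
  std::vector<std::string> loop;
  std::vector<std::string> initial_loop;

  // As in this PR: setting the loop domain also resets the initial loop domain.
  void setLoopDomain(std::vector<std::string> new_loop) {
    loop = std::move(new_loop);
    initial_loop = loop;
  }

  // Replace loop[axis] with an outer and an inner ID. Only the current
  // loop domain changes; the initial loop domain is left untouched.
  void split(std::size_t axis, const std::string& factor) {
    const std::string id = loop.at(axis);
    loop[axis] = id + "/" + factor;
    loop.insert(loop.begin() + static_cast<std::ptrdiff_t>(axis) + 1, factor);
  }

  // Simplification: the union of the stored domains.
  std::set<std::string> recordedIDs() const {
    std::set<std::string> ids(logical.begin(), logical.end());
    ids.insert(initial_loop.begin(), initial_loop.end());
    ids.insert(loop.begin(), loop.end());
    return ids;
  }
};

// Returns true if the extra ID "i1" is still recorded after the loop
// domain is set to its own current value.
bool extraIdSurvivesRoundTrip() {
  ToyDomain td;
  td.logical = {"i0"};
  td.setLoopDomain({"i0", "i1"});  // "i1" is not reachable from logical
  td.split(1, "4");                // loop: [i0, i1/4, 4]
  td.setLoopDomain(td.loop);       // initial loop is overwritten
  return td.recordedIDs().count("i1") == 1;
}
```

In this toy, the pre-split ID `i1` is recorded only through the initial loop domain, so overwriting that domain drops it. Whether the real `setLoopDomain` rejects the call earlier or silently loses the ID is not something the sketch can show.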

Collaborator Author

Per our offline discussion, we decided that it's more important to move forward and make some progress before worrying too much about interface design. See also #3000

resetDomains();
}

@@ -3630,17 +3634,11 @@ void TensorDomain::setAllocationDomain(
}

std::vector<IterDomain*> TensorDomain::allIDs() const {
- // loop_domain_ must be the first domain since loop domains are
- // allowed to have extra domains that may not exist in other
- // domains and IRBFS::getExprsBetween is not symmetric with respect
- // to its two domain parameters. For example, it can find all exprs
- // from a loop domain to a logical domain but may miss from logical
- // to loop. See NVFuserTest.AllIDsWithExtraLoopIDs for a concrete
- // example.
- std::array<const std::vector<IterDomain*>*, 5> all_domains = {
- &loop_domain_,
+ std::array<const std::vector<IterDomain*>*, 6> all_domains = {
  &logical_domain_,
  &root_domain_,
+ &initial_loop_domain_,
+ &loop_domain_,
  &allocation_domain_,
  &additional_ids_};
VectorOfUniqueEntries<IterDomain*> discovered_ids;
82 changes: 81 additions & 1 deletion tests/cpp/test_gpu3.cpp
@@ -6513,7 +6513,7 @@ TEST_F(NVFuserTest, CompareLogicalAndLoopDomains) {
"Not all logical IDs are covered by loop domain")));
}

- TEST_F(NVFuserTest, AllIDsWithExtraLoopIDs) {
+ TEST_F(NVFuserTest, AllIDsWithExtraLoopIDs1) {
Fusion fusion;
FusionGuard fg(&fusion);

@@ -6582,6 +6582,86 @@ TEST_F(NVFuserTest, AllIDsWithExtraLoopIDs) {
EXPECT_EQ(tv2_all_id_set, tv2_all_ids_ref);
}

TEST_F(NVFuserTest, AllIDsWithExtraLoopIDs2) {
Fusion fusion;
FusionGuard fg(&fusion);

// [i0, i1]
auto tv0 = makeSymbolicTensor(2);
fusion.addInput(tv0);
// [i0]
auto tv1 = makeSymbolicTensor(1);
fusion.addInput(tv1);

// [i0]
auto tv2 = set(tv1);
// [i0, b1]
auto tv3 = broadcast(tv2, {false, true});
// [i0, i1]
auto tv4 = add(tv0, tv3);
fusion.addOutput(tv4);

// Set the loop domain of tv2 the same as tv4. The new loop domain
// includes an ID that is not reachable from tv2 logical domain
auto tv2_inner_loop_domain =
tv4->getLoopDomain().at(1)->cloneWithoutRFactor();
std::vector<IterDomain*> tv2_initial_loop_domain{
tv2->getLogicalDomain().at(0), tv2_inner_loop_domain};
tv2->setLoopDomain(tv2_initial_loop_domain);

// Schedule only the extra domain
tv2->split(1, 4);
auto tv2_split = tv2->axis(1)->definition();

// tv2 logical: [i0]
// split(i1) -> i1/4, 4
// tv2 loop: [i0, i1/4, 4]
//
// All IDs: [i0, i1, i1/4, 4]

EXPECT_EQ(tv2->getInitialLoopDomain(), tv2_initial_loop_domain);

// Because the split only uses the extra ID, getExprsBetween from
// the loop domain to the logical domain does not traverse the
// split, just returning an empty vector.
EXPECT_TRUE(
IRBFS::getExprsBetween(
{tv2->getLoopDomain().begin(), tv2->getLoopDomain().end()},
{tv2->getLogicalDomain().begin(), tv2->getLogicalDomain().end()},
false)
.empty());

// From the initial loop to the current loop should find the split expr
auto exprs_between = IRBFS::getExprsBetween(
{tv2->getInitialLoopDomain().begin(), tv2->getInitialLoopDomain().end()},
{tv2->getLoopDomain().begin(), tv2->getLoopDomain().end()},
false);
EXPECT_EQ(exprs_between.size(), 1);
EXPECT_EQ(exprs_between.front().first, tv2_split);

// The initial loop domain and the current loop domain should be
// reachable to each other with no redundancy
auto tv2_loop_domain_comparison_results = ir_utils::compareDomains(
tv2->getInitialLoopDomain(), tv2->getLoopDomain());
EXPECT_FALSE(tv2_loop_domain_comparison_results.dom0_has_unreachable_ids);
EXPECT_FALSE(tv2_loop_domain_comparison_results.dom1_has_unreachable_ids);

// Make sure allIDs finds all the IDs including the extra IDs
std::unordered_set<IterDomain*> tv2_all_ids_ref;
tv2_all_ids_ref.insert(
tv2->getLogicalDomain().begin(), tv2->getLogicalDomain().end());
tv2_all_ids_ref.insert(
tv2->getInitialLoopDomain().begin(), tv2->getInitialLoopDomain().end());
tv2_all_ids_ref.insert(
tv2->getLoopDomain().begin(), tv2->getLoopDomain().end());

auto tv2_all_ids = tv2->domain()->allIDs();
std::unordered_set<IterDomain*> tv2_all_id_set(
tv2_all_ids.begin(), tv2_all_ids.end());

EXPECT_EQ(tv2_all_id_set, tv2_all_ids_ref);
}

// Repro for issue #236 (https://github.com/NVIDIA/Fuser/issues/236)
TEST_F(NVFuserTest, DoublePrecisionNorm_CUDA) {
auto fusion = std::make_unique<Fusion>();