Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LV] Ignore some costs when loop gets fully unrolled #106699

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

igogo-x86
Copy link
Contributor

When fixed-width VF equals the number of iterations, comparison instruction and induction operation will be DCEed later. Ignoring the costs of these instructions improves the cost model.

I have one benchmark that significantly improves when vectorised with fixed VF instead of scalable on AArch64 due to full unrolling. But I am not sure how to ignore the costs properly. This patch, as it is, makes new and legacy cost models out of sync and compiler fails with assertion error on this line:

assert((BestFactor.Width == LegacyVF.Width ||
 planContainsAdditionalSimplifications(getPlanFor(BestFactor.Width),
 BestFactor.Width, CostCtx,
 OrigLoop, CM)) &&
 " VPlan cost model and legacy cost model disagreed");

Identifying variables to ignore in LoopVectorizationCostModel::getInstructionCost is much more challenging.

Is the approach shown in the patch ok, and can I wait for the legacy cost model to be removed? If not, how can it be done differently?

@llvmbot
Copy link
Member

llvmbot commented Aug 30, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Igor Kirillov (igogo-x86)

Changes

When fixed-width VF equals the number of iterations, comparison instruction and induction operation will be DCEed later. Ignoring the costs of these instructions improves the cost model.

I have one benchmark that significantly improves when vectorised with fixed VF instead of scalable on AArch64 due to full unrolling. But I am not sure how to ignore the costs properly. This patch, as it is, makes new and legacy cost models out of sync and compiler fails with assertion error on this line:

assert((BestFactor.Width == LegacyVF.Width ||
 planContainsAdditionalSimplifications(getPlanFor(BestFactor.Width),
 BestFactor.Width, CostCtx,
 OrigLoop, CM)) &&
 " VPlan cost model and legacy cost model disagreed");

Identifying variables to ignore in LoopVectorizationCostModel::getInstructionCost is much more challenging.

Is the approach shown in the patch ok, and can I wait for the legacy cost model to be removed? If not, how can it be done differently?


Full diff: https://github.com/llvm/llvm-project/pull/106699.diff

1 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+20)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 6babfd1eee9108..7dc9a60cb0f034 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7112,6 +7112,26 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
         continue;
       IVInsts.push_back(CI);
     }
+
+    // If the given VF loop gets fully unrolled, ignore the costs of comparison
+    // and increment instruction, as they'll get simplified away
+    auto TC = CM.PSE.getSE()->getSmallConstantTripCount(OrigLoop);
+    auto *Cmp = OrigLoop->getLatchCmpInst();
+    if (Cmp && VF.isFixed() && VF.getFixedValue() == TC) {
+      CostCtx.SkipCostComputation.insert(Cmp);
+      for (Instruction *IVInst : IVInsts) {
+        bool IsSimplifiedAway = true;
+        for (auto *UIV : IVInst->users()) {
+          if (!Legal->isInductionVariable(UIV) && UIV != Cmp) {
+            IsSimplifiedAway = false;
+            break;
+          }
+        }
+        if (IsSimplifiedAway)
+          CostCtx.SkipCostComputation.insert(IVInst);
+      }
+    }
+
     for (Instruction *IVInst : IVInsts) {
       if (CostCtx.skipCostComputation(IVInst, VF.isVector()))
         continue;

@igogo-x86
Copy link
Contributor Author

Ping

@fhahn
Copy link
Contributor

fhahn commented Sep 12, 2024

IIUC this should catch cases where the vector loop only executes a single iteration, right?

Ideally we would first remove the redundant compare-and-branch and induction recipes, and then rely on the VPlan-based cost model to consider this difference.

For the former, I just shared #108378 to do that. For the latter, I think we first need to change the cost of the exit condition to be computed by the VPlan-based cost model, instead of pre-computing it.

As you mentioned, this is leading to divergences between legacy and VPlan-based cost model. Ideally we would first migrate computing costs for all recipes to the VPlan-based cost model, ensuring the don't regress vs the legacy cost model, which is still in the works.

@igogo-x86
Copy link
Contributor Author

As far as I understand, optimizeForVFAndUF runs after the cost model is calculated and taken into account to choose the best vectorisation factor:

  VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);

Is optimizeForVFAndUF eventually going to be run for all VFs before the cost model decision? If so, now VPlan implies all possible VF. So, will it generate new VPLans if optimizeForVFAndUF changes something?

@igogo-x86 igogo-x86 force-pushed the lv-ignore-some-costs-fully-unrolled branch from 9a5cbbc to 5bff837 Compare October 21, 2024 14:05
@igogo-x86 igogo-x86 changed the title [WIP][LV] Ignore some costs when loop gets fully unrolled [LV] Ignore some costs when loop gets fully unrolled Oct 21, 2024
@igogo-x86
Copy link
Contributor Author

@fhahn I updated the patch to synchronise both models. Also, ping on my previous questions

Copy link
Contributor

@david-arm david-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea is good, i.e. teach the vectoriser about exit costs when the branch and compare will be removed. However, I think you've assumed that the induction variables have no other uses and so have almost certainly underestimated the costs in some loops. Also, I noticed that this patch doesn't remove the branch cost either. It's possible that removing the branch and comparison costs are enough to get what you want? It might be easier than analysing the users of all the incremented induction variables.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved
SmallPtrSet<const Value *, 16> ValuesToIgnoreForVF;
auto TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
auto *Cmp = TheLoop->getLatchCmpInst();
if (Cmp && TC == VF.getKnownMinValue()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment you're also including scalable VFs here and I think the only way we can remove the compare, branch and increment for a scalable VF is if:

  1. We are not tail-folding. Have you checked that we still remove the branch and compare for scalable VFs?
  2. We are tail-folding. In this case you can actually check if TC <= VF.getKnownMinValue(). If you still want to enable this optimisation for tail-folding too then it makes sense to try <= instead for fixed-width and scalable VFs and see if we also remove the branch and compare.

It might be worth pulling out some of the common code in both expectedCost and precomputeCosts to move into a static function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out scalable VF isn't that simple. Branch, next-iteration-induction, and comparison are simplified but only with tail-folding. So, I assume we have to keep only fixed-width FV, as I don't see how I can check the type of prediction here. Branch cost is zero, so I don't do anything about it.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved
@igogo-x86 igogo-x86 force-pushed the lv-ignore-some-costs-fully-unrolled branch from 5bff837 to a9b9cad Compare November 13, 2024 14:39
Copy link
Contributor

@david-arm david-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing previous comments @igogo-x86! I have a few more comments - I'm most worried about the changes to the RISCV tail-folding tests.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved
entry:
br label %for.body

for.cond.cleanup: ; preds = %for.body
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename the exit block name here to something more readable? For example, for.cond.cleanup -> ext and it would be nice to move this after the loop, i.e.

entry:
  br label %for.body

for.body:
...

exit:
...

Also, I think in the exit block you should be able to write:

exit:
  ret i64 %add

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it would be good to pre-commit the test or move it into it's own commit within this branch. That way we can see the effect your code changes have on the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll pre-commit them once I'll be more sure that the patch will go through :)

; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP7]]
; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[TMP8]], i32 0
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr [[TMP9]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
; CHECK-NEXT: [[TMP10:%.*]] = shl <vscale x 4 x i8> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 4 x i8> insertelement (<vscale x 4 x i8> poison, i8 1, i64 0), <vscale x 4 x i8> poison, <vscale x 4 x i32> zeroinitializer)
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p0(ptr [[TMP9]], i32 1, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i8> poison)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't have expected this test to be changed by your patch. I think your code changes need an extra check that we're not tail-folding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's also bothering me. But where I'm adding this code, I can't see how to understand if scalable vectorisation is available and whether it would be tail-folded or not.

Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, optimizeForVFAndUF runs after the cost model is calculated and taken into account to choose the best vectorisation factor:

  VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);

Is optimizeForVFAndUF eventually going to be run for all VFs before the cost model decision? If so, now VPlan implies all possible VF. So, will it generate new VPLans if optimizeForVFAndUF changes something?

Eventually yes, eventually there should be no need to special case this. VPlans are valid for a range, that could be split. I guess for now doing it as workaround in precomputeCosts seems OK, even though it is not really intended for that (ideally we only remove stuff from there at this point). Could you add a comment to make sure this is not seen as encouragement to add more code that's not directly related to the transition to the VPlan-based cost model?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved
@@ -7224,6 +7259,14 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
continue;
IVInsts.push_back(CI);
}

// If with the given VF loop gets fully unrolled, ignore the costs of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If with the given VF loop gets fully unrolled, ignore the costs of
// If the vector loop gets executed exactly once with the given VF ....

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp Outdated Show resolved Hide resolved
Comment on lines 2673 to 2674
if (!Cmp)
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the early return here, IV's could still be simplified?

Would be good to have a test case for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See test_two_ivs test

Comment on lines 2676 to 2666
for (const auto &[IV, IndDesc] : IL) {
// Get next iteration value of the induction variable
Instruction *IVInst =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we completely ignore any IV and update instruction, if the start is a constant? All uses of the update will be replaced by a constant for fixed VFs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which instructions could that be? We have only PHINodes in the IV chain unprocessed, but they already have zero cost

// Check that this value used only to exit the loop
for (auto *UIV : IVInst->users()) {
if (UIV != IV && UIV != Cmp) {
IsSimplifiedAway = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to have a test case for this

Copy link
Contributor Author

@igogo-x86 igogo-x86 Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See test_external_iv_user test

@igogo-x86 igogo-x86 force-pushed the lv-ignore-some-costs-fully-unrolled branch from a9b9cad to fbb939a Compare November 15, 2024 16:42
When VF equals the number of iterations, comparison instruction
and induction operation will be DCEed later. Ignoring the costs of these
instructions improves the cost model.
* Fixing comments
* Adding more tests
* Remove cmp latch presence requirements
@igogo-x86 igogo-x86 force-pushed the lv-ignore-some-costs-fully-unrolled branch from fbb939a to 166f376 Compare November 19, 2024 12:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants