
Lowering vectorized pad #3261

Merged 51 commits on Nov 5, 2024
Conversation

@jjsjann123 (Collaborator) commented Oct 23, 2024

Added support for lowering `TernaryOp::where` with a vectorization factor.

i.e.

predicate
  ? loadGlobalToLocal<...>(&dst[0], &src[i_src])
  : dst.set(0.0f) 

Currently this can only be done via manual scheduling; the follow-up PR #3321 on vectorization analysis will apply this automatically.
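The lowered pattern above can be illustrated with a small host-side sketch (a hypothetical stand-in, not the actual generated CUDA: `paddedVectorLoad` mimics `loadGlobalToLocal` on the in-bounds path and `dst.set(0.0f)` on the padded path):

```cpp
#include <array>
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical sketch of the predicated vectorized load: if the whole
// vector lies inside the source buffer, copy it in one contiguous shot
// (standing in for loadGlobalToLocal); otherwise skip the global load
// and materialize the pad value (standing in for dst.set(0.0f)).
template <int kVecWidth>
void paddedVectorLoad(
    std::array<float, kVecWidth>& dst,
    const std::vector<float>& src,
    long srcIndex,
    float padValue) {
  const bool inBounds =
      srcIndex >= 0 && srcIndex + kVecWidth <= static_cast<long>(src.size());
  if (inBounds) {
    // Vectorized path: one contiguous copy of kVecWidth elements.
    std::memcpy(dst.data(), src.data() + srcIndex, sizeof(float) * kVecWidth);
  } else {
    // Padded path: no load, just fill with the pad value.
    dst.fill(padValue);
  }
}
```

The key point of the lowering is that the predicate selects between an entire vectorized load and a vectorized fill, rather than predicating element by element.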

csrc/codegen.cpp Outdated
@@ -402,6 +402,52 @@ class CudaKernelGenerator : private kir::ConstIrVisitor {
}
}

void generateVectorizedLdSt(Val* in, Val* out, CacheOp cache_op) {
@jjsjann123 (Collaborator, Author) commented:

Mechanical change: this is lifted from `void handle(const LoadStoreOp* ldst) final` so it can be shared with the `TernaryOp` handling.

@jjsjann123 jjsjann123 marked this pull request as ready for review November 4, 2024 16:50
@jjsjann123 (Collaborator, Author) commented:

!test

@jjsjann123 (Collaborator, Author) commented:

!test

@@ -4041,4 +4041,37 @@ TEST_F(ResizeTest, SliceSliceConcatConcat) {
NVF_CHECK(ref.equal(cg_outputs[0]));
}

// manual scheduling that should have vectorized load on padded inputs.
TEST_F(ResizeTest, VectorizePadLowering) {
A collaborator commented:

Should we have a test for vectorizing `where` without using `pad`?

@jjsjann123 (Collaborator, Author) replied:

Good call. I almost forgot that we have `where` directly 🤕

csrc/codegen.cpp Outdated
@@ -1001,6 +1051,50 @@ class CudaKernelGenerator : private kir::ConstIrVisitor {
}

void handle(const TernaryOp* top) final {
// Get vectorization information
A collaborator commented:

Can you add some comments about the expectation? IIUC, only in2 is allowed to be vectorized, but technically speaking, it should be possible to have vectorized loads in both in2 and in3, right? Not sure if it's worthwhile to allow that as well, although the required change seems minimal.

@jjsjann123 (Collaborator, Author) replied:

Yes, we can have in2 / in3 as TensorViews. I'm trying to add that, since @zasdfgbnm mentioned having a `where` test.

@jjsjann123 (Collaborator, Author) commented:

!test

@naoyam (Collaborator) left a review:

LGTM

@jjsjann123 jjsjann123 merged commit 62bd3b5 into main Nov 5, 2024
47 checks passed
@jjsjann123 jjsjann123 deleted the jjsjann123/resize_vec branch November 5, 2024 16:51
@jjsjann123 jjsjann123 restored the jjsjann123/resize_vec branch November 12, 2024 20:08
jjsjann123 added a commit that referenced this pull request Nov 13, 2024
Add **conditional** support for resize in vectorization analysis. This PR allows vectorized loads on `PadOp` directly, without using a cache load, which improves the performance of the generated kernel.

What's in this PR:

1. Add a propagation rule for resize in vectorization analysis. The rule works as follows:
   i. For a supported resize: a) project the resize op onto the frontier and clear `(frontier.begin(), resize_position)`; b) add the projected extent of the new resize op as `gcd(id_from, resize_op->leftExpand(), resize_op->rightExpand())`.
   ii. For an unsupported resize: clear `[frontier.begin(), resize_position]`; no behavior change.

2. Update `TensorView::cacheAfter` to opt in a chosen set of uses to caching while leaving other uses unchanged. This is necessary when an input is consumed by a `PadOp` as well as by other operations that rely on a cached load for vectorization.

Follow up to #3261.
Part of the work on supporting RoPE performance. [Design doc](https://docs.google.com/document/d/1tafRMNIXMmHlIGAiNlaPkYp6mZAzJ2Rh_NtARHbmYNA/edit?disco=AAABYEnV_ZY)

---------

Co-authored-by: Naoya Maruyama <[email protected]>