
Adding resize(PadOp) vectorization analysis #3321

Merged: jjsjann123 merged 100 commits from jjsjann123/pad_vec_analysis into main on Nov 13, 2024

Conversation


@jjsjann123 jjsjann123 commented Oct 31, 2024

Adds conditional support for resize in vectorization analysis. This PR allows vectorized loads on PadOp directly, without going through a cached load, which improves the performance of the generated kernel.

What's in this PR:

  1. Add a propagation rule for resize in vectorization analysis (a minimal sketch follows this list). The propagation rule works as follows:
    i. For a supported resize: a) project the resize op onto the frontier and clear (frontier.begin(), resize_position); b) add the projected extent of the new resize op as gcd(id_from, resize_op->leftExpand(), resize_op->rightExpand()).
    ii. For an unsupported resize: clear [frontier.begin(), resize_position]; no behavior change.

  2. Update TensorView::cacheAfter to opt in a set of uses to cache while leaving the other uses unchanged. This is necessary when an input is consumed by a PadOp as well as by other operations that rely on a cached load for vectorization.
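
A minimal standalone sketch of the supported-resize rule from item 1 (illustrative only, not the actual vectorize_helper.cpp code; the function and variable names here are made up):

#include <cstdint>
#include <iostream>
#include <numeric> // std::gcd

// For a supported resize (e.g. from a PadOp), the extent we can still vectorize
// across is limited by both pad amounts, so the projected extent becomes
// gcd(incoming projected extent, left expand, right expand).
int64_t projectResizeExtent(
    int64_t projected_extent,
    int64_t left_expand,
    int64_t right_expand) {
  return std::gcd(projected_extent, std::gcd(left_expand, right_expand));
}

int main() {
  // Example: an innermost extent of 32 padded by 4 on each side still allows a
  // vectorization factor of 4; padding by 3 on one side drops it to 1.
  std::cout << projectResizeExtent(32, 4, 4) << "\n"; // 4
  std::cout << projectResizeExtent(32, 3, 4) << "\n"; // 1
}

Taking the gcd with both expand factors is presumably what keeps a vectorized access from straddling the boundary introduced by the pad.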

Follow-up to #3261.
Part of the work on RoPE performance. Design doc:

Review threads on:
csrc/tensor_view.cpp
csrc/preseg_passes/move_pad.cpp
tests/cpp/test_resize.cpp
csrc/scheduler/vectorize_helper.cpp

@naoyam naoyam left a comment


Overall it looks good. I'd just like the few things I commented on to be addressed.

@naoyam

naoyam commented Nov 8, 2024

!test --pybench

@naoyam

naoyam commented Nov 8, 2024

Initiated testing with python benchmarks just in case.

@jjsjann123

Thanks, I'll address the issues you brought up, and also run through some real-size problems so we get a sense of the perf impact. 🙇

@jjsjann123

!test --pybench

@jjsjann123

!test --pybench

@jjsjann123

Took a quick look at the perf. The end-to-end time looks very noisy, and I'm a bit unsure about my measuring script, so I instead used nsys to measure the kernel time.

On A100 80GB PCIe, peak bandwidth is 2TB/s.

  • At bsz 256 with bf16, we are looking at an IO size of roughly 512MB ((256 * 1024 * 16 * 32 * 2 * 2 + 1024 * 8 * 2 * 2) / 1024 / 1024). The main-branch kernel time is roughly 510µs, achieving ~1TB/s. With vectorization, we bring it down to 368µs, ~1.39TB/s (a quick sanity check of the arithmetic is sketched below).
  • At bsz 516 with bf16, main gives 1013µs, ~1.01TB/s. With vectorization, 732µs, ~1.40TB/s.
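
As a sanity check on the arithmetic above, a small C++ snippet (the shapes mirror the measuring script below; the kernel times are the nsys numbers quoted above, and MB/µs is used as a rough stand-in for TB/s):

#include <cstdint>
#include <cstdio>

int main() {
  // Bytes moved at bsz 256: x is read and written back (bf16 = 2 bytes, read + write),
  // plus the small cos/sin tensors.
  const int64_t io_bytes =
      256LL * 1024 * 16 * 32 * 2 * 2 + 1024LL * 8 * 2 * 2;
  const double io_mb = static_cast<double>(io_bytes) / 1024.0 / 1024.0;
  std::printf("IO size: ~%.0f MB\n", io_mb);                       // ~512 MB
  // Kernel times from nsys, in microseconds (main vs. this PR).
  std::printf("main (510 us):       ~%.2f TB/s\n", io_mb / 510.0); // ~1.00
  std::printf("vectorized (368 us): ~%.2f TB/s\n", io_mb / 368.0); // ~1.39
}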

Something like this vvv.

import torch
import thunder
def rope_one_entry(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, rope_n_elem: int) -> torch.Tensor:
    x_rope = x[..., : rope_n_elem]
    x1 = x_rope[..., : rope_n_elem // 2]  # (B, nh, T, hs/2)
    x2 = x_rope[..., rope_n_elem // 2 :]  # (B, nh, T, hs/2)
    rotated = torch.cat((-x2, x1), dim=-1)  # (B, nh, T, hs)
    roped = (x_rope * cos) + (rotated * sin)
    roped = roped.to(dtype=x.dtype)
    return torch.cat((roped, x[..., rope_n_elem :]), dim=-1)
dtype = torch.bfloat16
device = "cuda"
bsz = 256
block_size = 1024
n_head = 16
head_size = 32
n_query_groups = 4
rope_n_elem = 8
WARMUP_ITER = 5
MEASURE_ITER = 20
cos = torch.randn(block_size, rope_n_elem, device=device, dtype=dtype)
sin = torch.randn(block_size, rope_n_elem, device=device, dtype=dtype)
thunder_rope_one = thunder.jit(rope_one_entry, executors=("nvfuser",), nv_enable_bookend=False)
x = torch.randn([bsz, n_head, block_size, head_size], device=device, dtype=dtype)
# ref full run
o_ref = rope_one_entry(x.float(), cos.float(), sin.float(), rope_n_elem).to(dtype=dtype)
l2_clear_buffer = torch.empty(80, 1024, 1024, dtype=torch.float, device="cuda")
# warm up
for i in range(WARMUP_ITER):
    o = thunder_rope_one(x, cos, sin, rope_n_elem)
# measurement
for i in range(MEASURE_ITER):
    l2_clear_buffer.zero_()
    o = thunder_rope_one(x, cos, sin, rope_n_elem)
assert(o.allclose(o_ref))

@jjsjann123 jjsjann123 requested a review from naoyam November 9, 2024 21:12
@jjsjann123

I think the review comments have been addressed as well. CI was green before my benchmark run. Ready for a final review.

@jjsjann123

!test --pybench

@jjsjann123

The build failure seems to be flaky. I restarted the CI on that one and it passed internally.

Unfortunately it didn't update the GitHub status here. Not a big issue, but cc'ing @xwang233 in case this is something you aren't aware of.

NVF_ERROR(
    unique_uses.count(use),
    "cached_uses is not among the use of the TensorView");
target_uses.push_back(use);
Collaborator


Isn't this ordering still non-deterministic? I think the parameter itself needs to be deterministically ordered.

Collaborator Author


Good catch. Sorry I missed the cached_uses. Let me give it another try.

} else {
  // avoid non-determinism and ensure unique
  std::unordered_set<Expr*> unique_uses;
  auto this_uses = uses();
Collaborator


Is it possible to have duplicates?

@jjsjann123 jjsjann123 Nov 12, 2024


I don't think so, since uses_ is private and addUse skips duplicates:

bool Val::addUse(Expr* expr) {
  if (std::find(uses_.begin(), uses_.end(), expr) == uses_.end()) {
    uses_.push_back(expr);
    return true;
  }
  return false;
}

But I'm a bit wary of leaving it unchecked, since uses_ is just a std::vector and it's up to the implementation whether that stays true.

Collaborator


If so, let's make it an error if a duplicate is found.

Collaborator Author


Switched. I'll kick off CI to see if there are any surprises. 🤞
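
For what it's worth, a rough sketch of what the switched check could look like (a sketch only, reusing the unique_uses / uses() / NVF_ERROR names from the snippets above; not the actual diff):

// Collect this TensorView's uses, erroring out if the same Expr* ever shows up
// twice instead of silently de-duplicating it.
std::unordered_set<Expr*> unique_uses;
for (Expr* use : uses()) {
  NVF_ERROR(
      unique_uses.insert(use).second,
      "duplicated entry found in uses() of the TensorView");
}

insert().second is true only when the element wasn't already present, so a duplicate trips the error on its first repeat.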

@jjsjann123

!test --pybench


@naoyam naoyam left a comment


LGTM

@jjsjann123 jjsjann123 merged commit 2fb5539 into main Nov 13, 2024
52 checks passed
@jjsjann123 jjsjann123 deleted the jjsjann123/pad_vec_analysis branch November 13, 2024 04:54