
Adding resize(PadOp) vectorization analysis #3321

Merged: jjsjann123 merged 100 commits from jjsjann123/pad_vec_analysis into main on Nov 13, 2024

Conversation


@jjsjann123 jjsjann123 commented Oct 31, 2024

Adds conditional support for resize in vectorization analysis. This PR allows vectorized loads on PadOp directly, without going through a cached load, which improves the performance of the generated kernel.

What's in this PR:

  1. Add a propagation rule for resize in vectorization analysis (a minimal sketch follows this list). The propagation rule works as follows:
    i. For a supported resize: a) project the resize op onto the frontier and clear (frontier.begin(), resize_position); b) add the projected extent of the new resize op as gcd(id_from, resize_op->leftExpand(), resize_op->rightExpand()).
    ii. For an unsupported resize: clear [frontier.begin(), resize_position]; no behavior change.

  2. Update TensorView::cacheAfter to opt in a set of uses to cache while leaving the other uses unchanged. This is necessary when an input is consumed by a PadOp as well as by other operations that rely on a cached load for vectorization.
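
A minimal standalone sketch of the supported-resize rule from item 1 (illustrative only, not the actual vectorize_helper.cpp code; the function and variable names here are made up):

#include <cstdint>
#include <iostream>
#include <numeric> // std::gcd

// For a supported resize (e.g. from a PadOp), the extent we can still vectorize
// across is limited by both pad amounts, so the projected extent becomes
// gcd(incoming projected extent, left expand, right expand).
int64_t projectResizeExtent(
    int64_t projected_extent,
    int64_t left_expand,
    int64_t right_expand) {
  return std::gcd(projected_extent, std::gcd(left_expand, right_expand));
}

int main() {
  // Example: an innermost extent of 32 padded by 4 on each side still allows a
  // vectorization factor of 4; padding by 3 on one side drops it to 1.
  std::cout << projectResizeExtent(32, 4, 4) << "\n"; // 4
  std::cout << projectResizeExtent(32, 3, 4) << "\n"; // 1
}

Taking the gcd with both expand factors is presumably what keeps a vectorized access from straddling the boundary introduced by the pad.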

Follow-up to #3261.
Part of the work on RoPE performance. Design doc:

Review threads on:
csrc/tensor_view.cpp
csrc/preseg_passes/move_pad.cpp
tests/cpp/test_resize.cpp
csrc/scheduler/vectorize_helper.cpp

@naoyam naoyam left a comment


Overall it looks good. I'd just like the few things I commented on to be addressed.

@naoyam

naoyam commented Nov 8, 2024

!test --pybench

@naoyam

naoyam commented Nov 8, 2024

Initiated testing with python benchmarks just in case.

@jjsjann123

Thanks, I'll address the issues you brought up, and also run through some real-size problems so we get a sense of the perf impact. 🙇

@jjsjann123

!test --pybench

@jjsjann123

!test --pybench

@jjsjann123

Took a quick look at the perf. The end-to-end time looks very noisy, and I'm a bit unsure about my measuring script, so I instead used nsys to measure the kernel time.

On A100 80GB PCIe, peak bandwidth is 2TB/s.

  • At bsz 256 with bf16, we are looking at an IO size of roughly 512MB ((256 * 1024 * 16 * 32 * 2 * 2 + 1024 * 8 * 2 * 2) / 1024 / 1024). The main-branch kernel time is roughly 510µs, achieving ~1TB/s. With vectorization, we bring it down to 368µs, ~1.39TB/s (a quick sanity check of the arithmetic is sketched below).
  • At bsz 516 with bf16, main gives 1013µs, ~1.01TB/s. With vectorization, 732µs, ~1.40TB/s.
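
As a sanity check on the arithmetic above, a small C++ snippet (the shapes mirror the measuring script below; the kernel times are the nsys numbers quoted above, and MB/µs is used as a rough stand-in for TB/s):

#include <cstdint>
#include <cstdio>

int main() {
  // Bytes moved at bsz 256: x is read and written back (bf16 = 2 bytes, read + write),
  // plus the small cos/sin tensors.
  const int64_t io_bytes =
      256LL * 1024 * 16 * 32 * 2 * 2 + 1024LL * 8 * 2 * 2;
  const double io_mb = static_cast<double>(io_bytes) / 1024.0 / 1024.0;
  std::printf("IO size: ~%.0f MB\n", io_mb);                       // ~512 MB
  // Kernel times from nsys, in microseconds (main vs. this PR).
  std::printf("main (510 us):       ~%.2f TB/s\n", io_mb / 510.0); // ~1.00
  std::printf("vectorized (368 us): ~%.2f TB/s\n", io_mb / 368.0); // ~1.39
}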

Something like this vvv.

import torch
import thunder
def rope_one_entry(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, rope_n_elem: int) -> torch.Tensor:
    x_rope = x[..., : rope_n_elem]
    x1 = x_rope[..., : rope_n_elem // 2]  # (B, nh, T, hs/2)
    x2 = x_rope[..., rope_n_elem // 2 :]  # (B, nh, T, hs/2)
    rotated = torch.cat((-x2, x1), dim=-1)  # (B, nh, T, hs)
    roped = (x_rope * cos) + (rotated * sin)
    roped = roped.to(dtype=x.dtype)
    return torch.cat((roped, x[..., rope_n_elem :]), dim=-1)
dtype = torch.bfloat16
device = "cuda"
bsz = 256
block_size = 1024
n_head = 16
head_size = 32
n_query_groups = 4
rope_n_elem = 8
WARMUP_ITER = 5
MEASURE_ITER = 20
cos = torch.randn(block_size, rope_n_elem, device=device, dtype=dtype)
sin = torch.randn(block_size, rope_n_elem, device=device, dtype=dtype)
thunder_rope_one = thunder.jit(rope_one_entry, executors=("nvfuser",), nv_enable_bookend=False)
x = torch.randn([bsz, n_head, block_size, head_size], device=device, dtype=dtype)
# ref full run
o_ref = rope_one_entry(x.float(), cos.float(), sin.float(), rope_n_elem).to(dtype=dtype)
l2_clear_buffer = torch.empty(80, 1024, 1024, dtype=torch.float, device="cuda")
# warm up
for i in range(WARMUP_ITER):
    o = thunder_rope_one(x, cos, sin, rope_n_elem)
# measurement
for i in range(MEASURE_ITER):
    l2_clear_buffer.zero_()
    o = thunder_rope_one(x, cos, sin, rope_n_elem)
assert(o.allclose(o_ref))

@jjsjann123 jjsjann123 requested a review from naoyam November 9, 2024 21:12
@jjsjann123

I think the review comments have been addressed as well. CI was green before my benchmark run. Ready for a final review.

@jjsjann123

!test --pybench

@jjsjann123

The build failure seems to be flaky. I restarted the CI on that one and it passed internally.

Unfortunately it didn't update the GitHub status here. Not a big issue, but cc'ing @xwang233 in case this is something you aren't aware of.

NVF_ERROR(
    unique_uses.count(use),
    "cached_uses is not among the use of the TensorView");
target_uses.push_back(use);
Collaborator


Isn't this ordering still non-deterministic? I think the parameter itself needs to be deterministically ordered.

Collaborator Author


Good catch. Sorry I missed the cached_uses. Let me give it another try.

} else {
  // avoid non-determinism and ensure unique
  std::unordered_set<Expr*> unique_uses;
  auto this_uses = uses();
Collaborator


Is it possible to have duplicates?

@jjsjann123 jjsjann123 Nov 12, 2024


I don't think so, since uses_ is private and addUse skips duplicates:

bool Val::addUse(Expr* expr) {
  if (std::find(uses_.begin(), uses_.end(), expr) == uses_.end()) {
    uses_.push_back(expr);
    return true;
  }
  return false;
}

But I'm a bit wary of leaving it unchecked, since uses_ is just a std::vector and it's up to the implementation whether that stays true.

Collaborator


If so, let's make it an error if a duplicate is found.

Collaborator Author


Switched. I'll kick off CI to see if there are any surprises. 🤞
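
For what it's worth, a rough sketch of what the switched check could look like (a sketch only, reusing the unique_uses / uses() / NVF_ERROR names from the snippets above; not the actual diff):

// Collect this TensorView's uses, erroring out if the same Expr* ever shows up
// twice instead of silently de-duplicating it.
std::unordered_set<Expr*> unique_uses;
for (Expr* use : uses()) {
  NVF_ERROR(
      unique_uses.insert(use).second,
      "duplicated entry found in uses() of the TensorView");
}

insert().second is true only when the element wasn't already present, so a duplicate trips the error on its first repeat.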

@jjsjann123

!test --pybench


@naoyam naoyam left a comment


LGTM

@jjsjann123 jjsjann123 merged commit 2fb5539 into main Nov 13, 2024
52 checks passed
@jjsjann123 jjsjann123 deleted the jjsjann123/pad_vec_analysis branch November 13, 2024 04:54