Issue #278: Validating CrossEntropyLoss Performance (closed)
What's the effective bandwidth of the kernels?

Kernel1 is around 880 GB/s. The other one is just a few GB/s and is negligible.
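As a rough sanity check (my own accounting, not from the issue): with the fp32 `[8192, 32768]` input reported in this issue, a single read of the input in Kernel1's 1.222 ms works out to roughly the quoted figure, assuming output writes are small enough to ignore:

```python
# Back-of-envelope effective bandwidth for Kernel1, assuming it reads
# the fp32 [8192, 32768] input exactly once (writes are comparatively
# tiny, so they are ignored here).
elements = 8192 * 32768
bytes_read = elements * 4          # 4 bytes per fp32 element
time_s = 1.222e-3                  # Kernel1 time from the report
bandwidth_gb_s = bytes_read / time_s / 1e9
print(f"{bandwidth_gb_s:.0f} GB/s")  # ≈ 879 GB/s, matching "around 880 GB/s"
```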
This was referenced May 10, 2023

naoyam added a commit that referenced this issue on May 12, 2023:
For #278, our schedulers currently can't fuse these ops into a single kernel, so segmentation has to happen. Currently it's segmented between softmax and take_along_axis, purely because of the ordering of the segmenter. However, we want the take_along_axis op to be fused with the preceding softmax, since then the temporary output of the first segment would be much smaller, reducing gmem access overhead. In the `[8192, 32768]` case, the (logical) I/O cost would shrink by a factor of `1 / 32768`.

This PR introduces a simple mechanism to express preferred fusions during the segmentation steps. Currently, the only preference is fusing select-like ops with their producers. "Select-like" here also includes index_select and torch_gather on size-one domains. In those ops, the consumer tensor is guaranteed to be no larger than the lookup tensor, so fusing them with their producers makes sense. So far this is only tested with the cross-entropy loss case.

The overall segmentation algorithm will need to go through significant refactoring, so I don't think making this interface highly robust is worth doing at this moment; it's likely to be redesigned. For now, though, this matters a lot for cross-entropy performance.
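The `1 / 32768` figure follows directly from the shapes: segmented, the first segment must write the full `[8192, 32768]` softmax output to gmem; fused through take_along_axis, only one gathered value per row crosses the segment boundary. A quick sketch of that accounting (shapes are from the issue; the byte counting itself is my assumption):

```python
# Size of the intermediate tensor written to gmem between segments (fp32).
rows, cols = 8192, 32768

segmented_bytes = rows * cols * 4  # full softmax output, [8192, 32768]
fused_bytes = rows * 1 * 4         # gathered values only, [8192, 1]

ratio = fused_bytes / segmented_bytes
print(ratio)  # 1 / 32768 ≈ 3.05e-05
```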
Closing, old.
I made this code snippet to show the perf of CrossEntropyLoss.

Test command:

Sample output on A100:

Tensor Sizes:
  inputs  = [8192, 32768]
  targets = [8192]
Kernel1: 1.222ms
Kernel2: 37.1us