
Validating CrossEntropyLoss Performance #278

Closed · kevinstephano opened this issue May 4, 2023 · 3 comments

kevinstephano (Collaborator) commented May 4, 2023

I made this code snippet to show the performance of CrossEntropyLoss.

import torch

class MyLoss(torch.nn.Module):
    def __init__(self):
        super(MyLoss, self).__init__()
        self.loss = torch.nn.CrossEntropyLoss()

    def forward(self, inputs, targets):
        out = self.loss(inputs, targets)
        return out

inputs = torch.randn(8192, 32768, device='cuda')
targets = torch.randint(32767, (8192,), device='cuda')

model = torch.compile(MyLoss())

for _ in range(5):
    out = model(inputs, targets)

Test command:

nsys nvprof --print-gpu-trace python my_loss.py

Sample output on A100:

Tensor Sizes:
inputs = [8192, 32768]
targets = [8192]

Kernel1: 1.222ms
Kernel2: 37.1us

4202242471        1222152    1059  8192     1     1   256     1     1       39         0.000         0.001                                                     NVIDIA A100 80GB PCIe (0)    1     7  triton__0d1d2d3d4d                                                                                  
4203465679          37121    1072     1     1     1   256     1     1      184         0.000         0.008                                                     NVIDIA A100 80GB PCIe (0)    1     7  triton__0d1d2d3d4d56d
csarofeen (Collaborator) commented
What's the effective bandwidth of the kernels?

naoyam (Collaborator) commented May 8, 2023

Kernel1 is around 880 GB/s. The other one is just a few GB/s and is negligible.
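
For reference, a rough back-of-envelope check of that figure (my assumption, not stated in the trace: Kernel1's traffic is dominated by a single read of the fp32 inputs tensor):

# Back-of-envelope effective bandwidth for Kernel1, assuming its traffic
# is dominated by one full read of the fp32 inputs tensor.
bytes_moved = 8192 * 32768 * 4            # inputs: [8192, 32768], float32
kernel_time_s = 1.222e-3                  # 1.222 ms from the trace above
print(bytes_moved / kernel_time_s / 1e9)  # ~879 GB/s, consistent with ~880 GB/s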

naoyam added a commit that referenced this issue May 12, 2023
For #278, our schedulers currently can't fuse these ops into a single
kernel, so segmentation needs to happen. Currently, the fusion is
segmented between softmax and take_along_axis, simply because of the
ordering of the segmenter. However, we want the take_along_axis op to be
fused with the preceding softmax, since the temporary output from the
first segment would then be much smaller, reducing gmem access overhead.
In the case of `[8192, 32768]`, the (logical) I/O cost of the intermediate
would be `1 / 32768` of what it is now.

This PR introduces a simple mechanism to allow preferred fusions in the
segmentation steps. Currently, the only preference is to fuse select-like
ops with their producers. "Select-like" here also includes index_select
and torch_gather to size-one domains. In those ops, the consumer tensor
is guaranteed to be no larger than the lookup tensor, so it makes sense
to fuse them with their producers.

Currently, it's only tested with the cross-entropy loss case. The
overall segmentation algorithm would need to go through significant
refactoring, so I don't think making this interface super robust is
worth doing at this moment, and it's likely to be redesigned. For now,
this is very important for the cross-entropy performance.
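
As an aside (not part of the commit message above): the segmentation being discussed maps onto the usual decomposition of cross-entropy into a log-softmax followed by a gather (take_along_axis) along the class dimension. A minimal PyTorch sketch of that decomposition, using the same [8192, 32768] shapes as the repro, shows why the intermediate between the two segments is so much larger than the gathered result; it is only an illustration, not necessarily the exact op sequence nvFuser produces:

import torch

inputs = torch.randn(8192, 32768, device='cuda')
targets = torch.randint(32767, (8192,), device='cuda')

# Segment 1: log-softmax over the class dimension.
# Its output is [8192, 32768] -- as large as the input itself.
log_probs = torch.log_softmax(inputs, dim=-1)

# Segment 2: gather (take_along_axis) of the target class per row, then
# the mean reduction. Its output is only [8192, 1], i.e. 1/32768 of the
# intermediate, which is why fusing the gather into the first segment
# shrinks what has to round-trip through gmem.
picked = torch.take_along_dim(log_probs, targets.unsqueeze(-1), dim=-1)
loss = -picked.mean()

# Matches torch.nn.functional.cross_entropy(inputs, targets) up to
# floating-point error.
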
kevinstephano (Collaborator, Author) commented
Closing, old.
