
[Operator] Init NLL_LOSS #269

Merged
merged 17 commits into master from nll_loss on Jan 15, 2025

Conversation

GwokHiujin
Collaborator

A basic implementation of NLL_LOSS has been pushed.

Based on the performance test results summarized earlier, we believe that using the gather operation would lead to a more efficient implementation (judging by the latency figures, this also appears to be how torch does it), and we will push forward with this optimization.
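
For context, a minimal PyTorch-level sketch of the gather-based formulation (my illustration of the idea, assuming 2D input, mean/sum/none reductions, and no ignore_index handling; not the Triton kernel in this PR):

import torch

def nll_loss_gather(logp, target, weight=None, reduction="mean"):
    # logp: (N, C) log-probabilities, target: (N,) class indices
    N, C = logp.shape
    if weight is None:
        weight = logp.new_ones((C,))
    w = weight[target]                                        # per-sample weight
    picked = logp.gather(1, target.unsqueeze(1)).squeeze(1)   # logp[i, target[i]]
    loss = -w * picked
    if reduction == "none":
        return loss
    if reduction == "sum":
        return loss.sum()
    return loss.sum() / w.sum()                               # weighted mean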

@tongxin tongxin self-assigned this Nov 10, 2024
@tongxin tongxin requested a review from StrongSpoon November 10, 2024 15:06
tongxin
tongxin previously approved these changes Nov 15, 2024
Contributor

@tongxin tongxin left a comment

LGTM

@@ -135,6 +141,13 @@ def cumsum_input_fn(shape, cur_dtype, device):
        FLOAT_DTYPES + INT_DTYPES,
        marks=pytest.mark.cumsum,
    ),
    pytest.param(
        "nll_loss",
        torch.nn.NLLLoss,
Contributor

NLLLoss is a class. Can we use it as the reference function?

Collaborator Author

Indeed. I've updated it to torch.nn.functional.nll_loss.
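
For reference, the module is a callable class while the functional form is what gets passed as the reference function; a quick generic illustration (not code from this repo):

import torch
import torch.nn.functional as F

logp = torch.randn(4, 3).log_softmax(dim=1)
tgt = torch.tensor([0, 2, 1, 2])

loss_module = torch.nn.NLLLoss()(logp, tgt)  # class: instantiate, then call
loss_func = F.nll_loss(logp, tgt)            # function: call directly
assert torch.allclose(loss_module, loss_func)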


if weight is None:
    weight = torch.ones(
        [
Contributor

Use tuple

Collaborator Author

Done.
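
A tiny sketch of the suggested change (my reading of the nit; names are illustrative):

import torch

C = 8
inp = torch.randn(4, C)
weight = torch.ones((C,), dtype=inp.dtype, device=inp.device)  # tuple shape instead of a list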

tl.store(inp_grad_ptrs, inp_grad.to(tl.float32), mask=(inp_mask & ignore_mask))


class NLLLoss(torch.autograd.Function):
Contributor

This function is intended to be used as a substitute for nll_loss, whereas NLLLoss is already taken as the nn module name. We should avoid the name confusion.

Collaborator Author

Indeed. I've updated the class name.

if reduction == 0:
    res = out.to(inp.dtype)
elif reduction == 1:
    ctx.total_weight = sum(w_tgt).item()
Contributor

Shall we also add dim= args to avoid confusion?
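
One reading of the suggestion (an assumption about the intent, not the merged code): make the reduction dimension explicit instead of relying on the builtin sum() over a tensor.

import torch

w_tgt = torch.rand(1024)
out = torch.rand(1024)

total_weight = torch.sum(w_tgt, dim=0).item()   # explicit dim
res = torch.sum(out, dim=0) / total_weight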

tongxin
tongxin previously approved these changes Dec 12, 2024
Contributor

@tongxin tongxin left a comment

LGTM

@StrongSpoon
Collaborator

could you provide the performance data?

@GwokHiujin
Collaborator Author

> could you provide the performance data?

The general results are as follows:

Operator: nll_loss  Performance Test (dtype=torch.float16, mode=cuda, level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup         Size Detail
------------------------------------------------------------------------------------------
SUCCESS               0.008192            0.430080               0.019          [torch.Size([64, 64]), torch.Size([64])]
SUCCESS               0.014336            0.626688               0.023          [torch.Size([256, 256]), torch.Size([256])]
SUCCESS               0.039936            1.439744               0.028          [torch.Size([1024, 1024]), torch.Size([1024])]
SUCCESS               0.146432            4.753408               0.031          [torch.Size([4096, 4096]), torch.Size([4096])]
SUCCESS               0.039936            1.436672               0.028          [torch.Size([1024, 65536]), torch.Size([1024])]
SUCCESS               0.037888            1.439744               0.026          [torch.Size([1024, 1]), torch.Size([1024])]
SUCCESS               0.034816            1.430528               0.024          [torch.Size([1024, 16]), torch.Size([1024])]
SUCCESS               0.043008            1.440768               0.030          [torch.Size([1024, 256]), torch.Size([1024])]
SUCCESS               0.039936            1.432576               0.028          [torch.Size([1024, 4096]), torch.Size([1024])]


Operator: nll_loss  Performance Test (dtype=torch.float32, mode=cuda, level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup         Size Detail
------------------------------------------------------------------------------------------
SUCCESS               0.008192            0.413696               0.020          [torch.Size([64, 64]), torch.Size([64])]
SUCCESS               0.014336            0.601088               0.024          [torch.Size([256, 256]), torch.Size([256])]
SUCCESS               0.039936            1.414144               0.028          [torch.Size([1024, 1024]), torch.Size([1024])]
SUCCESS               0.146432            4.697088               0.031          [torch.Size([4096, 4096]), torch.Size([4096])]
SUCCESS               0.038912            1.420288               0.027          [torch.Size([1024, 65536]), torch.Size([1024])]
SUCCESS               0.036864            1.413120               0.026          [torch.Size([1024, 1]), torch.Size([1024])]
SUCCESS               0.034816            1.412096               0.025          [torch.Size([1024, 16]), torch.Size([1024])]
SUCCESS               0.043008            1.417216               0.030          [torch.Size([1024, 256]), torch.Size([1024])]
SUCCESS               0.038912            1.411072               0.028          [torch.Size([1024, 4096]), torch.Size([1024])]


Operator: nll_loss  Performance Test (dtype=torch.bfloat16, mode=cuda, level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup         Size Detail
------------------------------------------------------------------------------------------
SUCCESS               0.008192            0.431104               0.019          [torch.Size([64, 64]), torch.Size([64])]
SUCCESS               0.014336            0.621568               0.023          [torch.Size([256, 256]), torch.Size([256])]
SUCCESS               0.039936            1.448960               0.028          [torch.Size([1024, 1024]), torch.Size([1024])]
SUCCESS               0.146432            4.698112               0.031          [torch.Size([4096, 4096]), torch.Size([4096])]
SUCCESS               0.039936            1.448960               0.028          [torch.Size([1024, 65536]), torch.Size([1024])]
SUCCESS               0.037888            1.452032               0.026          [torch.Size([1024, 1]), torch.Size([1024])]
SUCCESS               0.034816            1.444864               0.024          [torch.Size([1024, 16]), torch.Size([1024])]
SUCCESS               0.043008            1.442816               0.030          [torch.Size([1024, 256]), torch.Size([1024])]
SUCCESS               0.038912            1.446912               0.027          [torch.Size([1024, 4096]), torch.Size([1024])]

The performance is somewhat poor for now. As mentioned before, we may use gather later for optimization.

inp, tgt, w, w_tgt, out, ignore_index, N, C
)

ctx.save_for_backward(inp, tgt, w)
Collaborator

Only saving tensors and variables when the input requires gradients might decrease the cost.
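
A minimal sketch of the idea (my illustration using a plain PyTorch forward with mean reduction, no weights, and no ignore_index; not the PR's Triton-backed code):

import torch

class _NllLossSketch(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp, tgt):
        # inp: (N, C) log-probabilities, tgt: (N,) class indices
        picked = inp.gather(1, tgt.unsqueeze(1)).squeeze(1)
        if inp.requires_grad:              # skip the bookkeeping when no grad is needed
            ctx.save_for_backward(tgt)
            ctx.shape = inp.shape
        return (-picked).mean()

    @staticmethod
    def backward(ctx, grad_out):
        (tgt,) = ctx.saved_tensors
        N, C = ctx.shape
        grad = grad_out.new_zeros(N, C)
        grad[torch.arange(N, device=tgt.device), tgt] = -grad_out / N
        return grad, None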

    res = out.to(inp.dtype)
elif reduction == 1:
    ctx.total_weight = sum(w_tgt).item()
    res = sum(out).to(inp.dtype) / ctx.total_weight
Collaborator

Suggest fusing the sum into the forward kernel, referencing cross_entropy_loss.
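
A rough sketch of the fused-sum idea (my illustration, not the merged kernel): instead of writing per-sample losses and summing them on the host, each program block accumulates its partial loss and weight sums into global accumulators with atomic adds, directly in the forward pass. The layout and names below are assumptions.

import triton
import triton.language as tl

@triton.jit
def nll_loss_fwd_fused_sum(logp_ptr, tgt_ptr, acc_ptr, N, C, BLOCK_N: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK_N + tl.arange(0, BLOCK_N)
    mask = offs < N
    tgt = tl.load(tgt_ptr + offs, mask=mask, other=0)
    # logp is (N, C) row-major; pick logp[i, tgt[i]] for each sample in the block
    logp = tl.load(logp_ptr + offs * C + tgt, mask=mask, other=0.0)
    # acc_ptr[0] accumulates the loss sum, acc_ptr[1] the weight sum (unit weights here)
    tl.atomic_add(acc_ptr, tl.sum(-logp, axis=0))
    tl.atomic_add(acc_ptr + 1, tl.sum(mask.to(tl.float32), axis=0))

The mean reduction is then acc[0] / acc[1], computed either on the host or, as in the later revision of this PR, by a final masked store inside the kernel.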

@StrongSpoon
Collaborator

performance after optimization:
[screenshot: performance results after optimization, 2025-01-06 14:26:09]



@libentry()
@triton.autotune(
configs=[triton.Config({"BLOCK_N": n}, num_warps=4) for n in [256, 512, 1024]],
configs=[triton.Config({"BLOCK_N": n}, num_warps=4) for n in [1, 16, 256]],
Contributor

Why does BLOCK_N vary so much?

Contributor

I think BLOCK = 128 is a good pick.

Collaborator

what about [1, 4, 32, 128]?

for c in [256, 512, 1024]
for d in [1, 4, 16]
],
configs=[triton.Config({"BLOCK_D": d}, num_warps=4) for d in [1, 4, 16]],
Contributor

Considering this is a gather/scatter-like kernel, BLOCK = 128 should be good enough.
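
A sketch of that suggestion (my illustration with assumed kernel names, not the merged code): for a gather/scatter-like kernel, a single fixed block size can replace a wide autotune sweep.

import triton
import triton.language as tl

@triton.autotune(
    configs=[triton.Config({"BLOCK_N": 128}, num_warps=4)],  # one fixed block size
    key=["N"],
)
@triton.jit
def gather_like_kernel(src_ptr, idx_ptr, dst_ptr, N, BLOCK_N: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK_N + tl.arange(0, BLOCK_N)
    mask = offs < N
    idx = tl.load(idx_ptr + offs, mask=mask, other=0)
    val = tl.load(src_ptr + idx, mask=mask, other=0.0)
    tl.store(dst_ptr + offs, val, mask=mask)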

tongxin
tongxin previously approved these changes Jan 9, 2025
Contributor

@tongxin tongxin left a comment

LGTM

@StrongSpoon
Collaborator

the latest performance:
[screenshot: latest performance results, 2025-01-15 09:43:19]

Comment on lines 49 to 55
tl.atomic_add(out_ptr + 2, 1, sem="release")  # counter
counter = tl.load(out_ptr + 2)
total_out = tl.load(out_ptr)
total_wgt = tl.load(out_ptr + 1)
tl.store(
    out_ptr + 3, total_out / total_wgt, mask=(counter == tl.num_programs(0))
)
Contributor

I think it's safer to use a stronger memory order for the counter update in line 49. Then we're ensured that only one CTA does the rest.

Collaborator

Release is enough here. It's safe even if more than one CTA satisfies the condition.
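
For context, a sketch of the variant where exactly one CTA finalizes the reduction, using the old value returned by the atomic itself (an illustration of the pattern under discussion, not the merged kernel; whether release ordering suffices is precisely the point debated above):

import triton
import triton.language as tl

@triton.jit
def finalize_mean_kernel(acc_ptr):
    # assumed layout: acc[0] = total_out, acc[1] = total_wgt, acc[2] = counter, acc[3] = result
    prev = tl.atomic_add(acc_ptr + 2, 1, sem="acq_rel")   # returns the pre-increment value
    is_last = prev == tl.num_programs(0) - 1              # exactly one program sees this
    total_out = tl.load(acc_ptr)
    total_wgt = tl.load(acc_ptr + 1)
    tl.store(acc_ptr + 3, total_out / total_wgt, mask=is_last)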

Contributor

@tongxin tongxin left a comment

Looks good now.

@StrongSpoon StrongSpoon merged commit 08796d1 into master Jan 15, 2025
8 of 9 checks passed
@StrongSpoon StrongSpoon deleted the nll_loss branch January 15, 2025 09:33