Commit
Don't initialize TMA output buffer (#3105)
TMA fills zeros automatically, so the initialization is unnecessary. Moreover, the initialization races with TMA itself, because there is no synchronization between the initialization and the TMA operation.

Matmul perf after enabling circular buffering:

```markdown
 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     57.3          3218718          1  3218718.0  3218718.0   3218718   3218718          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     12.5           700153          1   700153.0   700153.0    700153    700153          0.0  nvjet_hsh_192x192_64x3_2x1_v_bz_coopB_NTN
```
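To illustrate why the initialization is redundant, here is a minimal CPU-side sketch in Python. It is not nvFuser or CUDA code: `tma_load_tile` is a hypothetical model that assumes a tile load writes every element of the destination and writes zero for elements outside the source tensor. Under that assumption, the prior contents of the buffer do not affect the result.

```python
# Hypothetical CPU model of a TMA tile load; not nvFuser or CUDA code.
def tma_load_tile(src, row0, col0, tile_h, tile_w, dst):
    """Overwrite every element of dst with the tile of src at (row0, col0).

    Elements that fall outside src are written as 0.0, modelling the
    zero fill described in the commit message (an assumption of this sketch).
    """
    rows = len(src)
    cols = len(src[0]) if rows else 0
    for i in range(tile_h):
        for j in range(tile_w):
            r, c = row0 + i, col0 + j
            in_bounds = 0 <= r < rows and 0 <= c < cols
            dst[i][j] = src[r][c] if in_bounds else 0.0
    return dst


src = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0]]

# One buffer pre-initialized to zero, one left holding stale values.
zeroed = [[0.0] * 2 for _ in range(2)]
stale = [[-99.0] * 2 for _ in range(2)]

# The 2x2 tile at (1, 2) overlaps only the bottom-right element of src,
# so three of its four elements are out of bounds.
a = tma_load_tile(src, 1, 2, 2, 2, zeroed)
b = tma_load_tile(src, 1, 2, 2, 2, stale)
print(a == b)
```

Both buffers end up identical, which is the sense in which the explicit zeroing adds nothing. The sketch does not model the race; on the GPU, an unsynchronized initialization could additionally overwrite data that TMA has already delivered.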