Unroll warp-specialized loops (#3547)
When used with #3545, this contributes a speedup of 5% of cuBLAS!

Perf together with #3545 on H100:

```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     33.8           136319          1  136319.0  136319.0    136319    136319          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     22.7            91487          1   91487.0   91487.0     91487     91487          0.0  nvjet_hsh_128x256_64x4_2x1_v_bz_coopA_NTN
```

nvFuser/cuBLAS: 67%

Note that the above test was run with the smem epilogue disabled. I will run a test with everything combined later. Also note that this number is from an H100, which is different from the H200 used in #3279.
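The 67% figure can be reproduced from the two kernel times in the profile. The snippet below is a minimal sketch of that arithmetic, assuming the `nvjet_hsh_...` row is the cuBLAS kernel and that the ratio is taken as cuBLAS time over nvFuser time; the helper name `relative_performance` is hypothetical and not part of nvFuser.

```python
# Total kernel times in nanoseconds, copied from the profile output.
NVFUSER_NS = 136319  # nvfuser_none_f0_c0_r0_g0
CUBLAS_NS = 91487    # nvjet_hsh_128x256_64x4_2x1_v_bz_coopA_NTN (assumed to be cuBLAS)


def relative_performance(baseline_ns: float, candidate_ns: float) -> float:
    """Return the candidate's speed as a fraction of the baseline's.

    A shorter candidate time gives a larger value; 1.0 means parity.
    """
    return baseline_ns / candidate_ns


ratio = relative_performance(CUBLAS_NS, NVFUSER_NS)
print(f"nvFuser/cuBLAS: {ratio:.0%}")
```

Since both kernels have a single instance, the total, average, and median columns are identical, so any of them gives the same ratio.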