Tracking perf optimization of HopperMatmulTest.HSH_NT_128BSwizzle for problem size (M=2048, N=2048, K=8192), CTA tile size (128, 256) #3279
Comments
On H200: Initial perf as measured in #3281:
nvFuser/cuBLAS = |
zasdfgbnm added a commit that referenced this issue Oct 26, 2024
This shape makes more sense: #3137 (comment), #3279

Perf:
```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     43.2           205150          1  205150.0  205150.0    205150    205150          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     18.5            87550          1   87550.0   87550.0     87550     87550          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
```
nvFuser/cuBLAS = `42.7%`
On H200: There is a perf regression after the fix of elect-sync:

Perf:

| Time (%) | Total Time (ns) | Instances | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
|---|---|---|---|---|---|---|---|---|
| 47.8 | 247326 | 1 | 247326.0 | 247326.0 | 247326 | 247326 | 0.0 | `<unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…` |
| 17.0 | 88191 | 1 | 88191.0 | 88191.0 | 88191 | 88191 | 0.0 | `nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT` |

Perf nvFuser/cuBLAS: |
On H200: After #3294:

Perf:

| Time (%) | Total Time (ns) | Instances | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
|---|---|---|---|---|---|---|---|---|
| 39.0 | 172735 | 1 | 172735.0 | 172735.0 | 172735 | 172735 | 0.0 | `<unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…` |
| 20.0 | 88768 | 1 | 88768.0 | 88768.0 | 88768 | 88768 | 0.0 | `nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT` |

Perf nvFuser/cuBLAS: |
On H200: After #3314

Perf:

| Time (%) | Total Time (ns) | Instances | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
|---|---|---|---|---|---|---|---|---|
| 36.0 | 151775 | 1 | 151775.0 | 151775.0 | 151775 | 151775 | 0.0 | `<unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…` |
| 20.7 | 87135 | 1 | 87135.0 | 87135.0 | 87135 | 87135 | 0.0 | `nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT` |

nvFuser/cuBLAS = |
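
The nvFuser/cuBLAS figure used in this thread appears to be the cuBLAS kernel time divided by the nvFuser kernel time: for the Oct 26 commit, 87550 / 205150 ≈ 42.7%, which matches the number reported there. Below is a minimal Python sketch under that assumed definition, using the Total Time (ns) values from the H200 measurements in this thread. The names `relative_perf`, `summarize` and `H200_RUNS` and the row labels are illustrative only and are not part of nvFuser.

```python
def relative_perf(nvfuser_ns: float, cublas_ns: float) -> float:
    """cuBLAS kernel time as a percentage of nvFuser kernel time.

    Assumed definition of "nvFuser/cuBLAS": 100% means parity with cuBLAS,
    lower values mean the nvFuser kernel is slower.
    """
    if nvfuser_ns <= 0:
        raise ValueError("nvFuser time must be positive")
    return 100.0 * cublas_ns / nvfuser_ns


# (label, nvFuser total time in ns, cuBLAS total time in ns), copied from the
# nsys output quoted in this thread for the H200 runs.
H200_RUNS = [
    ("commit of Oct 26, 2024", 205150, 87550),
    ("after elect-sync fix", 247326, 88191),
    ("after #3294", 172735, 88768),
    ("after #3314", 151775, 87135),
]


def summarize(runs):
    """Return (label, percentage rounded to one decimal) for each run."""
    return [(label, round(relative_perf(nv, cb), 1)) for label, nv, cb in runs]


if __name__ == "__main__":
    for label, pct in summarize(H200_RUNS):
        print(f"{label:<24} nvFuser/cuBLAS = {pct}%")
```

Each run is a single kernel instance, so these percentages carry whatever run-to-run noise the profiler captured; the cuBLAS time alone varies between 87135 ns and 88768 ns across the four H200 runs.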
zasdfgbnm added a commit that referenced this issue Dec 10, 2024
When used with #3545, this contributes a speedup of 5% of cuBLAS!

Perf together with #3545 on H100:
```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     33.8           136319          1  136319.0  136319.0    136319    136319          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     22.7            91487          1   91487.0   91487.0     91487     91487          0.0  nvjet_hsh_128x256_64x4_2x1_v_bz_coopA_NTN
```
nvFuser/cuBLAS: 67%

Note that the above test was run with the smem epilogue disabled. I will run a test with everything combined later. Also note that this number is from an H100, which is different from the H200 used in #3279.
Unfortunately, I no longer get H200 machines from our cluster. Here is the result on H100 after #3545 and #3547:
nvFuser/cuBLAS: 67%
The CTA tile size (128, 256) is a size that can relatively easily achieve high math throughput. The problem size is carefully selected as one full wave. I believe this is a good incremental task.

Benchmark command:
Current perf on H100 on main as in the latest comment:
nvFuser/cuBLAS: 67%
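
The "one full wave" remark in the issue description can be checked with a little arithmetic. The Python sketch below counts output tiles for the problem and CTA tile sizes in the title and compares them with the SM count. It assumes one CTA per output tile (no split-K, so K does not affect the count), one resident CTA per SM, and 132 SMs, which is the figure for H100 SXM and H200; none of these assumptions is stated in the issue, and `cta_waves` is a hypothetical helper, not an nvFuser function.

```python
import math


def cta_waves(m: int, n: int, tile_m: int, tile_n: int, num_sms: int):
    """Return (number of output tiles, number of waves).

    Assumes one CTA per output tile and one resident CTA per SM.
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(tiles / num_sms)
    return tiles, waves


# Problem size and CTA tile size from the issue title.
M, N = 2048, 2048
TILE_M, TILE_N = 128, 256
# Assumption: 132 SMs (H100 SXM / H200); other SKUs have fewer.
NUM_SMS = 132

if __name__ == "__main__":
    tiles, waves = cta_waves(M, N, TILE_M, TILE_N, NUM_SMS)
    used = 100 * tiles / (waves * NUM_SMS)
    print(f"{tiles} CTA tiles on {NUM_SMS} SMs -> {waves} wave(s), "
          f"{used:.0f}% of SM slots used")
```

Under these assumptions the grid is 16 × 8 = 128 CTAs, which fits in a single wave on 132 SMs and leaves 4 SMs idle.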