I am looking at the Nsight Compute file in #3568 and noticed this:
cuBLAS: [profiler screenshot]
nvFuser: [profiler screenshot]
Why is this the case? As shown in https://github.com/NVIDIA/Fuser/blob/64ee035dc61c92da43e6da302c00fa79dea14dba/__tmp_kernel_none_f0_c0_r0_g0.cu, we are currently loading as:
Do we really need that many TMA instructions? Can some of them be batched?
Note that the number of TMA instructions is related to the memory layout we choose for A and B in smem. The `wgmma` instruction only requires that core matrices be linear in the M/N/K dimensions, and we have the freedom to choose our own memory layout as long as it is linear. I think we should choose the layout that can be loaded with as few TMA instructions as possible, and I am worried that we are not choosing the best one.

In our lowering, `proveLinearAndGetStride` does have the flexibility to choose a different layout as well, as long as it is linear. So if we do need to change the layout, I believe we only need to change the schedule.
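To make the layout/instruction-count relationship concrete, here is a hypothetical back-of-the-envelope model (not nvFuser code; the tile and core-matrix sizes below are illustrative assumptions, not taken from the kernel in question). The idea is simply that each TMA copy moves one contiguous box, so a layout where each core matrix is its own contiguous region forces one copy per core matrix, while a layout that keeps larger regions linear needs far fewer copies:

```python
# Hypothetical model of TMA instruction count vs. shared-memory layout.
# Assumptions (illustrative only): a 64x64 tile, and one TMA copy per
# maximal contiguous (contig_m x contig_k) box in the tile.

def tma_copies(tile_m, tile_k, contig_m, contig_k):
    """Number of TMA copies needed if each copy moves one contiguous box."""
    assert tile_m % contig_m == 0 and tile_k % contig_k == 0
    return (tile_m // contig_m) * (tile_k // contig_k)

# Layout A: each 8x8 core matrix is its own contiguous region -> many copies.
per_core_matrix = tma_copies(64, 64, 8, 8)

# Layout B: the whole tile is one linear region -> a single boxed copy.
whole_tile = tma_copies(64, 64, 64, 64)

print(per_core_matrix, whole_tile)  # 64 copies vs. 1 copy
```

Both layouts can satisfy the wgmma linearity requirement on the core matrices; they just differ by an order of magnitude in how many TMA issues the load takes, which is why the choice of smem layout matters here.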