Why do we have 3.7x TMA instructions compared to cuBLAS? #3569

Open
zasdfgbnm opened this issue Dec 11, 2024 · 0 comments
zasdfgbnm commented Dec 11, 2024

I was looking at the Nsight Compute file in #3568 and noticed this:

cuBLAS: [Nsight Compute screenshot of TMA instruction count]

nvFuser: [Nsight Compute screenshot of TMA instruction count]

Why is this the case? As shown in https://github.com/NVIDIA/Fuser/blob/64ee035dc61c92da43e6da302c00fa79dea14dba/__tmp_kernel_none_f0_c0_r0_g0.cu, we are currently issuing the loads as:

        mbarrier::arriveExpectTX(toSmem((&T8[(i24 % 4)])), 8192U);
        #pragma unroll
        for(nvfuser_index_t i29 = 0; i29 < 4; ++i29) {
          Hopper::cpAsyncBulkTensorTileG2S((Hopper::CpAsyncBulkTensorTileG2SIndex<2>{ ptr4, (Array<nvfuser_index_t, 2, 1>{(i5 + (64 * i29)), i25}), toSmem((&T8[(i24 % 4)])) }), (i27 + (2048 * i29)));
        }
        mbarrier::arriveExpectTX(toSmem((&T8[(i24 % 4)])), 4096U);
        #pragma unroll
        for(nvfuser_index_t i30 = 0; i30 < 2; ++i30) {
          Hopper::cpAsyncBulkTensorTileG2S((Hopper::CpAsyncBulkTensorTileG2SIndex<2>{ ptr7, (Array<nvfuser_index_t, 2, 1>{(i8 + (64 * i30)), i25}), toSmem((&T8[(i24 % 4)])) }), (i28 + (2048 * i30)));
        }

Do we really need that many TMA instructions? Can some of them be batched?
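As a quick back-of-the-envelope (a sketch that reads only the expect-tx byte counts and loop trip counts visible in the snippet; which loop belongs to which operand is not asserted here):

    // Sketch: derive the per-call bytes and per-stage TMA count from the snippet above.
    // The 8192/4096 expect-tx values and the 4/2 trip counts come from the quoted
    // kernel; everything else is arithmetic.
    #include <cstdio>

    int main() {
      constexpr int expect_tx_first  = 8192;  // arriveExpectTX before the first loop
      constexpr int calls_first      = 4;     // trip count of the first unrolled loop
      constexpr int expect_tx_second = 4096;  // arriveExpectTX before the second loop
      constexpr int calls_second     = 2;     // trip count of the second unrolled loop

      std::printf("bytes per TMA call (first loop) : %d\n", expect_tx_first / calls_first);    // 2048
      std::printf("bytes per TMA call (second loop): %d\n", expect_tx_second / calls_second);  // 2048
      std::printf("TMA calls per stage             : %d\n", calls_first + calls_second);       // 6
      // If the smem layout allowed each operand to be covered by one larger box,
      // the same 12 KB per stage could in principle be moved with 2 calls.
      return 0;
    }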

Note that the number of TMA instructions is tied to the memory layout we choose for A and B in smem. The wgmma instruction only requires that the core matrices are linear along the M/N/K dimensions, so we are free to choose our own memory layout as long as it is linear. I think we should pick the layout that can be loaded with as few TMA instructions as possible, and I am worried that we are not choosing the best one.
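To make the layout/instruction-count connection concrete, here is a minimal host-side sketch (not nvFuser code; the sizes, the K-innermost ordering, and the 128B swizzle are assumptions for illustration) of how the box size encoded in the TMA descriptor determines how many cp.async.bulk.tensor issues it takes to cover one CTA tile:

    // Sketch only: illustrates that the TMA box size (cuTensorMapEncodeTiled's boxDim)
    // sets the number of TMA instructions needed per CTA tile. All dimensions are
    // hypothetical and error handling is omitted.
    #include <cuda.h>
    #include <cuda_fp16.h>

    CUtensorMap makeOperandMap(void* gmemPtr, cuuint32_t boxOuter) {
      CUtensorMap tmap;
      // Hypothetical row-major operand with the K extent innermost.
      cuuint64_t globalDim[2]    = {4096, 4096};             // {K, M}
      cuuint64_t globalStride[1] = {4096 * sizeof(__half)};  // stride of dim 1, in bytes
      // Inner box = 64 halves = 128 bytes, the limit for a 128B swizzle.
      cuuint32_t boxDim[2]       = {64, boxOuter};
      cuuint32_t elemStride[2]   = {1, 1};
      cuTensorMapEncodeTiled(&tmap, CU_TENSOR_MAP_DATA_TYPE_FLOAT16, /*rank=*/2,
                             gmemPtr, globalDim, globalStride, boxDim, elemStride,
                             CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_128B,
                             CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
                             CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
      return tmap;
    }

    // With boxOuter = 64, a 256-row tile needs 4 TMA issues per stage; with
    // boxOuter = 256 and an smem layout that is contiguous across those 256 rows,
    // a single issue covers the same tile.

Whether the larger box is legal in our case depends on the smem layout the scheduler picks, which is exactly the question above.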

In our lowering, proveLinearAndGetStride also has the flexibility to choose a different layout, as long as it is linear. So if we do need to change the layout, I believe we only need to change the schedule.
