Why do we have 3.7x TMA instructions compared to cuBLAS? #3569

Open
zasdfgbnm opened this issue Dec 11, 2024 · 0 comments
zasdfgbnm commented Dec 11, 2024

I was looking at the Nsight Compute file in #3568 and noticed this:

cuBLAS: [Nsight Compute screenshot of TMA instruction count]

nvFuser: [Nsight Compute screenshot of TMA instruction count]

Why is this the case? As shown in https://github.com/NVIDIA/Fuser/blob/64ee035dc61c92da43e6da302c00fa79dea14dba/__tmp_kernel_none_f0_c0_r0_g0.cu, we are currently issuing the loads as:

        mbarrier::arriveExpectTX(toSmem((&T8[(i24 % 4)])), 8192U);
        #pragma unroll
        for(nvfuser_index_t i29 = 0; i29 < 4; ++i29) {
          Hopper::cpAsyncBulkTensorTileG2S((Hopper::CpAsyncBulkTensorTileG2SIndex<2>{ ptr4, (Array<nvfuser_index_t, 2, 1>{(i5 + (64 * i29)), i25}), toSmem((&T8[(i24 % 4)])) }), (i27 + (2048 * i29)));
        }
        mbarrier::arriveExpectTX(toSmem((&T8[(i24 % 4)])), 4096U);
        #pragma unroll
        for(nvfuser_index_t i30 = 0; i30 < 2; ++i30) {
          Hopper::cpAsyncBulkTensorTileG2S((Hopper::CpAsyncBulkTensorTileG2SIndex<2>{ ptr7, (Array<nvfuser_index_t, 2, 1>{(i8 + (64 * i30)), i25}), toSmem((&T8[(i24 % 4)])) }), (i28 + (2048 * i30)));
        }

Do we really need that many TMA instructions? Can some of them be batched?
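As a quick back-of-the-envelope (a sketch that reads only the expect-tx byte counts and loop trip counts visible in the snippet; which loop belongs to which operand is not asserted here):

    // Sketch: derive the per-call bytes and per-stage TMA count from the snippet above.
    // The 8192/4096 expect-tx values and the 4/2 trip counts come from the quoted
    // kernel; everything else is arithmetic.
    #include <cstdio>

    int main() {
      constexpr int expect_tx_first  = 8192;  // arriveExpectTX before the first loop
      constexpr int calls_first      = 4;     // trip count of the first unrolled loop
      constexpr int expect_tx_second = 4096;  // arriveExpectTX before the second loop
      constexpr int calls_second     = 2;     // trip count of the second unrolled loop

      std::printf("bytes per TMA call (first loop) : %d\n", expect_tx_first / calls_first);    // 2048
      std::printf("bytes per TMA call (second loop): %d\n", expect_tx_second / calls_second);  // 2048
      std::printf("TMA calls per stage             : %d\n", calls_first + calls_second);       // 6
      // If the smem layout allowed each operand to be covered by one larger box,
      // the same 12 KB per stage could in principle be moved with 2 calls.
      return 0;
    }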

Note that the number of TMA instructions is tied to the memory layout we choose for A and B in smem. The wgmma instruction only requires that the core matrices are linear along the M/N/K dimensions, so we are free to choose our own memory layout as long as it is linear. I think we should pick the layout that can be loaded with as few TMA instructions as possible, and I am worried that we are not choosing the best one.
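To make the layout/instruction-count connection concrete, here is a minimal host-side sketch (not nvFuser code; the sizes, the K-innermost ordering, and the 128B swizzle are assumptions for illustration) of how the box size encoded in the TMA descriptor determines how many cp.async.bulk.tensor issues it takes to cover one CTA tile:

    // Sketch only: illustrates that the TMA box size (cuTensorMapEncodeTiled's boxDim)
    // sets the number of TMA instructions needed per CTA tile. All dimensions are
    // hypothetical and error handling is omitted.
    #include <cuda.h>
    #include <cuda_fp16.h>

    CUtensorMap makeOperandMap(void* gmemPtr, cuuint32_t boxOuter) {
      CUtensorMap tmap;
      // Hypothetical row-major operand with the K extent innermost.
      cuuint64_t globalDim[2]    = {4096, 4096};             // {K, M}
      cuuint64_t globalStride[1] = {4096 * sizeof(__half)};  // stride of dim 1, in bytes
      // Inner box = 64 halves = 128 bytes, the limit for a 128B swizzle.
      cuuint32_t boxDim[2]       = {64, boxOuter};
      cuuint32_t elemStride[2]   = {1, 1};
      cuTensorMapEncodeTiled(&tmap, CU_TENSOR_MAP_DATA_TYPE_FLOAT16, /*rank=*/2,
                             gmemPtr, globalDim, globalStride, boxDim, elemStride,
                             CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_128B,
                             CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
                             CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
      return tmap;
    }

    // With boxOuter = 64, a 256-row tile needs 4 TMA issues per stage; with
    // boxOuter = 256 and an smem layout that is contiguous across those 256 rows,
    // a single issue covers the same tile.

Whether the larger box is legal in our case depends on the smem layout the scheduler picks, which is exactly the question above.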

In our lowering, proveLinearAndGetStride also has the flexibility to choose a different layout, as long as it is linear. So if we do need to change the layout, I believe we only need to change the schedule.
