[Transpiler] inlining TB_FORLOOP_ACCUM_NO_RED_OP operator #115
Labels: CUDA Transpiler (Issues and features related to the CUDA transpiler of Mirage)
An inefficiency of the current transpiler is that it allocates a new stensor for the output of ForloopAccumulator (no reduction), resulting in higher shared memory usage than necessary. One optimization we can enable in the transpiler is inlining the output of the ForloopAccumulator.
For example, consider the case where a threadblock matmul is followed by a ForloopAccumulator.
Currently, we allocate shared memory for both C and D. To reduce the shared memory requirement, we can directly accumulate the output of matmul into D.