[Transpiler] inlining TB_FORLOOP_ACCUM_NO_RED_OP operator #115
Labels: CUDA Transpiler (Issues and features related to the CUDA transpiler of Mirage)
An inefficiency of the current transpiler is that it allocates a new stensor for the output of ForloopAccumulator (no reduction), resulting in higher shared memory usage than necessary. One optimization we can enable in the transpiler is inlining the output of the ForloopAccumulator.
For example, consider the case where a threadblock matmul is followed by a ForloopAccumulator.
Currently, we allocate shared memory for both C and D. To reduce the shared memory requirement, we can directly accumulate the output of matmul into D.