
[Transpiler] inlining TB_FORLOOP_ACCUM_NO_RED_OP operator #115

Open
jiazhihao opened this issue Oct 24, 2024 · 0 comments
Labels: CUDA Transpiler (Issues and features related to the CUDA transpiler of Mirage)

Comments

@jiazhihao (Member)

An inefficiency of the current transpiler is that it allocates a new stensor for the output of each ForloopAccumulator (no reduction op), resulting in higher shared-memory usage than necessary. One optimization we can enable in the transpiler is inlining the output of the ForloopAccumulator into its producer.

For example, consider the case where a threadblock matmul is followed by a ForloopAccumulator.

C = matmul(A, B)
D = forloop_accum(C)

Currently, we allocate shared memory for both C and D. To reduce the shared-memory requirement, we can accumulate the output of the matmul directly into D, eliminating the intermediate stensor for C.
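A minimal NumPy sketch of the dataflow transformation (not Mirage's actual codegen; buffer names, tile sizes, and the `unfused`/`fused` helpers are hypothetical). The threadblock forloop iterates over K-tiles; the unfused version materializes each partial matmul in a separate buffer C before accumulating into D, while the fused version accumulates directly into D:

```python
import numpy as np

# Hypothetical tile setup: the forloop walks K in TILE_K-sized chunks.
M, N, K, TILE_K = 8, 8, 32, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((M, K)).astype(np.float32)
B = rng.standard_normal((K, N)).astype(np.float32)

def unfused(A, B):
    # Current behavior: each iteration writes the matmul output to a
    # separate stensor C, then the ForloopAccumulator adds C into D.
    # Two shared-memory buffers (C and D) are live at once.
    D = np.zeros((M, N), dtype=np.float32)
    for k0 in range(0, K, TILE_K):
        C = A[:, k0:k0 + TILE_K] @ B[k0:k0 + TILE_K, :]  # stensor C
        D += C                                           # accumulator D
    return D

def fused(A, B):
    # Proposed inlining: accumulate the matmul output directly in D,
    # so the intermediate stensor C is never allocated.
    D = np.zeros((M, N), dtype=np.float32)
    for k0 in range(0, K, TILE_K):
        D += A[:, k0:k0 + TILE_K] @ B[k0:k0 + TILE_K, :]
    return D

# Both variants compute the same result; the fused one uses one fewer
# shared-memory buffer.
assert np.allclose(unfused(A, B), fused(A, B), atol=1e-4)
```

In the generated CUDA, this would correspond to having the matmul write its per-iteration result into the accumulator's stensor (or registers) rather than into a freshly allocated output stensor.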

@jiazhihao jiazhihao assigned xinhaoc and jiazhihao and unassigned jiazhihao Oct 24, 2024
@jiazhihao jiazhihao added the CUDA Transpiler Issues and features related to the CUDA transpiler of Mirage label Oct 24, 2024