[Perf] Reduce peak memory usage of llama #10339
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Maintaining multiple names here will cause both tensors to be refcounted, which increases the peak memory. This will manifest as more blocks stacked on top of each other in the memory profile. Signed-off-by: andoorve <[email protected]>
Force-pushed from 358dd7e to 5625ebe
Great idea! We could apply this to many other models.
```diff
@@ -90,8 +90,8 @@ def __init__(
         self.act_fn = SiluAndMul()

     def forward(self, x):
-        gate_up, _ = self.gate_up_proj(x)
```
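For context, a minimal paraphrased sketch of the pattern this diff applies to `LlamaMLP.forward` (the surrounding code is abridged, not the exact vLLM source; `gate_up_proj`, `act_fn`, and `down_proj` are the module's existing layers):

```python
# Before: `gate_up` stays referenced until the function returns, so the
# gate/up projection output is still alive while `down_proj` runs.
def forward(self, x):
    gate_up, _ = self.gate_up_proj(x)
    x = self.act_fn(gate_up)
    x, _ = self.down_proj(x)   # `gate_up` is still refcounted here
    return x

# After: reusing the single name `x` drops this frame's reference to each
# intermediate as soon as the next result is bound, so the projection
# output can be freed before `down_proj` allocates its output.
def forward(self, x):
    x, _ = self.gate_up_proj(x)
    x = self.act_fn(x)         # the gate/up projection output is freed here
    x, _ = self.down_proj(x)
    return x
```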
I think `torch.compile` can do something similar, without renaming variables. To keep the original semantics, maybe adding `del x` would be more intuitive.
> I think `torch.compile` can do something similar, without renaming variables.

Yes, it can completely alleviate this problem, even when we consider cross-function refcounting, which I'll cover in my investigation write-up.

> To keep the original semantics, maybe adding `del x` would be more intuitive.

I think you mean `del gate_up` in this case? Yes, indeed we can add `del`s and make the variable names more descriptive. I just kept it as `x` to avoid adding extra `del`s and to match the style of the rest of the function.
Maintaining multiple names here will cause both tensors to be refcounted, which increases the peak memory. This manifests as more blocks stacked on top of each other in the memory profile.

This change will increase the number of available KV-cache blocks reported by profiling, especially with longer context lengths. I will follow up with a more detailed investigation in another PR/issue that discusses this in more depth. However, I'm creating this PR now since it is a well-contained, low-risk change. We can apply it to more models once this is reviewed.
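To make the refcounting effect concrete, here is a small standalone sketch (not vLLM code; the sizes are arbitrary and it assumes a CUDA device is available) that measures the peak-allocation difference between holding two names and reusing one:

```python
import torch

N = 4096  # each fp32 N x N tensor is ~64 MiB

def two_names(x):
    y = torch.relu(x * 2)   # input `x` is still referenced here
    z = torch.relu(y * 2)   # `x`, `y`, and the temporary are all alive
    return z

def one_name(x):
    x = torch.relu(x * 2)   # the previous `x` is freed on rebind
    x = torch.relu(x * 2)
    return x

def measure(fn):
    torch.cuda.reset_peak_memory_stats()
    result = fn(torch.randn(N, N, device="cuda"))
    peak_mib = torch.cuda.max_memory_allocated() / 2**20
    print(f"{fn.__name__}: peak {peak_mib:.0f} MiB")
    del result

measure(two_names)
measure(one_name)  # reports a lower peak than two_names
```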