[Feature] vLLM ARM Enablement for AARCH64 CPUs #9228
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
@youkaichao @ChipKerchner @mgoin @pathorn @manojnkumar I have added a new feature that allows vLLM to run on the ARM CPU backend. I have tested it on AWS Graviton 3E. Please have a look at this PR.
Nice work @sanketkaleoss! This seems reasonable to me as a base implementation. It seems only fp32 compute is supported, is that right?
It would be good to update the CPU installation documentation with how to build, and also to add a new Dockerfile.arm.
cmake/cpu_extension.cmake (outdated)
    message(STATUS "ARMv8 architecture detected")
    list(APPEND CXX_COMPILE_FLAGS
        "-mcpu=native"
        "-march=armv8.6-a"
Why was this specific arch chosen?
armv8.6-a is the first architecture version that supports the advanced SIMD instructions, bf16, and SVE. That's why I chose this specific arch.
This is fine for now, but for instance I think Apple's M1 CPU uses ARMv8.4-A so we should consider supporting older versions
I see, noted.
According to wikichip, Graviton3 is also ARMv8.4A -- is the Graviton3E different? I haven't found any documentation on it
@tlrmchlsmth You're right, Graviton3 is ARMv8.4; the code needs 8.6 only for the bf16 dependencies. But if I just use "-mcpu=native" or "-march=armv8.4-a+bf16", the code works fine on Graviton3 instances. What do you suggest here?
If -march=armv8.4-a+bf16 works, then I suggest using that for this PR.
In general, I think it's best to go with a fairly minimal ISA version and then explicitly specify the ISA extensions we want to use, especially because ARM has many extensions that are optional for several ISA versions before they become mandatory.
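That suggestion could be sketched in cpu_extension.cmake roughly as below. This is a sketch only, not the merged code; the ARM_BF16_SUPPORTED variable name is made up here for illustration:

```cmake
# Sketch: pin a minimal base ISA and opt into only the extensions we need,
# falling back to plain ARMv8.4-A when the compiler rejects +bf16.
include(CheckCXXCompilerFlag)

check_cxx_compiler_flag("-march=armv8.4-a+bf16" ARM_BF16_SUPPORTED)
if (ARM_BF16_SUPPORTED)
    list(APPEND CXX_COMPILE_FLAGS "-march=armv8.4-a+bf16")
else()
    list(APPEND CXX_COMPILE_FLAGS "-march=armv8.4-a")
endif()
```

Probing the flag rather than hard-coding it keeps the build working on toolchains that predate the bf16 extension.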
BTW, the M1 Mac (and I think M2) does not support BF16 NEON so unfortunately this wouldn't help there.
Hi @mgoin, @tlrmchlsmth, I've added the Dockerfile and documentation and changed the ARM arch to native. Please have a look.
Thanks @mgoin for the review. It supports fp32 and bf16 as of now. Sure, I'll work on updating the documentation and adding a Dockerfile too.
This is great, I will hand it over to @mgoin for detailed review. It would be better if we could support macOS M chips as well, since they are also ARM chips.
Thanks @youkaichao. I'll add that to future work.
csrc/cpu/cpu_types_arm.hpp (outdated)
    namespace vec_op {

    // FIXME: FP16 is not fully supported in Torch-CPU
@sanketkaleoss do you have plans to support FP16 in a future PR? I see that it's partially implemented.
Do you know what the problem with FP16 in Torch-CPU is?
I'm not sure what the problem is there. I can add the support in a future PR if Torch-CPU supports FP16 in the future.
Hello, I followed your Dockerfile.arm and encountered an error during the build process.
cmake/cpu_extension.cmake (outdated)
    elseif (POWER9_FOUND OR POWER10_FOUND)
        message(STATUS "PowerPC detected")
        # Check for PowerPC VSX support
        list(APPEND CXX_COMPILE_FLAGS
            "-mvsx"
            "-mcpu=native"
            "-mtune=native")

    elseif (ASIMD_FOUND)
        message(STATUS "ARMv8 architecture detected")
Since it could be ARMv9:
-    message(STATUS "ARMv8 architecture detected")
+    message(STATUS "ARMv8 or later architecture detected")
or maybe explicitly call out NEON instead
Okay, sure
cmake/cpu_extension.cmake (outdated)
    "-mcpu=native"
    "-mtune=native"
Native is OK for local development; however, the binaries won't be portable at all.
I actually thought you were on the right track before this change. We really ought to explicitly specify a fairly minimal base ARM architecture (maybe ARMv8.4) and then explicitly specify the set of extensions that we need to build with (BF16, and maybe FP16 and DotProd?).
I see, I'll try setting armv8.4 as the base implementation. Then, depending on the flags, it would run bf16 or not.
Could you try using "-march=armv8.4-a+bf16+dotprod+fp16"?
Just for reference: on arm64, -mcpu= specifies both the appropriate architecture and tuning, and it's generally better to use it rather than -march if you're building for a specific CPU. You can find more details here if you are running any tests on Graviton: https://github.com/aws/aws-graviton-getting-started/blob/main/c-c++.md
> Could you try using "-march=armv8.4-a+bf16+dotprod+fp16"?

Yes, I tried it and it works. It even works with "-march=armv8.2-a+bf16+dotprod+fp16", making it compatible with Graviton2 as suggested by @ddynwzh1992.
This pull request has merge conflicts that must be resolved before it can be merged.
@mgoin @tlrmchlsmth I have created separate paths for the FP32 and BF16 datatypes. This code now works on any ARM machine from
Force-pushed from 9589161 to 3d46d2e
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 47adfa6 to 4330083
Hi @mgoin @tlrmchlsmth @ddynwzh1992, I've implemented the suggested changes and now the ARM CPU backend path supports the FP32, FP16 and BF16 datatypes. Support for Mac devices has been added. I've tested on
Force-pushed from 9fb51cb to 73726e1
Maybe I missed something, but how will it choose the FP32 path if bf16 is not supported? Thank you.
Earlier it would give a compilation error even if we just wanted to use FP32 and not BF16, as there were no separate paths for FP32 and BF16, as in the x86 backend case. Now it will compile successfully even if the system doesn't have the BF16 extension, and the user can select the FP32 dtype in that case.
@mgoin @tlrmchlsmth @WoosukKwon @youkaichao @DarkLight1337 Your attention is required on this PR. Thanks.
I thought you meant BF16 is supported without its hardware feature. Edited: BF16 has been supported from the Apple M2 onwards.
    if (remainder > 0) {
        float16x8_t temp = reg.val[full_blocks];
        for (int i = 0; i < remainder; ++i) {
            reinterpret_cast<__fp16*>(ptr)[full_blocks * 8 + i] = vgetq_lane_f16(temp, i);
Hello, are you sure this line can compile? The lane index must be a compile-time constant, no?
Thanks for the review. It compiles on my system as of now.
Thanks for the reply. I found that with gcc 12.3 (Ubuntu), optimization level O1 and above does not require a constant index (while O0 does); with clang 15.0 (macOS), I haven't been able to compile it.
> Thanks for the reply. I found that with gcc v12.3 (Ubuntu), optimization O1 and above does not require const (while O0 does), with clang v15.0 (macOS), I haven't been able to compile it.
I see, thanks for the information.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Sanket Kale <[email protected]>
…ture to native Signed-off-by: Sanket Kale <[email protected]>
Force-pushed from 73726e1 to f1bf96a
Thanks for the detailed review @mgoin! I have rebased and implemented the suggested changes. Please proceed with the CI checks.
Sorry for losing track, merged with main to see if it's green.
No worries, looking forward to the future work on this PR.
Signed-off-by: Sanket Kale <[email protected]> Co-authored-by: Sanket Kale <[email protected]> Co-authored-by: mgoin <[email protected]> Signed-off-by: Andrew Feldman <[email protected]>
@sanketkaleoss Hi, thank you for the PR. Does additional work need to be done to support
@animalnots Did you figure that out? There's an issue related to it: #11154
Signed-off-by: Sanket Kale <[email protected]> Co-authored-by: Sanket Kale <[email protected]> Co-authored-by: mgoin <[email protected]>
I ended up using satvikahuja/Easy-qwen2vlm2b-4macbook#1. vLLM never worked and instead would output "!!!!!!!!!..."
Hi, sorry for the late reply. Does the model work with the x86 CPU path?
Description
This PR enables support of vLLM for AARCH64 architecture. Motivated by the requirements from (#176, #5741, etc), I implemented this PR which enables the ARM path for CPU inference.
ARM Compatibility: Modified the build scripts and configuration files to ensure compatibility with ARM processors. The backend currently supports the float32, fp16 and bfloat16 datatypes.
Motivation
Enabling vLLM on ARM architecture broadens its usability, allowing it to run on a wider range of devices, including those with ARM processors. This enhancement is crucial for expanding the reach and applicability of vLLM in various use cases.
Checklist
Modifications
Performance Results
Model : facebook/opt-125m
Datatype : float32