[CI/Build] Dockerfile build for ARM64 / GH200 #10499
Signed-off-by: drikster80 <[email protected]>
Missed a sign-off on one commit, so I rebased and force-pushed to pass the DCO check.
Noticed a bug where the flashinfer x86_64 wheel was not being installed by default. Since installing it was the default behavior on non-arm64 systems before, I updated the conditional so that it always applies unless the target platform is specified as 'linux/arm64'.
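A minimal shell sketch of the conditional described above, assuming the Dockerfile compares $TARGETPLATFORM against the literal string linux/arm64 and nothing else. The helper name flashinfer_step is hypothetical and only prints which branch a value would take. Note that an empty value also falls through to the x86_64 wheel, which is what makes the install apply by default.

```shell
#!/bin/sh
# Hypothetical helper: report which flashinfer branch a TARGETPLATFORM value selects.
# Mirrors "always install the x86_64 wheel unless the target is linux/arm64".
flashinfer_step() {
    if [ "$1" = "linux/arm64" ]; then
        echo "skip-x86_64-wheel"
    else
        # Any other value, including an unset/empty one, keeps the old default.
        echo "install-x86_64-wheel"
    fi
}

for platform in "linux/amd64" "linux/arm64" ""; do
    echo "'$platform' -> $(flashinfer_step "$platform")"
done
```

Inverting the test (skip only on an exact arm64 match) keeps x86_64 builds unchanged even when no platform is passed.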
if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
apt-get update && apt-get install zlib1g-dev && \
python3 -m pip install packaging pybind11 && \
git clone https://github.com/openai/triton && \
Can we directly use PyTorch nightly as the base image so that we don't need to build Triton, etc.?
I'm confused. Triton doesn't provide aarch64 whl files, so we'll always need to compile it if we want to use the latest version: https://pypi.org/project/triton/#files
It's probably a good idea to pin to the latest release tag of Triton instead of main, though. I'll update that.
My goal was to keep this as close as possible to the x86_64 implementation of vLLM, so I didn't want to use the NVIDIA PyTorch container. That's what I was doing in the previous repo; although it worked, it doubled the size of the final image (9.74 GB vs. 4.89 GB).
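Pinning Triton to a release tag instead of main could look like the hypothetical sketch below. TRITON_REF is an assumed build argument (the PR does not name a tag), and triton_clone_cmd only prints the clone command rather than running it. It relies on git clone --branch accepting a tag name as well as a branch name.

```shell
#!/bin/sh
# Hypothetical sketch: build the Triton checkout command from a pinned ref.
# TRITON_REF is an assumed build argument; the PR does not specify the tag.
triton_clone_cmd() {
    ref="$1"
    # Refuse an empty ref or the moving "main" branch, per the pinning suggestion.
    if [ -z "$ref" ] || [ "$ref" = "main" ]; then
        echo "error: set TRITON_REF to a Triton release tag, not main" >&2
        return 1
    fi
    # --branch also accepts a tag name; --depth 1 keeps the clone small.
    echo "git clone --depth 1 --branch $ref https://github.com/openai/triton"
}

TRITON_REF="${TRITON_REF:-}"
if cmd=$(triton_clone_cmd "$TRITON_REF"); then
    echo "$cmd"
fi
```

Failing loudly on main makes an unpinned build visible instead of silently tracking whatever main is at build time.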
Dockerfile
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=.git,target=.git \
if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
pip --verbose wheel --use-pep517 --no-deps -w /workspace/dist --no-build-isolation git+https://github.com/vllm-project/flash-attention.git ; \
the vllm build already includes vllm-flash-attention
I believe the torch version should be unpinned in CMakeLists.txt, setup.py, and pyproject.toml.
the vllm build already includes vllm-flash-attention
Ah, good point. I'll remove that and test.
@drikster80 Overall it makes sense to me, but we don't need to build so many things in the Docker image. Just using the default should be fine; it already comes with the flash-attention backend. We don't need to build flashinfer / bitsandbytes / triton.
@@ -0,0 +1,3 @@
--index-url https://download.pytorch.org/whl/nightly/cu124
torchvision; platform_machine == 'aarch64'
torch; platform_machine == 'aarch64'
You can add xformers for aarch64 to the /vllm-project directory, similar to flash-attention, for the aarch64 build until the upstream pip package is available.
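The requirements excerpt above gates torch and torchvision with the PEP 508 marker platform_machine == 'aarch64'. pip evaluates platform_machine from Python's platform.machine(), which on Linux reports the same value as uname -m (aarch64 on GH200). The sketch below is a quick, hypothetical check of which side of that marker a given build container lands on; matches_aarch64_marker is an illustrative name, not part of pip.

```shell
#!/bin/sh
# Print the value pip compares against in "platform_machine == 'aarch64'".
# PEP 508 defines platform_machine as Python's platform.machine().
machine=$(python3 -c 'import platform; print(platform.machine())')

# Hypothetical helper: does a machine string satisfy the aarch64-only marker?
matches_aarch64_marker() {
    [ "$1" = "aarch64" ]
}

if matches_aarch64_marker "$machine"; then
    echo "$machine: aarch64-only requirements apply"
else
    echo "$machine: aarch64-only requirements are skipped"
fi
```

Running this inside the build container, rather than on the host, shows what pip will actually see when it reads the requirements file.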
None of these have an aarch64 whl. When you say "use the default", are these all built into vLLM as well? When I try to run the container without building these, it fails.
The goal here is to have a runnable image for vLLM on arm64 / GH200. We don't need full features here. Since the community is not fully ready for arm64, it would be a maintenance disaster if we built so many things here ourselves. If a library does not support arm64, people should reach out to that library and ask for it to become arm64-compatible. That's why I want to use the PyTorch nightly Docker image directly. Docker image size is not my concern.
This is not my goal. The first step is that we can run
Okay, it sounds like our goals just weren't aligned. I agree this approach could become a maintainability issue. FWIW, it looks like the other libraries do support ARM64 but don't provide a whl for it on PyPI (probably due to GitHub Actions limitations). I'll create tickets on the other repos requesting that an aarch64 whl be built and provided. I had originally moved away from the nvidia-pytorch container because it was slower to update torch than vLLM was. It looks like they just released a version compatible with torch v2.6, so I can try that version. In the meantime, I'll continue maintaining the fork and hosting a full-featured image on my Docker Hub that matches vLLM releases.
We don't need the nvidia-pytorch container. A basic NVIDIA container is good enough, and we can just install the nightly PyTorch wheels.
Thanks for your efforts! For this PR, let's get the basic support in first 👍
[CI/Build] Dockerfile build for ARM64 / GH200 vllm-project#10499 by cenzhiyao
Closing, as #11212 has been merged. @drikster80 thanks for your efforts! Please continue to maintain your branch with the full-fledged features.
Updates the Dockerfile with $TARGETPLATFORM conditionals that will compile the necessary modules and extensions for aarch64 / ARM64 systems. This has been tested on the Nvidia GH200 platform.
Docker builds should use --platform "linux/arm64" to trigger the arm64 build process.
FIX #2021
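A minimal sketch for choosing the --platform value from the host architecture, so the same script can be used on a GH200 and on an x86_64 host. The helper docker_platform_for and the image tag vllm-local are hypothetical, and the script only prints the docker build command instead of running it. With BuildKit, the requested platform reaches the Dockerfile as TARGETPLATFORM, which a build stage must declare with ARG TARGETPLATFORM before its RUN commands can read it.

```shell
#!/bin/sh
# Hypothetical helper: map `uname -m` output to a Docker platform string.
docker_platform_for() {
    case "$1" in
        aarch64|arm64) echo "linux/arm64" ;;
        x86_64|amd64)  echo "linux/amd64" ;;
        *) echo "unsupported machine: $1" >&2; return 1 ;;
    esac
}

if platform=$(docker_platform_for "$(uname -m)"); then
    # Only prints the command; the image tag is a placeholder.
    echo docker build --platform "$platform" -t vllm-local .
fi
```

Deriving the flag from the host avoids accidentally building an amd64 image under emulation on an arm64 machine, or the reverse.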
Changes Overview:
requirements-cuda-arm64.txt, which uses the PyTorch nightly modules that are compatible with ARM64+CUDA. This is temporary until they are moved to a stable release (at which time this file can be removed).
platform_machine != 'aarch64'.
The following command was used to build, and the result was confirmed working on NVIDIA GH200:
NOTE: The order of installing requirements-cuda-arm64.txt is important, since it needs to overwrite the currently installed torch version that other modules depend on.