[BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order #11528

dsikka · 2024-12-26T19:40:25Z

Summary

Fix parameter names to ensure weights load properly
Fix `process_after_weight_loading` when running with group act order
Don't shard the w2 weight_scales when running with actorder and tp>1
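
The third point can be illustrated with a minimal sketch. This is not the code from this PR: the helper name `w2_scale_groups` and its arguments are hypothetical, and it assumes grouped quantization with an intermediate dimension that divides evenly by both the TP size and the group size. The assumed rationale is that with act order, rows in a rank's shard may map to any quantization group, so each rank keeps the scales for the full intermediate dimension rather than only its own slice.

```python
def w2_scale_groups(intermediate_size: int, tp_size: int,
                    group_size: int, actorder: bool) -> int:
    """Count the w2 scale groups one tensor-parallel rank holds.

    Hypothetical helper for illustration only; not taken from the PR.
    """
    if group_size <= 0:
        raise ValueError("this sketch only covers grouped quantization")
    if intermediate_size % tp_size != 0:
        raise ValueError("intermediate_size must divide evenly by tp_size")

    per_partition = intermediate_size // tp_size
    # With act order, keep scales for the full (unsharded) intermediate dim;
    # otherwise each rank only needs the scales covering its own shard.
    rows = intermediate_size if actorder else per_partition
    if rows % group_size != 0:
        raise ValueError("scale rows must divide evenly by group_size")
    return rows // group_size
```

With tp=1 the two cases coincide, which is consistent with the problem only showing up for tp>1.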

Testing

Tested with tp>=1 on Mixtral and DeepSeek WNA16 models

Next Steps:

GPTQMarlin MoE method likely needs to be updated on its sharding condition for actorder

github-actions · 2024-12-26T19:40:39Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

eldarkurtic · 2024-12-26T22:02:57Z

I can confirm it works with tp>1

dsikka · 2024-12-27T02:18:24Z

@mgoin
@robertgshaw2-neuralmagic
@ElizaWszola


mergify · 2025-01-07T16:42:12Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @dsikka.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


ElizaWszola · 2025-01-17T06:16:35Z

vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py

        # Will transpose the loaded weight along the
        # intermediate and hidden dim sizes. Will
        # shard for TP along the transposed dims
+        intermediate_full = extra_weight_attrs.pop("intermediate_full")


nit: maybe rename intermediate_size -> intermediate_size_per_partition and intermediate_full -> intermediate_size?
This would make the names consistent with other quant configs, e.g. vllm/model_executor/layers/quantization/gptq.py
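
The naming convention suggested here can be sketched as follows. The `IntermediateSizes` type and the `intermediate_sizes` helper are hypothetical and exist only to show how the two names relate; they assume the intermediate dimension is split evenly across tensor-parallel ranks.

```python
from typing import NamedTuple


class IntermediateSizes(NamedTuple):
    # Full, unsharded intermediate dimension of the MoE layer.
    intermediate_size: int
    # Slice of that dimension owned by one tensor-parallel rank.
    intermediate_size_per_partition: int


def intermediate_sizes(intermediate_size: int, tp_size: int) -> IntermediateSizes:
    """Derive the per-rank size from the full size (illustrative helper)."""
    if intermediate_size % tp_size != 0:
        raise ValueError("intermediate_size must divide evenly by tp_size")
    return IntermediateSizes(intermediate_size, intermediate_size // tp_size)
```

Keeping the unsuffixed name for the full dimension and the `_per_partition` suffix for the sharded one makes it explicit at each call site which of the two a tensor shape is built from.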


dsikka marked this pull request as ready for review December 27, 2024 02:17

robertgshaw2-redhat reviewed Dec 27, 2024


mergify bot added needs-rebase and removed needs-rebase labels Jan 7, 2025

dsikka force-pushed the fix_w4a16moe_actorder branch from c7a912e to 4c71143 Compare January 10, 2025 02:34

dsikka requested a review from robertgshaw2-redhat January 10, 2025 02:35

mgoin reviewed Jan 10, 2025


ElizaWszola reviewed Jan 13, 2025


dsikka requested review from ElizaWszola and mgoin January 16, 2025 21:35

mgoin added the ready label Jan 16, 2025

ElizaWszola reviewed Jan 17, 2025


ElizaWszola reviewed Jan 17, 2025


dsikka and others added 7 commits January 17, 2025 17:55

fix process after weight loading for actorder and parameter names

866b057

update

5c731cd

Don't partition w2 when we use group quantization

ddfac98

Signed-off-by: ElizaWszola <[email protected]>

fix condition for is_k_full; clean-up

82665f7

Update vllm/model_executor/layers/quantization/compressed_tensors/com…

7ce3de7

…pressed_tensors_moe.py Co-authored-by: Michael Goin <[email protected]>

clean up params; add assertion, update k_full condition

e5864f8

fix typos; remove comment

c2bce52

dsikka force-pushed the fix_w4a16moe_actorder branch from a68004b to c2bce52 Compare January 17, 2025 17:56
