
[Feature][Hardware][AMD] Enable level 3 compilation on rocm #10836

Open
charlifu wants to merge 5 commits into main from enable_amd_torch_compile

Conversation

@charlifu (Contributor) commented Dec 2, 2024

This PR fixes the fusion pass not being enabled on ROCm by:

  • adding fp8 dtype selection to the fusion pass, since ROCm uses torch.float8_e4m3fnuz
  • using a tensor slice operation to replace the torch.narrow op, which creates extra ops in the IR generated by torch.compile and prevents the rms+fp8_quant fusion ([torch.compile] Fuse RMSNorm with quant #9138) from working; see the sketch below
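A minimal sketch of the dtype selection in the first bullet, using a hypothetical helper name (the real change lives in vLLM's fusion pass):

import torch

def fusion_fp8_dtype() -> torch.dtype:
    # Hypothetical helper: pick the fp8 dtype the fusion pass should match.
    # ROCm builds of PyTorch report torch.version.hip and use the fnuz fp8 variant.
    if torch.version.hip is not None:
        return torch.float8_e4m3fnuz
    return torch.float8_e4m3fn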


github-actions bot commented Dec 2, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@charlifu force-pushed the enable_amd_torch_compile branch from 33e7055 to f441c65 on December 2, 2024 20:59
@mgoin (Member) left a comment


Could you consider adding AMD to this fusion test case?

@pytest.mark.skipif(envs.VLLM_TARGET_DEVICE != "cuda",
                    reason="Only test on CUDA")
def test_fusion_rmsnorm_quant(dtype, hidden_size, num_tokens, eps):
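One possible way to relax that skip condition so the test also runs on ROCm, shown as a sketch (it assumes the same pytest and envs imports as the existing test) rather than the final test code:

@pytest.mark.skipif(envs.VLLM_TARGET_DEVICE not in ("cuda", "rocm"),
                    reason="Only test on CUDA and ROCm")
def test_fusion_rmsnorm_quant(dtype, hidden_size, num_tokens, eps):
    ...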

@charlifu (Contributor, Author) commented Dec 3, 2024

@ProExpertProg @mgoin Some updates:

I was trying to enable the unit test of the fusion pass on ROCm. I found that with num_tokens < 17, we are still seeing the extra ops even with the slice operations, which fails the test. I think this is because we pad the input when num_tokens < 17, and this adds extra slice_scatter and slice ops to the generated IR.
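A rough, hypothetical sketch of the padding being described (not the actual vLLM code), showing where the extra slice_scatter comes from:

import torch

def pad_to_min_tokens(x: torch.Tensor, min_tokens: int = 17) -> torch.Tensor:
    # Pad the token dimension up to a minimum size (assumed threshold of 17).
    num_tokens = x.shape[0]
    if num_tokens >= min_tokens:
        return x
    padded = x.new_zeros(min_tokens, *x.shape[1:])
    padded[:num_tokens] = x  # under torch.compile this shows up as slice_scatter
    return padded            # callers later slice the result back to num_tokens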

May I ask if it is OK to disable the padding by default? Link

@ProExpertProg (Contributor)

I am looking into this and found the same issue.

There's also another problem, unrelated to fusion. When we compile with dynamic shapes, the max expression is not part of the trace, and the graph contains only the branch of the max that was taken. A dimension that's marked dynamic still has an underlying value, and that value is erroneously used in the max to pick the larger operand, even though it shouldn't be known at trace time.

That means if we compile with a dynamic num_tokens that's larger than 17, the graph will always use s0 (the dynamic dimension in place of num_tokens) for the size of the tensor, even if the actual value of s0 is less than 17. The other case is worse: if we originally compile with s0 < 17, the size will always be 17, even if s0 > 17 during execution.
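A minimal sketch of the kind of pattern being discussed (illustrative names, not the vLLM code):

import torch

def padded_rows(x: torch.Tensor) -> torch.Tensor:
    # The concern above: with a dynamic first dim, the compiled graph may
    # specialize on whichever side of this max the example input selected.
    n = max(x.shape[0], 17)
    return x.new_zeros(n, x.shape[1])

compiled = torch.compile(padded_rows, dynamic=True)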

@mgoin I'd advocate for removing the padding in the short term - how often do we deal with num_tokens < 17? In the long term, we can fix the tracing and the fusion for the padded case

@ProExpertProg (Contributor)

Minor correction: the max does get traced properly into a torch.sym_max(s0, 17), but for some reason it isn't used, so it gets optimized out during the autograd phase.

@mgoin (Member) commented Dec 3, 2024

Could we simply remove the padding in the non-CUDA case? It is only there because of poor scaled_mm performance on CUDA.
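A sketch of what gating the padding on the platform could look like, reusing the hypothetical pad_to_min_tokens helper from the sketch above:

import torch

def maybe_pad(x: torch.Tensor, min_tokens: int = 17) -> torch.Tensor:
    # Pad only on CUDA, where small-batch scaled_mm performance is the concern;
    # skip padding on ROCm so the rms+fp8_quant fusion pattern stays intact.
    if torch.version.hip is not None:
        return x
    return pad_to_min_tokens(x, min_tokens)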

@ProExpertProg (Contributor)

OK, if we want to keep the padding, I think I have a solution for the fusion (but not for dynamic-shape compilation). I can implement it tomorrow.

@ProExpertProg (Contributor)

For what it's worth, torch.narrow gets lowered to a slice operation anyway, so at least in the torch.compile regime there should be no difference from slicing.
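A quick check of that equivalence:

import torch

x = torch.randn(32, 8)
a = torch.narrow(x, 0, 0, 16)  # keep the first 16 rows along dim 0
b = x[:16]                     # plain slice of the same rows
assert torch.equal(a, b)       # both are views over the same data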
