[Model] DeepSeek-V3 Enhancements #11539

Open
10 tasks
simon-mo opened this issue Dec 27, 2024 · 13 comments
Labels
new model (Requests to new models), performance (Performance-related issues)

Comments

@simon-mo
Collaborator

simon-mo commented Dec 27, 2024

This issue tracks follow-up enhancements after initial support for the DeepSeek-V3 model. Please feel free to chime in and contribute!

@july8023

If I want to deploy DeepSeek 600B using vLLM on RTX 4090s, are there any restrictions? What is the minimum number of RTX 4090s I need?

@fsaudm

fsaudm commented Dec 31, 2024

Is inference with A100s supported? How about quantization?

@mphilippnv

DeepSeek-V3 doesn't appear to support pipeline parallelism. I get this error when attempting to deploy to two 8x H100 nodes:

NotImplementedError: Pipeline parallelism is only supported for the following  architectures: ['AquilaForCausalLM', 'AquilaModel', 'DeepseekV2ForCausalLM', 'GPT2LMHeadModel', 'InternLM2ForCausalLM', 'InternLMForCausalLM', 'InternVLChatModel', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'NemotronForCausalLM', 'Phi3ForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration'].

I'm using --tensor-parallel-size 8 --pipeline-parallel-size 2

@simon-mo
Collaborator Author

@july8023 It should work on 4090s. The model generally takes about 600GB of memory, and you will want about 100-300GB for KV cache, so feel free to plan around that.
@fsaudm A100s are not supported because this model requires FP8 tensor cores.
@mphilippnv Which version of vLLM are you using? You might need to update to v0.6.6 or higher.
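
For the RTX 4090 sizing question, the memory figures above can be turned into a rough lower bound on GPU count. This is a minimal sketch, not a vLLM utility: the function name `min_gpus` is hypothetical, the 24GB per-card figure for the RTX 4090 is an assumption supplied here, and the estimate ignores activation memory, framework overhead, and any constraint that the GPU count must fit a valid tensor-parallel layout.

```python
import math


def min_gpus(weights_gb: float, kv_cache_gb: float, gpu_memory_gb: float,
             usable_fraction: float = 1.0) -> int:
    """Lower bound on GPU count from memory alone.

    usable_fraction models the share of each card that is actually
    available to the model (hypothetical knob, defaults to all of it).
    """
    usable_per_gpu = gpu_memory_gb * usable_fraction
    return math.ceil((weights_gb + kv_cache_gb) / usable_per_gpu)


# ~600GB of weights plus 100-300GB of KV cache (figures from this thread),
# on cards assumed to have 24GB each.
low = min_gpus(600, 100, 24)
high = min_gpus(600, 300, 24)
print(f"Roughly {low} to {high} cards from memory alone")
```

The real requirement is likely higher, since no card can dedicate all of its memory to weights and cache.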

@fsaudm

fsaudm commented Dec 31, 2024

@simon-mo Right, A100s don't support fp8. Would the arg --dtype bfloat16 suffice? If not, I found a bf16 version on Hugging Face. Any insights on whether that would work?

@simon-mo
Collaborator Author

The model currently does not support --dtype bfloat16 because it is natively trained in fp8. Can you point me to the bf16 version?

@fsaudm

fsaudm commented Dec 31, 2024

@simon-mo On HF: https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main

The official repo provides a script to cast fp8 to bf16, but of course you can't run it on A100s... my guess is a good soul did it and uploaded it to HF. In the repo, see section 6:

https://github.com/deepseek-ai/DeepSeek-V3

@simon-mo
Collaborator Author

vLLM does support this bf16 model on A100. It looks like the config.json properly removed quantization_config, so it should already work.
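
A quick way to check a downloaded checkpoint for this is to look for the `quantization_config` key in its config.json. The sketch below is illustrative only: `is_prequantized` is a hypothetical helper, and the two JSON snippets are made-up minimal stand-ins, not copies of the real DeepSeek-V3 config files.

```python
import json


def is_prequantized(config: dict) -> bool:
    """Return True if the config declares a quantization_config block."""
    return config.get("quantization_config") is not None


# Hypothetical, heavily trimmed configs for illustration only.
fp8_like = json.loads(
    '{"model_type": "example", "quantization_config": {"quant_method": "fp8"}}'
)
bf16_like = json.loads('{"model_type": "example", "torch_dtype": "bfloat16"}')

print(is_prequantized(fp8_like))
print(is_prequantized(bf16_like))
```

For a real checkpoint, load the file with `json.load(open("config.json"))` from the model directory and pass the resulting dict to the helper.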

@mphilippnv

mphilippnv commented Dec 31, 2024

> @july8023 It should work on 4090s. The model generally takes about 600GB of memory, and you will want about 100-300GB for KV cache, so feel free to plan around that.
> @fsaudm A100s are not supported because this model requires FP8 tensor cores.
> @mphilippnv Which version of vLLM are you using? You might need to update to v0.6.6 or higher.

Using v0.6.6

EDIT: Apologies, I was using 0.6.2. Redeploying helm chart with 0.6.6.post1. Will see how it goes.
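
Since the mix-up here was between 0.6.2 and 0.6.6, a small version check before deploying can catch it early. This is a sketch under assumptions: `version_tuple` and `meets_minimum` are hypothetical helpers, and only the leading `X.Y.Z` part is compared, so suffixes such as `.post1` (or `.dev` builds) are ignored.

```python
import re


def version_tuple(version: str) -> tuple:
    """Parse the leading numeric 'X.Y.Z' part, ignoring suffixes like '.post1'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(installed: str, minimum: str = "0.6.6") -> bool:
    """True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


for candidate in ("0.6.2", "0.6.6.post1"):
    print(candidate, meets_minimum(candidate))
```

Inside the deployment image, the installed version string can be read with `importlib.metadata.version("vllm")` from the standard library and passed to `meets_minimum`.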

@fsaudm

fsaudm commented Dec 31, 2024

Does anyone know of a working example of serving DeepSeek-V3 on A100s with vLLM? I'll try later, but any hints or help would be very much appreciated.

@JamesBVMNetwork

Hi everyone,
I’m encountering the following error when trying to run the image vllm/vllm-openai:v0.6.6.post1 on a node equipped with 8x H100 SXM GPUs:

ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250102-072212.pkl): functional_call got multiple values for keys ['mlp.experts.e_score_correction_bias', 'mlp.gate.e_score_correction_bias'], which are tied. Consider using tie_weights=False
2025-01-02T15:22:12.753719474Z 

Here’s the command I used:

--model deepseek-ai/DeepSeek-V3-Base \
--tensor-parallel-size 8 \
--disable_log_requests \
--uvicorn_log_level error \
--max-model-len 16384 \
--cpu-offload-gb 400 \
--max_num_seqs 1 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--enforce-eager

Does anyone have suggestions or solutions for resolving this issue?

Thanks in advance!

@glowwormX


I've had this problem, too. Is there a solution?

@ishaandatta

> I've had this problem, too. Is there a solution?

I was getting this error too. It was resolved by removing CPU offloading... hoping for an explanation.

Also, any suggestions for increasing token throughput and context length?
We're stuck at 6 tokens/second and a max context length of 10k despite having 1600GB of VRAM.
I am currently running with tensor + pipeline parallelism on 5 nodes (4x A100 80GB each). The VMs do not have InfiniBand.

Would InfiniBand (i.e. higher inter-node bandwidth and lower latency) be the main way to increase token throughput? And for a context length > 40k, how much more VRAM would be required?
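
As a starting point for the VRAM question, KV cache size grows linearly with context length and can be estimated from the attention shape. The sketch below uses the standard formula for multi-head or grouped-query attention; the function names are hypothetical and the layer, head, and dimension values are placeholders, not DeepSeek-V3's configuration. DeepSeek-V3 uses Multi-head Latent Attention, so the real footprint depends on how the serving engine stores its cache.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token for standard attention.

    The factor 2 accounts for one key and one value vector
    per layer per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(context_tokens: int, num_seqs: int, bytes_per_token: int) -> float:
    """Total KV cache in GiB for num_seqs sequences of context_tokens each."""
    return context_tokens * num_seqs * bytes_per_token / 2**30


# Placeholder shape (NOT DeepSeek-V3's): 60 layers, 8 KV heads, head_dim 128, bf16.
per_token = kv_bytes_per_token(num_layers=60, num_kv_heads=8, head_dim=128)
for context in (10_000, 40_000):
    print(context, round(kv_cache_gib(context, 1, per_token), 2), "GiB")
```

Under this formula, going from 10k to 40k context multiplies the per-sequence cache by four, so the free memory left after loading weights is the figure to compare against.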


7 participants