Large model (vLLM engine inference): GPU memory keeps growing over long runtimes and never comes back down #2639

Open
1 of 3 tasks
turndown opened this issue Dec 9, 2024 · 2 comments

turndown commented Dec 9, 2024

System Info

OS: openEuler 20.03 (LTS-SP3)
CUDA V12.5.40
Python 3.11.9 (conda virtual environment)
transformers 4.46.3
vllm 0.6.4.post1

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

Release: v0.15.3+1.g26b5097

The command used to start Xinference

HF_ENDPOINT=https://hf-mirror.com XINFERENCE_HOME=/data/inference/.xinference GRADIO_DEFAULT_CONCURRENCY_LIMIT=10 XINFERENCE_MODEL_SRC=modelscope nohup xinference-local --host 0.0.0.0 --port 30002 --log-level debug > output.log 2>&1 &

Reproduction

The model was launched as shown below, with no extra parameters:
[screenshot: launch configuration]
Right after startup, the model used roughly 32 GB each on GPU 2 and GPU 3:
[screenshot]
But after running for a few days and being called through Dify, memory usage gradually climbed from 32 GB to 37 GB and is now at 39 GB, with no sign of coming back down:
[screenshot]
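For reference, a minimal monitoring sketch that logs the usage of GPU 2 and GPU 3 over time, so the growth can be tracked without watching nvidia-smi by hand (assumes the NVML Python bindings are installed, e.g. `pip install nvidia-ml-py`; the 60-second interval is arbitrary):

```python
# Minimal sketch: log used memory on GPUs 2 and 3 once a minute.
# Assumption: the NVML Python bindings ("pynvml" module) are available.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in (2, 3)]

try:
    while True:
        used_gib = [pynvml.nvmlDeviceGetMemoryInfo(h).used / 1024**3 for h in handles]
        print(time.strftime("%F %T"), [f"{u:.1f} GiB" for u in used_gib], flush=True)
        time.sleep(60)
finally:
    pynvml.nvmlShutdown()
```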

A few questions:
1. After the model is loaded, is the initial allocation just a baseline, with usage then fluctuating as concurrency increases, or will it keep rising indefinitely?
2. Can the range of this growth be capped (see the sketch below)? If there is no limit, will it eventually overflow and crash, and how can the memory be brought back down?
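On question 2, a hedged sketch only (not verified against this exact Xinference release): vLLM exposes a `gpu_memory_utilization` engine argument that caps the share of each GPU the engine may reserve, and the Xinference client passes extra keyword arguments through to the engine when a model is launched. The model name below is a placeholder:

```python
# Sketch under the assumption that extra kwargs are forwarded to the vLLM engine.
from xinference.client import Client

client = Client("http://127.0.0.1:30002")   # matches --port 30002 in the launch command above

model_uid = client.launch_model(
    model_name="my-llm",           # placeholder: substitute the actual model name
    model_engine="vllm",
    n_gpu=2,                       # GPUs 2 and 3 in the screenshots
    gpu_memory_utilization=0.85,   # vLLM arg: upper bound on the share of GPU memory used
    max_model_len=8192,            # a smaller context length also shrinks the KV-cache pool
)
print(model_uid)
```

Lowering `gpu_memory_utilization` limits what vLLM reserves up front; whether it also bounds the slow growth observed here would still need to be confirmed.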

Expected behavior

Once the call volume drops, GPU memory should return to its normal level.

@XprobeBot XprobeBot added the gpu label Dec 9, 2024
@XprobeBot XprobeBot added this to the v1.x milestone Dec 9, 2024
@turndown (Author)

help, has no one else run into a problem like this? 0.0

@948024326

> help, has no one else run into a problem like this? 0.0

Is there any way to solve this?
