From 3fa51174cf0c50250b4b6067cad1eec4b17bb2f8 Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Fri, 20 Dec 2024 10:12:36 -0800
Subject: [PATCH] add doc explanation

Signed-off-by: youkaichao
---
 docs/source/getting_started/debugging.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/getting_started/debugging.rst b/docs/source/getting_started/debugging.rst
index 7f36d65a227f0..b123960533816 100644
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -200,3 +200,4 @@ try this instead:
 Known Issues
 ----------------------------------------
 - In ``v0.5.2``, ``v0.5.3``, and ``v0.5.3.post1``, there is a bug caused by `zmq `_ , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of ``vllm`` to include the `fix `_.
+- To circumvent a NCCL `bug `__ , all vLLM processes will set an environment variable ``NCCL_CUMEM_ENABLE=0`` to disable NCCL's ``cuMem`` allocator. It does not affect performance but only gives memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable, otherwise, inconsistent environment setup will cause NCCL to hang or crash, as observed in `the RLHF integration `__ and the `discussion `__ .
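
An external process that wants to join a NCCL group with vLLM's processes might apply the added note roughly as follows. This is a minimal sketch, not vLLM code: it assumes the variable has to be in the environment before the process creates its first NCCL communicator, and the helper name ``nccl_cumem_disabled`` is hypothetical.

```python
import os

# Assumption: this runs at the very top of the external process, before any
# library creates a NCCL communicator, so NCCL sees the same setting that
# vLLM's own processes use.
os.environ["NCCL_CUMEM_ENABLE"] = "0"


def nccl_cumem_disabled() -> bool:
    """Hypothetical helper: report whether this process matches vLLM's setting."""
    return os.environ.get("NCCL_CUMEM_ENABLE") == "0"


if __name__ == "__main__":
    # Sanity check to run before setting up the NCCL connection.
    print("NCCL_CUMEM_ENABLE matches vLLM:", nccl_cumem_disabled())
```

Exporting the variable in the launching shell has the same effect. Setting it in code simply keeps the requirement next to the connection setup.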