From 9f4ccec76135083c96d15fbbade5eda7a2321bf1 Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Tue, 16 Jul 2024 09:45:30 -0700
Subject: [PATCH] [doc][misc] remind to cancel debugging environment variables (#6481)

[doc][misc] remind users to cancel debugging environment variables after debugging (#6481)
---
 docs/source/getting_started/debugging.rst | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/source/getting_started/debugging.rst b/docs/source/getting_started/debugging.rst
index 0d03fe93adc61..2aa52e79888a3 100644
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -19,9 +19,6 @@ If you have already taken care of the above issues, but the vLLM instance still
 - Set the environment variable ``export NCCL_DEBUG=TRACE`` to turn on more logging for NCCL.
 - Set the environment variable ``export VLLM_TRACE_FUNCTION=1``. All the function calls in vLLM will be recorded. Inspect these log files, and tell which function crashes or hangs.
 
-  .. warning::
-    vLLM function tracing will generate a lot of logs and slow down the system. Only use it for debugging purposes.
-
 With more logging, hopefully you can find the root cause of the issue.
 
 If it crashes, and the error trace shows somewhere around ``self.graph.replay()`` in ``vllm/worker/model_runner.py``, it is a cuda error inside cudagraph. To know the particular cuda operation that causes the error, you can add ``--enforce-eager`` to the command line, or ``enforce_eager=True`` to the ``LLM`` class, to disable the cudagraph optimization. This way, you can locate the exact cuda operation that causes the error.
@@ -67,3 +64,7 @@ Here are some common issues that can cause hangs:
 If the script runs successfully, you should see the message ``sanity check is successful!``.
 
 If the problem persists, feel free to `open an issue on GitHub `_, with a detailed description of the issue, your environment, and the logs.
+
+.. warning::
+
+    After you find the root cause and solve the issue, remember to turn off all the debugging environment variables defined above, or simply start a new shell to avoid being affected by the debugging settings. If you don't do this, the system might be slow because many debugging functionalities are turned on.
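
Note on the new warning: as a minimal sketch of what "turn off all the debugging environment variables" could look like in practice, the snippet below sets and then clears the two variables visible in this patch's context lines (``NCCL_DEBUG`` and ``VLLM_TRACE_FUNCTION``). It assumes a POSIX-style shell and that the variables were set with ``export`` in the current session. Any other debugging variables from the full document would need to be added to the ``unset`` line. The final check and its messages are illustrative only and are not part of vLLM.

```shell
# Debugging variables named in the patch context (set while investigating).
export NCCL_DEBUG=TRACE
export VLLM_TRACE_FUNCTION=1

# After the issue is solved: remove them from the current shell session.
unset NCCL_DEBUG VLLM_TRACE_FUNCTION

# Illustrative check: report whether either variable is still exported.
if env | grep -qE '^(NCCL_DEBUG|VLLM_TRACE_FUNCTION)='; then
  echo "debug variables still set"
else
  echo "debug variables cleared"
fi
```

Starting a new shell, as the warning suggests, has the same effect only if the variables were not added to a startup file such as a shell profile.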