Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix] Guard for negative counter metrics to prevent crash #10430

Merged

Conversation

tjohnson31415
Copy link
Contributor

@tjohnson31415 tjohnson31415 commented Nov 18, 2024

I'm not sure how it happens, but we have observed crashes when running vLLM in online model due to a negative value being sent to increment a Prometheus counter:

ERROR 11-15 03:59:17 engine.py:165] ValueError('Error in model execution: Counters can only be incremented by non-negative amounts.')
ERROR 11-15 03:59:17 engine.py:165] Traceback (most recent call last):
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 11-15 03:59:17 engine.py:165]     return func(*args, **kwargs)
ERROR 11-15 03:59:17 engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/model_runner.py", line 1697, in execute_model
ERROR 11-15 03:59:17 engine.py:165]     model_input.async_callback()
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/utils.py", line 1150, in weak_bound
ERROR 11-15 03:59:17 engine.py:165]     unbound(inst, *args, **kwargs)
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/llm_engine.py", line 1243, in _process_model_outputs
ERROR 11-15 03:59:17 engine.py:165]     self.do_log_stats(scheduler_outputs, outputs, finished_before,
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/llm_engine.py", line 1581, in do_log_stats
ERROR 11-15 03:59:17 engine.py:165]     logger.log(stats)
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/metrics.py", line 519, in log
ERROR 11-15 03:59:17 engine.py:165]     self._log_prometheus(stats)
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/metrics.py", line 478, in _log_prometheus
ERROR 11-15 03:59:17 engine.py:165]     self._log_counter(self.metrics.counter_generation_tokens,
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/metrics.py", line 429, in _log_counter
ERROR 11-15 03:59:17 engine.py:165]     counter.labels(**self.labels).inc(data)
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/prometheus_client/metrics.py", line 313, in inc
ERROR 11-15 03:59:17 engine.py:165]     raise ValueError('Counters can only be incremented by non-negative amounts.')
ERROR 11-15 03:59:17 engine.py:165] ValueError: Counters can only be incremented by non-negative amounts.

This PR adds a check on the value of the counter before calling the prometheus client to avoid the crash, but the root cause of the negative value needs more investigation.

FIX #6642

#6325 is related and shows the same error.

Copy link

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@tjohnson31415 tjohnson31415 force-pushed the negative-counter-metric branch from 4102ffd to 338740a Compare November 18, 2024 18:52
@prashantgupta24
Copy link
Contributor

prashantgupta24 commented Nov 18, 2024

Not sure if it's worth adding a test to test_cancellation within tests/async_engine/test_async_llm_engine.py (or other cancellation tests) to make sure metric readings are recorded correctly in case of cancellations?

@tjohnson31415 tjohnson31415 marked this pull request as ready for review November 18, 2024 22:31
@DarkLight1337
Copy link
Member

Let's fix #6325 in another PR.

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) November 19, 2024 03:28
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 19, 2024
@DarkLight1337 DarkLight1337 merged commit 272e31c into vllm-project:main Nov 19, 2024
62 of 64 checks passed
mikejuliet13 pushed a commit to mikejuliet13/vllm that referenced this pull request Nov 19, 2024
coolkp pushed a commit to coolkp/vllm that referenced this pull request Nov 20, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024
rickyyx pushed a commit to rickyyx/vllm that referenced this pull request Nov 20, 2024
tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024
@tjohnson31415 tjohnson31415 deleted the negative-counter-metric branch November 25, 2024 16:19
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: error Counters can only be incremented by non-negative amounts. in metrics module
3 participants