Pull consumer with max bytes setting causes high CPU usage [v2.10.18] #1718

Open
atombender opened this issue Sep 13, 2024 · 13 comments
Labels: defect Suspected defect such as a bug or regression

@atombender

Observed behavior

In production, we noticed that NATS would periodically spike in CPU usage despite no sign of increased message volume or any other metric that seemed relevant.

[Screenshot 2024-09-13: CPU usage graph showing the periodic spikes]

We were able to narrow it down to setting jetstream.PullMaxBytes() with the pull consumer. If a message entered the stream that exceeded this size, the client would get a 409 Message Size Exceeds MaxBytes error from the server and apparently retry. Removing the max bytes limit fixed our issue.

Also note that the error does not bubble up to the consumer's error handler callback. We were using jetstream.ConsumeErrHandler() to log errors. However, after a while this callback is called with the error nats: no heartbeat received.
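
For reference, a minimal sketch of the setup described above, assuming a stream "my-stream" with a durable consumer "my-consumer" (both names and the 1000-byte limit are placeholders, not from the report):

package main

import (
    "context"
    "log"
    "time"

    "github.com/nats-io/nats.go"
    "github.com/nats-io/nats.go/jetstream"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    js, err := jetstream.New(nc)
    if err != nil {
        log.Fatal(err)
    }

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    cons, err := js.Consumer(ctx, "my-stream", "my-consumer")
    if err != nil {
        log.Fatal(err)
    }

    // A low PullMaxBytes: any message larger than this triggers the server's
    // 409 "Message Size Exceeds MaxBytes" response described above.
    cc, err := cons.Consume(func(msg jetstream.Msg) {
        msg.Ack()
    },
        jetstream.PullMaxBytes(1000),
        // Per the report, the 409 never reaches this handler; only the
        // eventual "nats: no heartbeat received" error shows up here.
        jetstream.ConsumeErrHandler(func(_ jetstream.ConsumeContext, err error) {
            log.Printf("consume error: %v", err)
        }),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer cc.Stop()

    select {} // block so Consume keeps running
}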

Expected behavior

NATS should not use this much CPU.

Server and client version

  • NATS 2.10.18
  • nats.go v1.34.1

Host environment

Linux, Kubernetes.

Steps to reproduce

Full reproduction here.

  1. Start a consumer with jetstream.PullMaxBytes() passed to consumer.Consume() using a low maximum size.
  2. Publish a message exceeding that size (see the sketch below).
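
A hedged sketch of step 2, assuming the js handle and imports from the sketch under "Observed behavior"; the subject name and sizes are illustrative:

// Publish a message larger than the consumer's PullMaxBytes limit.
func publishOversized(ctx context.Context, js jetstream.JetStream) error {
    payload := make([]byte, 10_000) // well above the 1000-byte PullMaxBytes used above
    _, err := js.Publish(ctx, "my-subject", payload)
    return err
}
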
@atombender atombender added the defect Suspected defect such as a bug or regression label Sep 13, 2024
@wallyqs wallyqs changed the title Pull consumer with max bytes setting causes high CPU usage Pull consumer with max bytes setting causes high CPU usage [v2.10.18] Sep 13, 2024
@atombender
Author

Also reproduced with NATS 2.10.19 and nats.go 1.37.0.

@atombender
Author

Also to add: it might be a client issue, but I opened it here because it's unclear which side is at fault.

@wallyqs
Member

wallyqs commented Sep 13, 2024

To clarify: is the high CPU on the app or on the nats-server?

@atombender
Author

Both. What appears to be happening is that the pull is retried repeatedly, which drives CPU usage up on both the client and the server.

@ripienaar
Contributor

With what frequency? Do you have any delay between retries? Maybe a backoff?

@atombender
Author

This is a single call to consumer.Consume(), so I am not in control of the loop, NATS is. You may want to look at the repro code. 🙂

@wallyqs
Member

wallyqs commented Sep 13, 2024

Thanks for sharing the repro @atombender, we'll take a look.

@MauriceVanVeen
Member

Thanks for the repro, could hear from my laptop's fan speed that it's working 😅

Seems the server is spammed with MSG.NEXT requests that immediately fail because of the MaxBytes setting, so it's essentially looping forever:

[#464089] Received on "$JS.API.CONSUMER.MSG.NEXT.my-stream.my-consumer" with reply "_INBOX.GCjLIXdluJQzEFU7AYNgDy"
{"expires":300000000000,"batch":1000000,"max_bytes":1000,"idle_heartbeat":30000000000}


[#464090] Received on "_INBOX.GCjLIXdluJQzEFU7AYNgDy"
Nats-Pending-Messages: 1000000
Nats-Pending-Bytes: 1000
Status: 409
Description: Message Size Exceeds MaxBytes

nil body

The client should at least log the error being hit, and should probably wait between retries.

But this condition may never clear, since the offending message just stays in the stream, leaving the consumer stalled. What should we do in this case / what is the intended behaviour?
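
As a stopgap, an application that needs control over retry pacing could drive explicit Fetch calls with its own backoff instead of Consume. A rough sketch follows (not from the thread); whether the 409 surfaces through the batch's Error() is an assumption, not verified against the client internals:

// Rough workaround sketch: pace pull requests manually with a backoff.
func fetchWithBackoff(cons jetstream.Consumer) {
    for {
        batch, err := cons.Fetch(100, jetstream.FetchMaxBytes(1000))
        if err != nil {
            log.Printf("fetch error: %v", err)
            time.Sleep(time.Second)
            continue
        }
        for msg := range batch.Messages() {
            msg.Ack()
        }
        // Assumption: a terminal status such as the 409 shows up here.
        if err := batch.Error(); err != nil {
            log.Printf("batch error: %v", err)
            time.Sleep(time.Second) // back off instead of hammering the server
        }
    }
}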

@MauriceVanVeen
Member

Possibly the server could also protect itself more by not immediately sending the error back, which would also slow down the client.

@atombender
Author

The documentation is not clear about whether the max bytes limit is a hard or soft limit. From the observed behavior it is apparently a hard limit, one that prevents the consumer from ever consuming the next message.

In other words, it's possible to write a consumer that just stops being able to consume (until the offending message is deleted/expired). That condition should at least be detectable at the consumer level.

It's debatable whether the client should retry, given that in a JetStream context it can't expect the next call to work — it's going to be stuck retrying until the blocking message is gone.

@jnmoyne
Contributor

jnmoyne commented Sep 20, 2024

IMHO this should be moved to a nats.go issue rather than nats-server, since the problem is with the Consume() client-side code retrying to get the message.

How it should behave instead is open for discussion, but I would say it should not retry and should instead signal the client app (though for Consume() it is not obvious how to do that) or log the error. In my opinion, when a message in the stream is larger than the NEXT request's max bytes, retrying will do nothing but fail again (at least until the message in question is removed from the stream). The issue is with the client application not specifying a high enough max bytes (or, if the message really is 'too large', with the admin not setting a max message size in the stream's config).
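
For the admin-side guard mentioned here, a sketch of capping the stream's message size at creation time, so oversized messages are rejected at publish time instead of stalling pull consumers (names and limits are illustrative, and imports are as in the earlier sketch):

func createCappedStream(ctx context.Context, js jetstream.JetStream) error {
    _, err := js.CreateStream(ctx, jetstream.StreamConfig{
        Name:       "my-stream",
        Subjects:   []string{"my-subject"},
        MaxMsgSize: 1000, // keep at or below the consumers' PullMaxBytes
    })
    return err
}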

@atombender
Author

Agreed.

@Jarema
Member

Jarema commented Sep 23, 2024

We're having some ongoing discussions about how to improve the max_bytes behaviour more generally when a single message exceeds the batch config, but we agree that this specific issue is client-related. Moving it to nats.go.

@Jarema Jarema transferred this issue from nats-io/nats-server Sep 23, 2024