Pull consumer with max bytes setting causes high CPU usage [v2.10.18] #1718

Open
atombender opened this issue Sep 13, 2024 · 13 comments
Labels: defect Suspected defect such as a bug or regression

@atombender

Observed behavior

In production, we noticed that NATS would periodically spike in CPU usage despite no sign of increased message volume or any other metric that seemed relevant.

[Screenshot 2024-09-13: CPU usage graph showing the periodic spikes]

We were able to narrow it down to setting jetstream.PullMaxBytes() with the pull consumer. If a message entered the stream that exceeded this size, the client would get a 409 Message Size Exceeds MaxBytes error from the server and apparently retry. Removing the max bytes limit fixed our issue.

Also note that the error does not bubble up to the consumer's error handler callback. We were using jetstream.ConsumeErrHandler() to log errors. However, after a while this callback is called with the error nats: no heartbeat received.
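
For reference, a minimal sketch of the setup described above, assuming a stream "my-stream" with a durable consumer "my-consumer" (both names and the 1000-byte limit are placeholders, not from the report):

package main

import (
    "context"
    "log"
    "time"

    "github.com/nats-io/nats.go"
    "github.com/nats-io/nats.go/jetstream"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    js, err := jetstream.New(nc)
    if err != nil {
        log.Fatal(err)
    }

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    cons, err := js.Consumer(ctx, "my-stream", "my-consumer")
    if err != nil {
        log.Fatal(err)
    }

    // A low PullMaxBytes: any message larger than this triggers the server's
    // 409 "Message Size Exceeds MaxBytes" response described above.
    cc, err := cons.Consume(func(msg jetstream.Msg) {
        msg.Ack()
    },
        jetstream.PullMaxBytes(1000),
        // Per the report, the 409 never reaches this handler; only the
        // eventual "nats: no heartbeat received" error shows up here.
        jetstream.ConsumeErrHandler(func(_ jetstream.ConsumeContext, err error) {
            log.Printf("consume error: %v", err)
        }),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer cc.Stop()

    select {} // block so Consume keeps running
}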

Expected behavior

NATS should not use this much CPU.

Server and client version

  • NATS 2.10.18
  • nats.go v1.34.1

Host environment

Linux, Kubernetes.

Steps to reproduce

Full reproduction here.

  1. Start a consumer with jetstream.PullMaxBytes() passed to consumer.Consume() using a low maximum size.
  2. Publish a message exceeding that size (see the sketch below).
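
A hedged sketch of step 2, assuming the js handle and imports from the sketch under "Observed behavior"; the subject name and sizes are illustrative:

// Publish a message larger than the consumer's PullMaxBytes limit.
func publishOversized(ctx context.Context, js jetstream.JetStream) error {
    payload := make([]byte, 10_000) // well above the 1000-byte PullMaxBytes used above
    _, err := js.Publish(ctx, "my-subject", payload)
    return err
}
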
@atombender atombender added the defect Suspected defect such as a bug or regression label Sep 13, 2024
@wallyqs wallyqs changed the title Pull consumer with max bytes setting causes high CPU usage Pull consumer with max bytes setting causes high CPU usage [v2.10.18] Sep 13, 2024
@atombender
Author

Also reproduced with NATS 2.10.19 and nats.go 1.37.0.

@atombender
Author

Also to add: it might be a client issue, but I opened it here because it's unclear which side is at fault.

@wallyqs
Member

wallyqs commented Sep 13, 2024

To clarify: is the high CPU on the app or on the nats-server?

@atombender
Author

Both. What appears to be happening is that the pull is retried repeatedly, which drives CPU usage up on both the client and the server.

@ripienaar
Contributor

With what frequency? Do you have any delay between retries? Maybe a backoff?

@atombender
Author

This is a single call to consumer.Consume(), so I am not in control of the loop, NATS is. You may want to look at the repro code. 🙂

@wallyqs
Member

wallyqs commented Sep 13, 2024

Thanks for sharing the repro @atombender, we'll take a look.

@MauriceVanVeen
Member

Thanks for the repro, could hear from my laptop's fan speed that it's working 😅

Seems the server is spammed with MSG.NEXT requests that immediately fail because of the MaxBytes setting, so it's essentially looping forever:

[#464089] Received on "$JS.API.CONSUMER.MSG.NEXT.my-stream.my-consumer" with reply "_INBOX.GCjLIXdluJQzEFU7AYNgDy"
{"expires":300000000000,"batch":1000000,"max_bytes":1000,"idle_heartbeat":30000000000}


[#464090] Received on "_INBOX.GCjLIXdluJQzEFU7AYNgDy"
Nats-Pending-Messages: 1000000
Nats-Pending-Bytes: 1000
Status: 409
Description: Message Size Exceeds MaxBytes

nil body

The client should at least log the error being hit, and should probably wait between retries.

But this condition may never clear, since the offending message just stays in the stream, leaving the consumer stalled. What should we do in this case / what is the intended behaviour?
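
As a stopgap, an application that needs control over retry pacing could drive explicit Fetch calls with its own backoff instead of Consume. A rough sketch follows (not from the thread); whether the 409 surfaces through the batch's Error() is an assumption, not verified against the client internals:

// Rough workaround sketch: pace pull requests manually with a backoff.
func fetchWithBackoff(cons jetstream.Consumer) {
    for {
        batch, err := cons.Fetch(100, jetstream.FetchMaxBytes(1000))
        if err != nil {
            log.Printf("fetch error: %v", err)
            time.Sleep(time.Second)
            continue
        }
        for msg := range batch.Messages() {
            msg.Ack()
        }
        // Assumption: a terminal status such as the 409 shows up here.
        if err := batch.Error(); err != nil {
            log.Printf("batch error: %v", err)
            time.Sleep(time.Second) // back off instead of hammering the server
        }
    }
}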

@MauriceVanVeen
Member

Possibly the server could also protect itself more by not immediately sending the error back, which would also slow down the client.

@atombender
Author

The documentation is not clear about whether the max bytes limit is a hard or soft limit. From the observed behavior it is apparently a hard limit, one that prevents the consumer from ever consuming the next message.

In other words, it's possible to write a consumer that just stops being able to consume (until the offending message is deleted/expired). That condition should at least be detectable at the consumer level.

It's debatable whether the client should retry, given that in a JetStream context it can't expect the next call to work — it's going to be stuck retrying until the blocking message is gone.

@jnmoyne
Contributor

jnmoyne commented Sep 20, 2024

IMHO this should be moved to a nats.go issue rather than nats-server, since the problem is with the Consume() client-side code retrying to get the message.

How it should behave instead is open for discussion, but I would say it should not retry and should instead signal the client app (though for Consume() it is not obvious how to do that) or log the error. In my opinion, when a message in the stream is larger than the NEXT request's max bytes, retrying will do nothing but fail again (at least until the message in question is removed from the stream). The issue is with the client application not specifying a high enough max bytes (or, if the message really is 'too large', with the admin not setting a max message size in the stream's config).
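
For the admin-side guard mentioned here, a sketch of capping the stream's message size at creation time, so oversized messages are rejected at publish time instead of stalling pull consumers (names and limits are illustrative, and imports are as in the earlier sketch):

func createCappedStream(ctx context.Context, js jetstream.JetStream) error {
    _, err := js.CreateStream(ctx, jetstream.StreamConfig{
        Name:       "my-stream",
        Subjects:   []string{"my-subject"},
        MaxMsgSize: 1000, // keep at or below the consumers' PullMaxBytes
    })
    return err
}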

@atombender
Author

Agreed.

@Jarema
Member

Jarema commented Sep 23, 2024

We're having some ongoing discussions about how to improve the max_bytes behaviour more generally when a single message exceeds the batch config, but we agree that this specific issue is client-related. Moving it to nats.go.

@Jarema Jarema transferred this issue from nats-io/nats-server Sep 23, 2024