
[BUG] CoreMessageReceiver.returnMesagesLoopDaemon is being executed too often causing higher CPU consumption #14550

Closed
3 tasks done
ryan-murphy-sc opened this issue Aug 28, 2020 · 12 comments
Labels
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • in progress
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.
  • Service Attention: Workflow: This issue is responsible by Azure service team.
  • Service Bus
  • Track 1

Comments

@ryan-murphy-sc
Copy link
Contributor

Hi. This is a cross-post from Azure/azure-service-bus-java#405 since I don't know if anyone is looking at those, and we are having the issue with service bus 3.x, which is in this repo.

Describe the bug
CoreMessageReceiver.returnMesagesLoopDaemon is scheduled to run every millisecond even when there are neither prefetched messages nor pending receivers, causing unnecessarily high CPU usage.

We migrated our clients from a JMS/QPID approach to this SDK and our server is now idling at about 50% CPU without any traffic.

Exception or Stack Trace
N/A

To Reproduce
Although the 50% CPU involves many Message Receivers across multiple applications on the server, higher-than-expected CPU usage can be seen with just a handful of Message Receivers.

The Message Receivers are kept open and polled for messages periodically, due to some paradigm issues with the message pump approach.

Code Snippet
ClientFactory.createMessageReceiverFromEntityPath(getMessagingFactory(), entityPath, receiveMode);
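For a bit more context, here is a minimal self-contained sketch of the receive pattern we use. The connection string, entity path, and polling cadence are placeholders, and the getMessagingFactory() call in the snippet above is assumed to wrap MessagingFactory.createFromConnectionString.

import java.time.Duration;

import com.microsoft.azure.servicebus.ClientFactory;
import com.microsoft.azure.servicebus.IMessage;
import com.microsoft.azure.servicebus.IMessageReceiver;
import com.microsoft.azure.servicebus.ReceiveMode;
import com.microsoft.azure.servicebus.primitives.MessagingFactory;

public class IdleReceiverRepro {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and entity path.
        MessagingFactory factory =
                MessagingFactory.createFromConnectionString("<connection-string>");
        IMessageReceiver receiver = ClientFactory.createMessageReceiverFromEntityPath(
                factory, "<queue-or-subscription-path>", ReceiveMode.PEEKLOCK);

        // The receiver stays open and is polled periodically; most polls return nothing
        // because the entity is idle most of the time.
        while (true) {
            IMessage message = receiver.receive(Duration.ofSeconds(5));
            if (message != null) {
                receiver.complete(message.getLockToken());
            }
            Thread.sleep(60_000); // poll roughly once a minute
        }
    }
}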

Expected behavior
Expected behaviour is to have lower idle CPU usage when there are no messages being received.

Setup (please complete the following information):

  • OS: RHEL 7.8, macOS 10.13, and Windows 10
  • IDE : N/A
  • Version of the Library used: azure-servicebus:3.4.0

Additional context

Information Checklist
Kindly make sure that you have added all of the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added
@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Aug 28, 2020
@joshfree joshfree added Client This issue points to a problem in the data-plane of the library. Service Bus Track 1 labels Aug 28, 2020
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Aug 28, 2020
@joshfree
Copy link
Member

Hi @ryan-murphy-sc someone from the Service Bus team will follow up with you shortly on this v3 issue.

/cc @yvgopal @hemanttanwar @conniey @YijunXieMS

@ramya-rao-a ramya-rao-a added the Service Attention Workflow: This issue is responsible by Azure service team. label Oct 15, 2020
@ghost
Copy link

ghost commented Oct 15, 2020

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jfggdl, @axisc.

@mrtnmch
Copy link

mrtnmch commented Nov 26, 2020

Hello, may I ask what the status of this issue is? We're affected by the bug as well. In the screenshot I'm attaching, there are two identical builds of one of our services - the second one has Service Bus disabled via configuration (i.e. no listeners are created; the library itself is still included and active).

When the listeners are created, there is essentially busy waiting in CoreMessageReceiver#returnMesagesLoopDaemon, which keeps the CPU constantly under load. This can easily be verified by enabling trace logging in the library - there's a log message roughly every millisecond (or every #RETURN_MESSAGES_DAEMON_WAKE_UP_INTERVAL, to be more precise).
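If anyone wants to reproduce the observation quickly: assuming Logback is the SLF4J backend, raising the receiver's logger to TRACE is enough to see the per-wakeup lines. The logger name below is my assumption based on the package layout.

import org.slf4j.LoggerFactory;

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;

public final class EnableReceiverTrace {
    public static void enable() {
        // Assumed logger name: CoreMessageReceiver lives in the primitives package.
        Logger receiverLogger = (Logger) LoggerFactory.getLogger(
                "com.microsoft.azure.servicebus.primitives.CoreMessageReceiver");
        receiverLogger.setLevel(Level.TRACE);
        // With an idle receiver open, expect a trace line roughly every millisecond
        // from the return-messages daemon loop.
    }
}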

Thank you.

[screenshot: CPU usage comparison of the two builds - listeners enabled vs. Service Bus disabled]

CC @axisc @shankarsama @jfggdl

@yvgopal
Copy link
Member

yvgopal commented Dec 2, 2020

Let me reproduce it and see how much CPU it is taking. When I ran it myself some time back, I didn't see any significant additional CPU usage. Of course, significant means different things to different people. In both cases, how many receivers were there in your application when you captured CPU? Trace logging is not the best way to see whether it is taking a lot of CPU. Also, it only comes into play for receivers that receive very few or no messages.

@ryan-murphy-sc
Copy link
Contributor Author

We have about 100 receivers throughout our app server. A number of them receive very few messages (1/hour or even 1/day). We're trying to consolidate to reduce the overhead caused by idle receivers, but doing so poses some design challenges, so it's going slowly. Thanks.

@yvgopal
Copy link
Member

yvgopal commented Dec 2, 2020

I have just tested it on my desktop. I created a Java program that runs 1000 receivers receiving no messages and nothing else; CPU usage is between 3% and 9%, averaging 6%. I don't really understand the 50% claim made with just 100 receivers. That 50% likely includes many other things.

I don't think it is consuming too much CPU. You are free to reproduce this with a standalone app that just creates some receivers and to check the CPU usage consumed by the app.

Another point I want to make is that the daemon doesn't technically run every 1 millisecond. It just sleeps for 1 millisecond after the daemon is run once, before running it again. You can guess how many CPU cycles it consumes.

In my test run on my desktop, when I changed the sleep duration from 1 millisecond to 10 milliseconds with the same 1000 receivers, CPU usage dropped to an average of 3%. I can make that change. But I am not sure this is the cause of your high CPU issues.
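To make that concrete, the loop is shaped roughly like this (an illustrative sketch only, not the actual SDK code); each idle receiver re-arms a near-no-op task on a shared scheduler:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DaemonLoopSketch {
    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();
    private static final long WAKE_UP_INTERVAL_MS = 1; // the 1 ms sleep under discussion

    static void startReturnMessagesLoop(Runnable returnMessagesIfAny) {
        SCHEDULER.schedule(() -> {
            returnMessagesIfAny.run();                    // usually a no-op on an idle receiver
            startReturnMessagesLoop(returnMessagesIfAny); // sleep 1 ms, then run again
        }, WAKE_UP_INTERVAL_MS, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        // Simulate 1000 idle receivers, each with its own loop on the shared scheduler:
        // the scheduler wakes up ~1000 times per millisecond doing almost no useful work.
        for (int i = 0; i < 1000; i++) {
            startReturnMessagesLoop(() -> { /* no prefetched messages, nothing to return */ });
        }
    }
}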

@ryan-murphy-sc
Copy link
Contributor Author

The 50% included other workers as well, not only ASB receiver threads, although they were a big portion of that number. This was running with the JVM pinned to a single core of an Azure A2 processor, so we're pretty resource constrained. Your workstation probably has more horsepower.
But anyway, if my performance claims are too vague and unscientific, then maybe focus on the much better testing done by @mxmxcz. I'll also see if I can get more conclusive results.

@yvgopal
Copy link
Member

yvgopal commented Dec 3, 2020

@ryan-murphy-sc @mxmxcz The reason I am saying it might not be taking much CPU is that your high CPU issue may not be resolved even if I changed the daemon sleep time from 1 millisecond to 100 milliseconds.

@mrtnmch
Copy link

mrtnmch commented Dec 3, 2020

@yvgopal Thank you for your answers. In the original screenshot I posted, the app was connected to just 1 subscription (with 1 listener and 1 instance). AFAIK there were no (or close to no) messages at the time, which is fairly common in our use case.

I created a quick PoC as you suggested: https://github.com/mxmxcz/servicebus-test. It's a bare Java app containing just the Service Bus library in its latest version, not even Spring. To amplify the effect on my local (much more powerful) machine, it creates several listeners on the topic's subscription. On my laptop it averages about 6% - but bear in mind that this is a high-end machine with an i7-9750H CPU (12 cores @ 2.60 GHz); our pods in k8s are far less powerful, for obvious reasons. I also tried running the app as several (10?) instances, each with 1 listener, which had a similar effect on some of them (those ran at ~8%, the rest at ~2%); I didn't find a way to reproduce that scenario reliably, though. There were no messages throughout the whole test; I was also the only one connected to the subscription.

[screenshot: CPU usage of the PoC application]

I absolutely agree with you on

Another point I want to make is that the daemon doesn't technically run every 1 millisecond. It just sleeps for 1 millisecond after the daemon is run once, before running it again. You can guess how many CPU cycles it consumes.

but when the daemon does nothing (which is the case when there are no messages, I suppose), the actual work takes much less than 1 millisecond (x << 1 ms), so it effectively runs every 1 ms + x. I completely understand that log messages are by no means a reliable way to debug the problem, but they help to uncover what's going on.

I don't understand the library's approach to threading, but is it really necessary to keep polling every x milliseconds, instead of using some blocking mechanism and waking the thread only when needed?
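For illustration, something along these lines is what I have in mind (just a rough sketch from the outside; the names and structure here are my assumptions, not the SDK's internals). It only does work when a message lands or a receive call is registered, and it never parks a thread per receiver:

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

public class WakeOnDemandSketch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final ConcurrentLinkedQueue<String> prefetchedMessages = new ConcurrentLinkedQueue<>();
    private final ConcurrentLinkedQueue<Consumer<String>> pendingReceives = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean drainScheduled = new AtomicBoolean(false);

    // Called whenever a message arrives or a receive call is registered.
    private void signal() {
        if (drainScheduled.compareAndSet(false, true)) {
            scheduler.execute(this::drain); // nothing runs and no thread blocks while idle
        }
    }

    // Pairs up prefetched messages with pending receive calls; runs only when signalled.
    private void drain() {
        drainScheduled.set(false);
        while (!prefetchedMessages.isEmpty() && !pendingReceives.isEmpty()) {
            pendingReceives.poll().accept(prefetchedMessages.poll());
        }
    }

    public void onMessagePrefetched(String message) {
        prefetchedMessages.add(message);
        signal();
    }

    public void onReceiveCallRegistered(Consumer<String> receiveCompletion) {
        pendingReceives.add(receiveCompletion);
        signal();
    }
}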

@yvgopal
Copy link
Member

yvgopal commented Dec 3, 2020

@mxmxcz Thanks for testing it. I put that 1 millisecond sleep in only after running it for some time to assess the CPU impact. The reason I took this daemon approach is to solve a producer-consumer problem without having to block a thread. A blocking queue would solve it for me, but it always blocks a thread; imagine 1000 receivers, each with one thread blocked in it. I considered multiple approaches and decided a daemon was better. Maybe there is another way I can solve it. As long as the CPU impact is not bad and the code is stable, I thought it was OK to have a daemon.
I can change it to 10 milliseconds in the next release, which will further reduce the CPU usage. But it may increase the maximum receive latency to 10 milliseconds instead of the current 1 millisecond. That shouldn't make much of a difference.

Having said that, I am not treating it as a bug that needs fixing now. It will be low priority for us and I will consider increasing the sleep time.

@yvgopal yvgopal closed this as completed Dec 3, 2020
@ryan-murphy-sc
Copy link
Contributor Author

@yvgopal will there be a separate GitHub issue for the enhancement, or can we continue to use this one and change the title to indicate it's an enhancement? It's an important issue for us and I want to keep some visibility on it.

@ryan-murphy-sc
Copy link
Contributor Author

@yvgopal also out of curiosity, is this daemon loop even relevant for Message Receivers that don't use prefetch and don't use MessageAndSessionPump? Our receivers rarely use prefetch and never use MessageAndSessionPump. So this entire daemon seems unnecessary.

Also I do feel this was closed prematurely. It's a very real issue impacting Azure Service Bus customers.

openapi-sdkautomation bot pushed a commit to AzureSDKAutomation/azure-sdk-for-java that referenced this issue May 26, 2021
Remove x-ms-flatten setting for resource properties (Azure#14550)

* Disabled flatten setting for properties

* added forceDelete parameter for orchestrator resource

* remove forcedelete

* config

Co-authored-by: msyyc <[email protected]>
@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023