
[BUG] CoreMessageReceiver.returnMesagesLoopDaemon is being executed too often causing higher CPU consumption #14550

Closed
3 tasks done
ryan-murphy-sc opened this issue Aug 28, 2020 · 12 comments
Labels
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • in progress
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.
  • Service Attention: Workflow: This issue is responsible by Azure service team.
  • Service Bus
  • Track 1

Comments

@ryan-murphy-sc
Copy link
Contributor

Hi. This is a cross-post from Azure/azure-service-bus-java#405 since I don't know if anyone is looking at those, and we are having the issue with service bus 3.x, which is in this repo.

Describe the bug
CoreMessageReceiver.returnMesagesLoopDaemon is scheduled to run every millisecond even when there are neither prefetched messages nor pending receivers, causing unnecessarily high CPU usage.

We migrated our clients from a JMS/QPID approach to this SDK and our server is now idling at about 50% CPU without any traffic.

Exception or Stack Trace
N/A

To Reproduce
Although the 50% CPU involves many Message Receivers across multiple applications on the server, higher-than-expected CPU usage can be seen with just a handful of Message Receivers.

The Message Receivers are kept open and polled for messages periodically, due to some paradigm issues with the message pump approach.

Code Snippet
ClientFactory.createMessageReceiverFromEntityPath(getMessagingFactory(), entityPath, receiveMode);
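For a bit more context, here is a minimal self-contained sketch of the receive pattern we use. The connection string, entity path, and polling cadence are placeholders, and the getMessagingFactory() call in the snippet above is assumed to wrap MessagingFactory.createFromConnectionString.

import java.time.Duration;

import com.microsoft.azure.servicebus.ClientFactory;
import com.microsoft.azure.servicebus.IMessage;
import com.microsoft.azure.servicebus.IMessageReceiver;
import com.microsoft.azure.servicebus.ReceiveMode;
import com.microsoft.azure.servicebus.primitives.MessagingFactory;

public class IdleReceiverRepro {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and entity path.
        MessagingFactory factory =
                MessagingFactory.createFromConnectionString("<connection-string>");
        IMessageReceiver receiver = ClientFactory.createMessageReceiverFromEntityPath(
                factory, "<queue-or-subscription-path>", ReceiveMode.PEEKLOCK);

        // The receiver stays open and is polled periodically; most polls return nothing
        // because the entity is idle most of the time.
        while (true) {
            IMessage message = receiver.receive(Duration.ofSeconds(5));
            if (message != null) {
                receiver.complete(message.getLockToken());
            }
            Thread.sleep(60_000); // poll roughly once a minute
        }
    }
}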

Expected behavior
Expected behaviour is to have lower idle CPU usage when there are no messages being received.

Setup (please complete the following information):

  • OS: RHEL 7.8, macOS 10.13, and Windows 10
  • IDE : N/A
  • Version of the Library used: azure-servicebus:3.4.0

Additional context

Information Checklist
Kindly make sure that you have added all of the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added
@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Aug 28, 2020
@joshfree joshfree added Client This issue points to a problem in the data-plane of the library. Service Bus Track 1 labels Aug 28, 2020
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Aug 28, 2020
@joshfree
Copy link
Member

Hi @ryan-murphy-sc someone from the Service Bus team will follow up with you shortly on this v3 issue.

/cc @yvgopal @hemanttanwar @conniey @YijunXieMS

@ramya-rao-a ramya-rao-a added the Service Attention Workflow: This issue is responsible by Azure service team. label Oct 15, 2020
@ghost
Copy link

ghost commented Oct 15, 2020

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jfggdl, @axisc.

@mrtnmch
Copy link

mrtnmch commented Nov 26, 2020

Hello, may I ask what the status of this issue is? We're affected by the bug as well. In the screenshot I'm attaching, there are two identical builds of one of our services - the second one has Service Bus disabled via configuration (i.e. no listeners are created; the library itself is still included and active).

When the listeners are created, there is essentially busy waiting in CoreMessageReceiver#returnMesagesLoopDaemon, which keeps the CPU constantly under load. This can easily be verified by enabling trace logging in the library - there's a log message roughly every millisecond (or every #RETURN_MESSAGES_DAEMON_WAKE_UP_INTERVAL, to be more precise).
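If anyone wants to reproduce the observation quickly: assuming Logback is the SLF4J backend, raising the receiver's logger to TRACE is enough to see the per-wakeup lines. The logger name below is my assumption based on the package layout.

import org.slf4j.LoggerFactory;

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;

public final class EnableReceiverTrace {
    public static void enable() {
        // Assumed logger name: CoreMessageReceiver lives in the primitives package.
        Logger receiverLogger = (Logger) LoggerFactory.getLogger(
                "com.microsoft.azure.servicebus.primitives.CoreMessageReceiver");
        receiverLogger.setLevel(Level.TRACE);
        // With an idle receiver open, expect a trace line roughly every millisecond
        // from the return-messages daemon loop.
    }
}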

Thank you.

[screenshot: CPU usage comparison of the two builds - listeners enabled vs. Service Bus disabled]

CC @axisc @shankarsama @jfggdl

@yvgopal
Copy link
Member

yvgopal commented Dec 2, 2020

Let me reproduce it and see how much CPU it is taking. When I ran it myself some time back, I didn't see any significant additional CPU usage. Of course, significant means different things to different people. In both cases, how many receivers were there in your application when you captured CPU? Trace logging is not the best way to see whether it is taking a lot of CPU. Also, it only comes into play for receivers that receive very few or no messages.

@ryan-murphy-sc
Copy link
Contributor Author

We have about 100 receivers throughout our app server. A number of them receive very few messages (1/hour or even 1/day). We're trying to consolidate to reduce the overhead caused by idle receivers, but doing so poses some design challenges, so it's going slowly. Thanks.

@yvgopal
Copy link
Member

yvgopal commented Dec 2, 2020

I have just tested it on my desktop. I created a Java program that runs 1000 receivers receiving no messages and nothing else; CPU usage is between 3% and 9%, averaging 6%. I don't really understand the 50% claim made with just 100 receivers. That 50% likely includes many other things.

I don't think it is consuming too much CPU. You are free to reproduce this with a standalone app that just creates some receivers and to check the CPU usage consumed by the app.

Another point I want to make is that the daemon doesn't technically run every 1 millisecond. It just sleeps for 1 millisecond after the daemon is run once, before running it again. You can guess how many CPU cycles it consumes.

In my test run on my desktop, when I changed the sleep duration from 1 millisecond to 10 milliseconds with the same 1000 receivers, CPU usage dropped to an average of 3%. I can make that change. But I am not sure this is the cause of your high CPU issues.
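To make that concrete, the loop is shaped roughly like this (an illustrative sketch only, not the actual SDK code); each idle receiver re-arms a near-no-op task on a shared scheduler:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DaemonLoopSketch {
    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();
    private static final long WAKE_UP_INTERVAL_MS = 1; // the 1 ms sleep under discussion

    static void startReturnMessagesLoop(Runnable returnMessagesIfAny) {
        SCHEDULER.schedule(() -> {
            returnMessagesIfAny.run();                    // usually a no-op on an idle receiver
            startReturnMessagesLoop(returnMessagesIfAny); // sleep 1 ms, then run again
        }, WAKE_UP_INTERVAL_MS, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        // Simulate 1000 idle receivers, each with its own loop on the shared scheduler:
        // the scheduler wakes up ~1000 times per millisecond doing almost no useful work.
        for (int i = 0; i < 1000; i++) {
            startReturnMessagesLoop(() -> { /* no prefetched messages, nothing to return */ });
        }
    }
}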

@ryan-murphy-sc
Copy link
Contributor Author

The 50% included other workers as well, not only ASB receiver threads, although they were a big portion of that number. This was running with the JVM pinned to a single core of an Azure A2 processor, so we're pretty resource constrained. Your workstation probably has more horsepower.
But anyway, if my performance claims are too vague and unscientific, then maybe focus on the much better testing done by @mxmxcz. I'll also see if I can get more conclusive results.

@yvgopal
Copy link
Member

yvgopal commented Dec 3, 2020

@ryan-murphy-sc @mxmxcz The reason I am saying it might not be taking much CPU is that your high CPU issue may not be resolved even if I changed the daemon sleep time from 1 millisecond to 100 milliseconds.

@mrtnmch
Copy link

mrtnmch commented Dec 3, 2020

@yvgopal Thank you for your answers. In the original screenshot I posted, the app was connected to just 1 subscription (with 1 listener and 1 instance). AFAIK there were no (or close to no) messages at the time, which is fairly common in our use case.

I created a quick PoC as you suggested: https://github.com/mxmxcz/servicebus-test. It's a bare Java app containing just the Service Bus library in its latest version, not even Spring. To amplify the effect on my local (much more powerful) machine, it creates several listeners on the topic's subscription. On my laptop it averages about 6% - but bear in mind that this is a high-end machine with an i7-9750H CPU (12 cores @ 2.60 GHz); our pods in k8s are far less powerful, for obvious reasons. I also tried running the app as several (10?) instances, each with 1 listener, which had a similar effect on some of them (those ran at ~8%, the rest at ~2%); I didn't find a way to reproduce that scenario reliably, though. There were no messages throughout the whole test; I was also the only one connected to the subscription.

[screenshot: CPU usage of the PoC application]

I absolutely agree with you on

Another point I want to make is that the daemon doesn't technically run every 1 millisecond. It just sleeps for 1 millisecond after the daemon is run once, before running it again. You can guess how many CPU cycles it consumes.

but when the daemon does nothing (which is the case when there are no messages, I suppose), the actual work takes much less than 1 millisecond (x << 1 ms), so it effectively runs every 1 ms + x. I completely understand that log messages are by no means a reliable way to debug the problem, but they help to uncover what's going on.

I don't understand the library's approach to threading, but is it really necessary to keep polling every x milliseconds, instead of using some blocking mechanism and waking the thread only when needed?
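For illustration, something along these lines is what I have in mind (just a rough sketch from the outside; the names and structure here are my assumptions, not the SDK's internals). It only does work when a message lands or a receive call is registered, and it never parks a thread per receiver:

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

public class WakeOnDemandSketch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final ConcurrentLinkedQueue<String> prefetchedMessages = new ConcurrentLinkedQueue<>();
    private final ConcurrentLinkedQueue<Consumer<String>> pendingReceives = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean drainScheduled = new AtomicBoolean(false);

    // Called whenever a message arrives or a receive call is registered.
    private void signal() {
        if (drainScheduled.compareAndSet(false, true)) {
            scheduler.execute(this::drain); // nothing runs and no thread blocks while idle
        }
    }

    // Pairs up prefetched messages with pending receive calls; runs only when signalled.
    private void drain() {
        drainScheduled.set(false);
        while (!prefetchedMessages.isEmpty() && !pendingReceives.isEmpty()) {
            pendingReceives.poll().accept(prefetchedMessages.poll());
        }
    }

    public void onMessagePrefetched(String message) {
        prefetchedMessages.add(message);
        signal();
    }

    public void onReceiveCallRegistered(Consumer<String> receiveCompletion) {
        pendingReceives.add(receiveCompletion);
        signal();
    }
}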

@yvgopal
Copy link
Member

yvgopal commented Dec 3, 2020

@mxmxcz Thanks for testing it. I put that 1 millisecond sleep in only after running it for some time to assess the CPU impact. The reason I took this daemon approach is to solve a producer-consumer problem without having to block a thread. A blocking queue would solve it for me, but it always blocks a thread; imagine 1000 receivers, each with one thread blocked in it. I considered multiple approaches and decided a daemon was better. Maybe there is another way I can solve it. As long as the CPU impact is not bad and the code is stable, I thought it was OK to have a daemon.
I can change it to 10 milliseconds in the next release, which will further reduce the CPU usage. But it may increase the maximum receive latency to 10 milliseconds instead of the current 1 millisecond. That shouldn't make much of a difference.

Having said that, I am not treating it as a bug that needs fixing now. It will be low priority for us and I will consider increasing the sleep time.

@yvgopal yvgopal closed this as completed Dec 3, 2020
@ryan-murphy-sc
Copy link
Contributor Author

@yvgopal will there be a separate GitHub issue for the enhancement, or can we continue to use this one and change the title to indicate it's an enhancement? It's an important issue for us and I want to keep some visibility on it.

@ryan-murphy-sc
Copy link
Contributor Author

@yvgopal also out of curiosity, is this daemon loop even relevant for Message Receivers that don't use prefetch and don't use MessageAndSessionPump? Our receivers rarely use prefetch and never use MessageAndSessionPump. So this entire daemon seems unnecessary.

Also I do feel this was closed prematurely. It's a very real issue impacting Azure Service Bus customers.

openapi-sdkautomation bot pushed a commit to AzureSDKAutomation/azure-sdk-for-java that referenced this issue May 26, 2021
Remove x-ms-flatten setting for resource properties (Azure#14550)

* Disabled flatten setting for properties

* added forceDelete parameter for orchestrator resource

* remove forcedelete

* config

Co-authored-by: msyyc <[email protected]>
@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023