Benchmarking with C++ Bindings #531

Open
CyberPoN-3 opened this issue Nov 27, 2024 · 9 comments

@CyberPoN-3

I'm trying to write a benchmark to evaluate the performance improvement over the first iceoryx (version 2.0.5), using the C++ bindings.

The benchmark consists of repeatedly sampling the time between the publish and the corresponding event notification on the consumer, with publisher and consumer running in two different processes (not two threads of the same process). The iceoryx1 test uses a WaitSet. After studying the new iceoryx2 examples, I found that the communication pattern I would like to test is the 'event' one, i.e. without having to explicitly poll for new data.

I took the cxx/event_multiplexing example as reference code, but I noticed that apparently I cannot send data through that mechanism. So the strategy I implemented is to first publish the data as in the cxx/publish_subscribe example and then use the event communication pattern to wake up the consumer process, which polls only once to get the published data (as explained in the cxx/publish_subscribe example).
The operations executed within the recorded time span are:

Publisher side:

  • publishing the data struct
  • triggering the event

Subscriber side:

  • waking up from the trigger
  • the single poll to get the data
  • computing the time difference against the timestamp the publisher recorded in the sent struct (roughly sketched below)
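
To make it concrete, the measurement part looks roughly like the sketch below; the struct and helper names are simplified, and the actual iceoryx2 publish/notify/wait calls are only indicated as comments:

```cpp
// Sketch of the timing/measurement part only (simplified names, iceoryx2
// calls indicated as comments). CLOCK_MONOTONIC (std::chrono::steady_clock
// on Linux) is system-wide, so a timestamp taken in the publisher process is
// comparable with one taken in the subscriber process on the same machine.
#include <chrono>
#include <cstdint>
#include <iostream>

struct TimestampedPayload {
    uint64_t publish_time_ns; // written right before publishing
    uint8_t data[1024];       // payload under test
};

static uint64_t now_ns() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now().time_since_epoch())
        .count();
}

int main() {
    TimestampedPayload payload{};

    // publisher process:
    payload.publish_time_ns = now_ns();
    // send 'payload' via the publish-subscribe service,
    // then trigger the event via the notifier

    // subscriber process (after the listener woke up and the sample was polled once):
    uint64_t const latency_ns = now_ns() - payload.publish_time_ns;
    std::cout << "publish-to-wakeup latency: " << latency_ns << " ns\n";
    return 0;
}
```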

The results I got are comparable to the first iceoryx, without substantial differences, as the following graphs show:

[graphs comparing the measured latencies of iceoryx 2.0.5 and iceoryx2]

So I would like to ask you whether there's a smarter way to conduct the benchmark while keeping the two sides in separate processes, and whether sending data through events is possible. If it is, an example would be very useful.

Thanks in advance,
Matteo

@elfenpiff (Contributor) commented Nov 27, 2024

@CyberPoN-3 Awesome, this is a benchmark that is missing in iceoryx2. Would you be open to contributing it? The perfect place would be benchmarks/cxx; I will later move the current benchmarks into benchmarks/rust.

whether sending data through events is possible

This will never be possible on that level. We explicitly separated the payload delivery (pubsub or request-response) from the wakeup mechanism in iceoryx2. The reasons for the separation are:

  • in an unseparated form, the publisher always triggers the other side with a syscall, which can cause overhead
  • complex events are impossible to realize, like: trigger the other side only when 2 samples from service A and one sample from service B are delivered and it's Friday
    • with classic iceoryx you would constantly wake up the other side, causing unnecessary scheduling overhead, etc.
  • the ability to explicitly poll for data without any syscall when a higher-level executor instance has woken you up

You implemented it exactly as it was intended to be used. Yesterday we also added a more complex example describing this setup; take a look at: https://github.com/eclipse-iceoryx/iceoryx2/tree/main/examples/cxx/event_based_communication

So I would like to ask you if there's a smarter way to conduct the benchmark

The benchmark we did was to measure a ping-pong setup.

  • You have two processes P1 and P2
  • A pubsub and event service S12 for communication from P1 to P2
  • A pubsub and event service S21 for communication from P2 to P1

For your benchmark you would do the following:

  1. Send a sample from P1 to P2 via the publisher of S12
  2. Send a notification from P1 to P2 via the notifier of S12
  3. Wait on P2 until you receive the notification
  4. Send a sample from P2 to P1 via the publisher of S21
  5. Send a notification from P2 to P1 via the notifier of S21
  6. Wait on P1 until you receive the notification
  7. Go to step 1

You repeat this cycle 1,000,000 times and measure the runtime T. Then you divide the measured runtime T by 1,000,000 times 2. Times 2 because it is a two-way communication, once from P1 to P2 and then back from P2 to P1.

The reason why we did this is that the underlying system time call clock_gettime may have a huge overhead compared to the overall iceoryx2 communication. On some AMD systems we observed a runtime of around 1 µs for that call, which is problematic when the latency of iceoryx2 is 100 ns, so this call alone can increase the measured runtime by a factor of 10. But when you repeat the communication 1,000,000 times and call clock_gettime only twice (once at the start and once at the end), the influence of that call's overhead becomes irrelevant.
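
As a rough sketch of the timing skeleton on the P1 side (placeholder comments instead of the actual iceoryx2 calls; P2 just mirrors the loop without any timing):

```cpp
// Ping-pong timing skeleton for P1: the clock is read only twice, so its
// overhead is amortized over all iterations. P2 runs the mirrored loop
// (wait on S12, then send + notify via S21) without any timing.
#include <chrono>
#include <cstdint>
#include <iostream>

int main() {
    constexpr uint64_t iterations = 1'000'000;

    auto const start = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < iterations; ++i) {
        // 1. send a sample via the S12 publisher and notify via the S12 notifier
        // 2. block on the S21 listener until P2 has sent its sample back
    }
    auto const end = std::chrono::steady_clock::now();

    auto const total_ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    // divide by 2 * iterations since every iteration contains two one-way trips
    double const one_way_latency_ns =
        static_cast<double>(total_ns) / (2.0 * iterations);
    std::cout << "one-way latency: " << one_way_latency_ns << " ns\n";
    return 0;
}
```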

whether sending data through events is possible. If it is, an example would be very useful.

Btw., in the long term we want to introduce something like a meta port which contains all ports and does the event notifications for you, so that you can enjoy a much simpler API. But this port would have the disadvantages I described above. When ultra-low latency is not a requirement, it becomes a nice alternative.

@elfenpiff (Contributor)

@CyberPoN-3 Btw., what does the red dotted line in your graphs mean?

@CyberPoN-3 (Author)

Hi, sorry for the late reply!

Would you be open to contribute the benchmark?

For sure! Give me the time to write it as well as I can, then I'll be pleased to share it with you! Also feel free to edit the benchmark as you prefer, since I'm testing a use case I need for my application; in fact, that 100 µs red dashed line is a reference I selected based on my previous benchmarks.

The benchmark we did was to measure a ping-pong setup.

I see, that's a great way to conduct the test and I agree with the strategy you adopted! In my case I would also be interested in benchmarking the readiness of the system when there is a delay between one ping-pong and the next, in order to simulate what happens if, for example, I publish data at a certain frequency, like 1 Hz, 10 Hz, 100 Hz, etc. That's one of the main focuses of the benchmark I'm doing. In fact, from the results I got so far, it looks like the lower the publishing frequency, the higher the response time to wake up the subscriber process upon an event. Before talking about numbers (roughly a ~50 µs difference between 100 Hz and 1 Hz), since I'm pretty new to iceoryx, I would like to make sure I've done my best to write the benchmark in the fastest way possible.

The reason why we did this is that the underlying system time call clock_gettime may have a huge overhead compared to the overall iceoryx2 communication...

That's really interesting. I'm working on a solution to benchmark the way I need while trying to exclude that huge overhead.
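
For example, something along these lines: calibrate the cost of the clock call itself and subtract it from the per-sample measurements (just a sketch of the idea, not an iceoryx2 API):

```cpp
// Calibration sketch: estimate the average cost of one clock call by timing
// a large number of back-to-back calls, so it can later be subtracted from
// per-sample latency measurements at low publish frequencies (where the
// "measure once around a million iterations" trick does not apply).
#include <chrono>
#include <cstdint>
#include <iostream>

static uint64_t now_ns() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now().time_since_epoch())
        .count();
}

int main() {
    constexpr uint64_t rounds = 1'000'000;

    uint64_t const start = now_ns();
    for (uint64_t i = 0; i < rounds; ++i) {
        (void)now_ns();
    }
    uint64_t const end = now_ns();

    double const clock_overhead_ns = static_cast<double>(end - start) / rounds;
    std::cout << "approx. cost of one clock call: " << clock_overhead_ns << " ns\n";
    // later: corrected_latency_ns = (receive_ts_ns - publish_ts_ns) - clock_overhead_ns
    return 0;
}
```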

Thanks for sharing! Any advice would be appreciated!

@elfenpiff (Contributor)

@CyberPoN-3

from the results I got so far, it looks like the lower the publishing frequency, the higher the response time to wake up the subscriber process upon an event

What you observe here is a delay in the operating system. The lower the frequency, the longer the receiving side sleeps, and the OS puts a process into a deeper sleep the longer it is inactive. Meaning, processes that are activated at a high frequency are rescheduled more often and may sit in a higher-priority scheduling queue, while inactive processes are rescheduled less frequently and are moved to a lower-priority scheduling queue. Or, when a priority queue is in place, the priority of the process may decrease over time - depending on the underlying scheduler of the OS.

To have a more responsive system, on Linux you could compile your own kernel and configure it with the relevant parameters:
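
In essence, a higher timer tick frequency and a preemptible kernel, for example:

```
# presumably the relevant options: scheduler tick frequency and kernel preemption
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_PREEMPT=y
```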

This should increase the frequency with which the scheduler checks for process activities, at the slight cost of an increased CPU load.

@CyberPoN-3 (Author)

Thank you for the advice @elfenpiff, I'll try it out!

@CyberPoN-3 (Author)

@elfenpiff

To have a more responsive system, on Linux you could compile your own kernel and configure it with the relevant parameters:

I'm working on an NVIDIA Jetson AGX Xavier; I've actually just finished checking both and it seems they are set by default as you suggested.
By executing uname -a I get: Linux tegra-ubuntu 5.10.216-tegra #1 SMP PREEMPT Wed Aug 28 01:46:00 PDT 2024 aarch64 aarch64 aarch64 GNU/Linux

In order to determine the timer interrupt frequency I followed a guide that provides a small test to reveal the setting (since I don't have a /boot/config- file to read on the Jetson...), and its result is:
kernel timer interrupt frequency is approx. 4016 Hz or higher.

The lower the frequency is, the longer the receiving side sleeps, and the OS puts a process into a deeper sleep the longer it is inactive.

Actually, my setup has the publisher, subscriber and RouDi running at the maximum real-time priority level (99) with a round-robin scheduling scheme (I used chrt for that and of course checked via htop that the priority was set correctly). Besides that, I locked the CPU at its maximum frequency for the Xavier (~2.2 GHz, 8 cores) with the performance scaling governor applied and the MAXN nvpmodel. The board was left completely unloaded during the benchmarks, and I also did a run varying the RouDi/subscriber/publisher priorities over [99 RT, 100 CFS, 119 CFS], trying all possible combinations, and the pattern of increasing response time with longer publish periods still shows up.

Is there anything else I can do to dig deeper into that?
Thanks in advance.

@elfenpiff (Contributor)

@CyberPoN-3 I found this answer interesting: https://stackoverflow.com/a/13619750

It states that round-robin is a suboptimal scheduling algorithm, especially when it comes to processes waiting for IO - exactly what you are doing here: waiting for an event notification that is sent via a UNIX datagram socket.

Could you try other schedulers? Maybe deadline or CTF will provide better results - I am not an expert on schedulers, so those are just wild guesses. I would explore schedulers and how they work in order to optimize the reaction time.

Maybe you are able to decrease the latency of waking up another process, but it could come at a high price: higher CPU load, which increases the actual computation time and would in turn increase the latency of the overall system.

Waking up a process/thread that is in a deep sleep always takes a bit longer, since the process and its memory have to be reloaded. So the simplest way to avoid a deep sleep is a busy loop - but this will cost massive CPU time.

@CyberPoN-3 (Author)

@elfenpiff
Thank you elfenpiff, I'll experiment more with the schedulers to see the difference. As soon as I have the results I'll update you!

or CTF will provide better results

What is CTF? Actually, I had never heard about it and Google didn't save me this time xD

@elfenpiff (Contributor)

@CyberPoN-3

What is CTF? Actually, I had never heard about it and Google didn't save me this time xD

My mistake, I mixed something up. I meant CFS (the Completely Fair Scheduler).
