Fix 841 Part-2: C-Tree deepening #849
Conversation
*(branch force-pushed from c28290f to 808b32d)*
@andsel, before investigating further I'd like your opinion on this direction of artificial tree-deepening... The open issue is the list of subscriptions inside one node: when many clients subscribe to the same topic, the same problem exists with a list that becomes too large. That would be trickier to solve, probably by building a tree based on the client-id?
@hylkevds I'll review all the information, but I'm a little bit slow.
First of all, thanks a lot for your great investigation into very wide trees.
I think this is the right path to investigate: it's artificial, but it tests the tree operations while varying one of the dimensions (the arity of the nodes, the branching factor).
Given that the problem is the copy, what if we use the principle (immutable structures) that CTries embody?
It would help when we have a topic with many subscribers and a new child-topic is added to that topic. On the other hand, when returning the list of subscriptions from a node, a copy is also made! This could be improved with the CoW list, since we could return (a read-only wrapper of) the list itself instead of a copy. Quite a few aspects to this issue...
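A minimal sketch of that idea, with a hypothetical `Subscription` stand-in (not the actual moquette code): readers get an unmodifiable view of the CopyOnWrite list instead of a fresh copy.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class NodeWithSharedReads {
    record Subscription(String clientId, int qos) {} // hypothetical stand-in

    // Writers pay the copy cost on add/remove; the backing array never mutates in place.
    private final List<Subscription> subscriptions = new CopyOnWriteArrayList<>();

    void add(Subscription sub) {
        subscriptions.add(sub);
    }

    // Instead of returning new ArrayList<>(subscriptions) (a full copy per read),
    // hand out an unmodifiable view of the list itself.
    List<Subscription> view() {
        return Collections.unmodifiableList(subscriptions);
    }
}
```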
Right, correct: the copy step on the CopyOnWrite list is as heavy as the actual copy.
Oh, that looks really interesting! That should be really simple to drop into the current code.
My first tests have some really interesting results. I've put them in a spreadsheet: the first two sheets are with many clients on few topics. This tests the
The third and fourth sheets test the CTree with a deep set of subscriptions. For this case there is little difference between the two, which is expected, since the CTree implementation is pretty optimised for deep trees. But performance also doesn't degrade! The last two sheets test the CTree with a really shallow and wide tree. Here performance is very similar, but slightly better with PCollections. The difference becomes very clear when not artificially deepening the tree much: without PCollections, performance absolutely plummets when nodes get more than 9999 children. With PCollections, performance still degrades, but much more gently, and only after nodes get more than 99999 children.
@hylkevds could you specify what is in column A? From a first sight of the results:
If I'm not wrong, does the plain list perform equal to or better than the persistent collection?
Column A is the maximum token length. When Topic.java cuts a topic into tokens it normally does this on the `/` separator.
Shorter tokens mean a deeper tree; longer tokens, a wider tree. A maximum token length of 1 makes many nodes with only 1 child, hence the bad performance.
No, lower is better. The numbers are how long it took to add 50_000 subscriptions, in milliseconds.
Yes, this surprises me. Maybe the test hits a worst-case scenario for the tree implementation of the persistent collections...
Indeed, since it essentially copies the entire list of clients into an array. We need to see whether we can avoid this copy operation: the persistent collections are persistent, so they should not need to be copied.
The Persistent Collections are much better when adding, by a factor of 6.
An important detail is that the performance of the PCollections doesn't worsen with increasing thread count, as the array implementation's does.
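For illustration, a minimal sketch of the persistent-collection principle using the PCollections library (`org.pcollections`); the `Subscription` record is a hypothetical stand-in:

```java
import org.pcollections.PVector;
import org.pcollections.TreePVector;

public class PersistentAddDemo {
    record Subscription(String clientId, String topicFilter) {} // hypothetical stand-in

    public static void main(String[] args) {
        PVector<Subscription> v0 = TreePVector.empty();
        // plus() returns a NEW vector; v0 is untouched and needs no defensive copy.
        PVector<Subscription> v1 = v0.plus(new Subscription("client-1", "sensors/+/temp"));
        PVector<Subscription> v2 = v1.plus(new Subscription("client-2", "sensors/#"));
        // Old versions remain valid snapshots: readers holding v1 never see v2's change.
        System.out.println(v1.size() + " vs " + v2.size()); // prints "1 vs 2"
    }
}
```

Because each version shares most of its structure with the previous one, adding an element does not require copying the whole collection, which is exactly the property the array-backed nodes lack.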
I found the issue with the remove speed: That brings the remove speed up to add-speed levels: ~20-30 ms for 20k subscriptions. Now to see if the read can be improved, and then go over the shared subscriptions.
*(branch force-pushed from ca474d3 to 0642876)*
@andsel, a question: a big performance bottleneck is the method selectSubscriptionsWithHigherQoSForEachSession(List subscriptions). It turns a potentially massive list of subscriptions into a map, and then back into a list, in the hope that this reduces the number of messages to be sent out...
This method is also the only reason a copy of the subscription list must be made. If we change the behaviour to send a message per matching topic, we can pass the original subscription lists around without ever making a copy. Even the selection of which client to message, in the case of shared subscriptions, could be done "late", without making a copy of the subscription list or map. What are your thoughts on this?
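For context, a minimal sketch of the map-then-list de-duplication described above (the `Subscription` stand-in and method body are illustrative, not the actual moquette source):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class QoSSelection {
    record Subscription(String clientId, String topicFilter, int qos) {} // hypothetical stand-in

    // List -> Map -> List: for each client session, keep only the highest-QoS match.
    // This is the step that forces materializing (copying) the whole subscription list.
    static List<Subscription> selectHighestQoSPerSession(List<Subscription> matches) {
        Map<String, Subscription> bestPerClient = new HashMap<>();
        for (Subscription sub : matches) {
            // merge() receives (existing, candidate) and keeps the higher-QoS one.
            bestPerClient.merge(sub.clientId(), sub,
                    (existing, candidate) -> existing.qos() >= candidate.qos() ? existing : candidate);
        }
        return new ArrayList<>(bestPerClient.values());
    }
}
```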
Hi @hylkevds,
at the time I opted for the option that moves less traffic, so selecting only the highest-QoS subscription for each client session. I think we could loosen this requirement, but we have to understand how frequently such an overlap actually happens. Or it could be made configurable with a setting. I can imagine a user of the broker that, up to the version before this change, always received (in case of overlapping subscriptions for the same session) only the highest-QoS publish, but after this change receives multiple publishes for the same topic at different QoS levels.
In our use cases we never have overlapping topics, since the OGC SensorThings API doesn't allow wildcards. So I think I'm not the right person to ask :D. But you are absolutely correct in saying that changing behaviour between versions is not good. Before going into all that, though, I first did some tests to see if it is even relevant. So I made a quick prototype and am working on the performance analysis. I hope to finish that soon!
To test if the duplicate check makes a difference, I did some tests. Instead of the CTree returning a
I fixed the remove speed by changing the subscription collections to a Map based on the ClientId. Since a Client can only have 1 normal subscription per topic, putting them in a Map makes the most sense. After that I re-ran the performance tests and updated the GSheet.

The Tests

There are three test setups; for each setup it gathers numbers for creating, reading and removing subscriptions:
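A minimal sketch of the Map-based layout described above, with a hypothetical `Subscription` stand-in: removal becomes a hash lookup instead of a linear scan plus array copy.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NodeSubscriptions {
    record Subscription(String clientId, int qos) {} // hypothetical stand-in

    // One normal subscription per client per topic, so ClientId is a natural key.
    private final Map<String, Subscription> byClientId = new ConcurrentHashMap<>();

    void add(Subscription sub) {
        // Replaces an older subscription of the same client, matching MQTT semantics.
        byClientId.put(sub.clientId(), sub);
    }

    void remove(String clientId) {
        byClientId.remove(clientId); // O(1) instead of scanning a list
    }
}
```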
As far as I can understand from the charts and the description, I would summarize as follows:
Does your test also consider the case of multiple matches for the same client and overlapping subscriptions?
The "Update" tests no longer de-duplicate messages for clients, so clients with multiple matching subscriptions get multiple messages. I've not yet added an option to turn that back on. In these tests, that would mainly influence the "1 Topic" tests, since that is the only one with multiple subscriptions per topic. I also wonder if it might help to cache the subscriptions for a topic in a List, to speed up subsequent iterations over that subscriptions list, until a subscription is added or removed again. |
I would avoid adding too much complexity if it's not needed. Maybe that would help for an ultimate performance gain.
*(branch force-pushed from e913366 to b136fc5)*
This is a prototype / proof-of-concept, not ready for merging!
Issue #841 is about C-Tree performance. This prototype demonstrates both one of the issues and a possible solution to it.
The C-Tree stores the children of a node in an array. Because the tree is lock-free, it must make a copy of a node to add a child to it, and this involves copying the array of child-nodes. For flat trees, where nodes have very many child-nodes (more than 1k), this is exceptionally slow. This slowness makes the lock-free approach worse: the longer the copy takes, the more likely it is that another thread has modified the node in the meantime, and thus the more likely it is that the copy has to be re-done.
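The retry dynamic can be illustrated with a minimal lock-free sketch (not the actual C-Tree code): the wider the child array, the longer the copy window, and the more likely a concurrent writer forces a redo.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

class WideNode {
    private final AtomicReference<String[]> children = new AtomicReference<>(new String[0]);

    void addChild(String child) {
        while (true) {
            String[] current = children.get();
            // O(n) copy of the whole child array; with >1k children this window
            // grows, and concurrent writers invalidate it more often.
            String[] next = Arrays.copyOf(current, current.length + 1);
            next[current.length] = child;
            if (children.compareAndSet(current, next)) {
                return; // success: no other thread replaced the array meanwhile
            }
            // Another thread won the race: the copy is wasted, retry from scratch.
        }
    }
}
```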
This PR artificially deepens the tree, to avoid having nodes with too many child-nodes. It does this by chopping long tokens into shorter tokens. The maximum token length is currently hard-coded in a constant, but should be configurable for a real implementation.
Testing with numeric tokens means that a 1-character maximum token length results in at most 10 child-nodes per node (0-9), while a 4-character maximum token length results in up to 10000 child-nodes per node (0-9999).
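A minimal sketch of the chopping step (the constant name and helper are illustrative, not the PR's exact code):

```java
import java.util.ArrayList;
import java.util.List;

class TokenChopper {
    // Hard-coded here as in the prototype; should be configurable in a real implementation.
    static final int MAX_TOKEN_LENGTH = 4;

    // "123456789" -> ["1234", "5678", "9"]: one long token becomes three tree levels,
    // capping the branching factor at 10^4 for numeric tokens.
    static List<String> chop(String token) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < token.length(); i += MAX_TOKEN_LENGTH) {
            parts.add(token.substring(i, Math.min(token.length(), i + MAX_TOKEN_LENGTH)));
        }
        return parts;
    }
}
```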
Some initial testing on a single-threaded insert:
| Max length | Time for 500k subs | Subs/second |
|-----------:|-------------------:|------------:|
| 1 | 3777 ms | 132380/s |
| 1 | 3928 ms | 127291/s |
| 1 | 3441 ms | 145307/s |
| 2 | 1592 ms | 314070/s |
| 2 | 1585 ms | 315457/s |
| 2 | 1688 ms | 296209/s |
| 3 | 1788 ms | 279642/s |
| 3 | 1553 ms | 321958/s |
| 3 | 1560 ms | 320513/s |
| 4 | 1047 ms | 477555/s |
| 4 | 888 ms | 563063/s |
| 4 | 964 ms | 518672/s |
| 5 | 6039 ms | 82795/s |
| 5 | 6800 ms | 73529/s |
| 5 | 6849 ms | 73003/s |
| 6 | 195648 ms | 2556/s |
| 6 | 201065 ms | 2487/s |
| 6 | 200722 ms | 2491/s |
From the initial numbers it is clear that a maximum token length of 4 performs best in this test.
So there clearly is an optimum. Where this optimum lies probably depends very much on the topic structure and the number of cores used.