Support zero-knowledge scaling of RTPEngine #892

Open
wants to merge 51 commits into
base: master

Conversation

guss77

@guss77 guss77 commented Dec 11, 2019

The current model for creating RTPEngine clusters, as documented in the Redis keyspace notification wiki page, requires a static configuration where the exact topology of the cluster is fully known ahead of time, and a Redis keyspace is allocated for each cluster member.

Here we introduce a new way to handle RTPEngine clusters that allows a cluster to scale out and in automatically without pre-configuring each cluster member with all the known addresses of all other members. This configuration supports completely stateless message distribution (for example, the kind offered by a layer 3 network load balancer) by remembering the network identity of a call owner in the Redis-distributed data structure. A node that receives a command for a session created by another node can then answer with the owner's network identity without knowing it ahead of time.

This patch, though minimal, is probably a bit too hackish - and is not configurable at all - so it is presented here as a basis for discussion. Please let me know what you think.

@guss77
Author

guss77 commented Dec 11, 2019

I forgot to note the expected configuration required for this setup to work:

The configuration should have:

  • One local interface (with advertised address or not, label is optional)
  • Redis server configured with a single known database number
  • The same database used as the subscribed keyspace

Example:

[rtpengine]
interface=172.30.1.1!12.34.56.78
listen-ng=2223
redis=redis-server:6379/1
sip-source=true
num-threads=2
subscribe-keyspace=1

@rfuchs
Member

rfuchs commented Dec 12, 2019

Ok, so to summarise: Normally rtpengine stores the name (label) and the ID of the local interface used for each local port into Redis, and then uses that information to restore the same status when a call is restored. This patch adds a plain interface address to the call structure, and when present, rtpengine uses that address instead of the actual interface address when communicating to the outside world, while the actual UDP socket is still bound to whatever local interface is configured. Did I get it right so far?

@guss77
Author

guss77 commented Dec 12, 2019

Yes, that's basically it. This works well for me where my nodes are running behind a 1-to-1 NAT where what we store and advertise is an IP address not listed on the local interface. I don't think it matters for this behavior change (I assume that without the NAT public IP address, advertised_address will be the actual local IP address).

The main issue is that the stored interface address belongs to another node (it is only used for "foreign calls"), so the RTP socket is actually bound on another machine.

@rfuchs
Member

rfuchs commented Dec 13, 2019

Ok. I don't have a problem adding such a feature, but obviously it would have to be optional and configurable. Also I'd like to have it in a more generic way, meaning without the requirement for a special case config. This can probably be achieved by having the "connection address" stored not per call, but rather per socket (or per media), and perhaps add a flag to configured interfaces to signal that this is an interface that can receive non-local network traffic.

@guss77
Author

guss77 commented Dec 18, 2019

Sorry for taking so long to respond. I want to advance this feature request, but am running into difficulties:

  1. When you say "per socket (or per media)", what do you mean in regards to the Redis call data structure? Do you mean the thing that redis.c calls "sfd-%u" or "stream-%u" ? And why would I want to store it there - don't all streams go through the same server?
  2. We seem to be having some trouble with SRTP, which needs more investigation and likely more work - I'll update when we have a solution, which will likely require storing more information in Redis.

@rfuchs
Member

rfuchs commented Dec 18, 2019

Re 1. Yes all streams go through the same server, but a single call can have ports/sockets allocated from multiple different interfaces. This feature currently requires that only a single interface is configured. I'd like to lift this restriction and make it work also in scenarios with multiple interfaces present.

SRTP should not really be affected by this. If it works without this feature, it should work with it. At least I see no reason why it shouldn't 😄

@guss77 guss77 force-pushed the feature/zero-knowledge-scaling branch from 8ff568b to 5f62bfd Compare January 1, 2020 14:01
@guss77
Author

guss77 commented Jan 1, 2020

@rfuchs - I'd appreciate your input on a problem I'm having with this feature: when rtpengine loads a foreign call from Redis, as part of loading the data structure into memory, it also tries to open all the ports described in the sfds. I've sometimes been getting errors in that process: when multiple servers are handling calls, they can open the same port numbers for different calls (on different servers), so when a foreign call is loaded, it might describe ports that are already in use by local calls.

I tried to figure out how the static configuration, with multiple interfaces configured on the local machine (describing interfaces on remote machines), handles this, but I can't see how you avoid that problem in the static configuration either. I'd appreciate any pointers.

The last commit - that tries to just not allocate the ports - was causing later code to SIGSEGV or SIGABRT, so I removed it.

@guss77
Author

guss77 commented Jan 5, 2020

Possibly related to my last question, you said:

perhaps add a flag to configured interfaces to signal that this is an interface that can receive non-local network traffic

Would that mean that __get_consecutive_ports would know that it doesn't need to allocate a port for a foreign call? If so, how can it propagate that knowledge? Everything after the call to __get_consecutive_ports in redis_sfds assumes that there is a stream_fd that holds a real OS socket.

@guss77
Author

guss77 commented Jan 6, 2020

Ok, I've been spending a lot of time with this issue the last week, as I still can't get this feature to work, so here's a summary of the status - what I'm looking for and what I understood rtpengine does. Hopefully it will inform our communications.

Current clustering method

I believe I failed to understand what the current cluster/redis support in rtpengine does. The way I understand it now is that the cluster setup is supposed to be as follows:

  1. A fixed number of servers are set up and configured. The IP addresses and interfaces for all of these are known.
  2. Each server's network stack is configured with all of the IP addresses of all of the servers! (I'm pretty sure I'm correct about this, though the Wiki page doesn't mention that at all).
  3. Some kind of network router, such as ipvsadm, knows which server "owns" which IP address and, when a server fails, routes traffic destined to the IP of the failed server to another server, where that IP is already configured on the network stack, so it just works.
  4. A Redis "database" number is pre-allocated for each server.
  5. rtpengine is configured with all of that information.

When a new offer is received, the primary rtpengine that received it sets up the call data and stores it in its Redis database. All other servers (I'm going to call them "alternates" from now on) read the data and set up "foreign call" data, while also opening all the listening ports required, on their "copy" of the primary's network interface (i.e. on the IP alias that was pre-configured). The alternates can do that because the primary broadcast the logical name of the network interface it runs the call on, they have a configuration that identifies which IP that name is globally attached to (i.e. it's the same name for the same IP in all rtpengine instances), and that IP is already configured on some network interface in the OS (step 2 above).

While concurrent calls with different primaries may re-use port numbers, because each logical interface has its own port pool, it works fine because "conflicting" ports are bound on different IP addresses and the OS network stack handles that nicely.
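
To illustrate the point about port pools, here is a minimal sketch, assuming each logical interface keeps its own pool keyed by name. `PortPool`, the interface names and the addresses are hypothetical and are not rtpengine's actual structures:

```python
class PortPool:
    """Hypothetical per-interface port pool (not rtpengine's real structure)."""

    def __init__(self, address, first=30000, last=30009):
        self.address = address
        self.free = list(range(first, last + 1))
        self.used = set()

    def reserve(self, port):
        """Reserve a specific port, as an alternate does when mirroring a call."""
        if port in self.used:
            raise ValueError(f"{self.address}:{port} already in use")
        self.free.remove(port)
        self.used.add(port)
        return (self.address, port)


# One pool per logical interface name; every instance has the same mapping.
pools = {
    "node-a": PortPool("10.0.0.1"),
    "node-b": PortPool("10.0.0.2"),
}

# Two calls with different primaries may pick the same port number ...
call_1 = pools["node-a"].reserve(30000)
call_2 = pools["node-b"].reserve(30000)
# ... and the (address, port) pairs still differ, so nothing collides.
```

The collision only appears when two calls share one pool, which is exactly what happens once every node uses a single local interface.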

  • Pros: in-call failover: as an rtpengine instance crashes, assuming the network router knows to detect that and "migrate" the IP address, calls can continue with possibly no interruption.
  • Cons:
    • As long as the primary is running, all commands must be directed at the primary and alternates are only used for fallbacks in case of an rtpengine failure, so the load balancer needs some kind of session tracking, probably call-id based, so it is a layer 5 LB.
    • There is a hard limit on the number of servers by the number of supported Redis databases (IIRC it's 16)
    • There is a soft limit by the complexity of the network configuration and needing to manage it.

Suggested "zero-knowledge" clustering method.

What I'm trying to achieve is a dynamic cluster that supports:

  1. Auto-scaling: we shouldn't need to pre-allocate all the resources - running on modern cloud infrastructure should allow us to change resource allocation in runtime to track demand. This means we can't pre-configure a set of IP addresses
  2. Work with minimal configuration dumb load balancers, such as those offered by cloud infrastructure, that have no session tracking and no failover logic.

Problems:

  • we can't pre-configure network interfaces (theoretically we can use a pool of known IP addresses, and re-use them when scaling out, but that limits scaling and makes it much more costly both in configuration complexity and actual money: pre-allocated IPs have a cost, so I'd rather not do that if I can avoid it).
  • we can't pre-allocate Redis databases per-server.
  • we can't assume intelligent routing: servers will have to answer for commands where they are not the primary! (this is the main problem I'm stuck on atm).

The setup I'm looking for will be like this:

  1. Servers can come and go with completely arbitrary configuration.
  2. Each server gets configured at runtime to know just its local IP addresses.
  3. A simple layer 3 or 4 load balancer routes control commands arbitrarily (i.e. with no stickiness)
  4. All servers share a single Redis "database"
  5. rtpengine is configured just with the local interfaces and the single Redis database for both writing and reading. RTP is routed directly from UAs to servers without load balancing.

When a new offer is received, the primary rtpengine that received it sets up the call data, records the IP addresses it uses for each of the two media objects and stores it in the Redis database (actually in the SDP's "<family> SPACE <address>" format, because there's an easily refactored piece of code that does it like that and we always need both data items at once). All alternates read the data and set up "foreign call" data, including the primary's network interface addresses, but do not open listening ports - because we worry about port pool conflicts and we can't listen on "other servers' IPs" anyway.

When an alternate receives a command (because the load balancer is dumb), it generates the correct SDP for continuing to route the media through the primary, because it has all the information: it knows all the port numbers and the controlling (foreign) network interface's family and address.
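
As a rough sketch of that step, assuming the record restored from Redis carries the owner's address in the "<family> SPACE <address>" form plus the remembered port. The record layout and the function names are made up for illustration and do not match the actual code in this PR:

```python
def connection_line(stored):
    """Build an SDP c= line from a stored '<family> <address>' string."""
    family, address = stored.split(" ", 1)
    if family not in ("IP4", "IP6"):
        raise ValueError(f"unexpected address family: {family!r}")
    return f"c=IN {family} {address}"


def media_lines(foreign_media):
    """SDP lines an alternate could emit for media owned by another node.

    `foreign_media` is a hypothetical record restored from Redis; the port is
    only remembered here, never bound locally.
    """
    return [
        f"m=audio {foreign_media['port']} RTP/AVP 0",
        connection_line(foreign_media["owner_address"]),
    ]


record = {"owner_address": "IP4 12.34.56.78", "port": 30000}
lines = media_lines(record)
```

Storing family and address together means the alternate never has to guess the address family when it rewrites the SDP.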

  • Pros:
    • configuration is much simpler and less limited.
    • should work well with and take advantage of modern cloud infrastructure.
  • Cons:
    • If a server goes down, all calls it is running are lost. For my requirements this is a minor problem in and of itself (assuming a server crash), but it is a major problem with scale-in: a scale-in action has to take the server out of the command load balancer and then wait until all calls terminate before killing the server. This is not a trivial setup but seems quite doable with relatively simple control logic that we will create later (and I won't mind sharing).
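
The scale-in control logic mentioned in the last point could look roughly like the following. This is only a sketch under assumptions: `remove_from_lb` and `count_owned_calls` are hypothetical callbacks into whatever load balancer and monitoring are in use, and nothing here exists in rtpengine:

```python
import time


def drain(node, remove_from_lb, count_owned_calls, poll_seconds=5.0,
          timeout_seconds=3600.0, sleep=time.sleep, clock=time.monotonic):
    """Stop new commands reaching `node`, then wait for its calls to end.

    Returns True if the node drained and is safe to terminate, False if the
    timeout expired with calls still active.
    """
    remove_from_lb(node)
    deadline = clock() + timeout_seconds
    while count_owned_calls(node) > 0:
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
    return True
```

The timeout is there so that a single very long call cannot block a scale-in forever; what to do when it expires is a policy decision left to the caller.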

The problems

  • Storing the controlling network interface address data in Redis so alternates can "fake" SDP responses - this is done and I've settled on storing it in the media structure (it is currently in another branch and not in this PR, but I'll push it here shortly).
  • Supporting "foreign ports" for socket_t (that we shouldn't bind to): when the Redis code loads the foreign call, it needs to create the data structures to faithfully reproduce the original port numbers in the SDP for "alternate answers", but we can't actually bind to the ports because (a) it's useless and (b) it might conflict with ports for local calls. I've added a field and some support code for using socket_t as "not an actually bound port", and it seems to be working fine, but the logic for deciding when to use it probably breaks the static cluster use case - I probably want to add a new configuration field to trigger this behavior?
  • Reusing a "cached" network address when rewriting the SDP - done.
  • Updating the primary data when an alternate handles an answer: the "leg B" endpoint is known only in the answer, and without letting the primary know about it, the primary won't send RTP to B's UA. This is the main issue I'm trying to tackle now. The alternate doesn't write things back into Redis, but even if it did, the primary would just ignore it, as it ignores notifications on its own calls, and even if it didn't, we'd run into all kinds of update races and data conflicts. I think the best way would be to add another pub/sub for "back updates" where the alternate can send just the UA endpoint data from the answer - when a primary sees a "back update" for its own call, it can update the endpoint, keep calm and carry on.
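
To make the "back update" idea from the last point concrete, here is a minimal in-process sketch. `Bus` stands in for a Redis pub/sub channel, and the channel name, message fields and `Node` class are all assumptions for illustration, not something this PR implements:

```python
from collections import defaultdict


class Bus:
    """Tiny in-process stand-in for a Redis pub/sub channel."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, handler):
        self.subscribers[channel].append(handler)

    def publish(self, channel, message):
        for handler in self.subscribers[channel]:
            handler(message)


class Node:
    def __init__(self, node_id, bus):
        self.node_id = node_id
        self.bus = bus
        self.owned = {}  # call_id -> {"endpoint_b": ...}
        bus.subscribe("back-updates", self.on_back_update)

    def on_back_update(self, message):
        # Only the owner applies the update; everyone else ignores it.
        call = self.owned.get(message["call_id"])
        if call is not None and message["sender"] != self.node_id:
            call["endpoint_b"] = message["endpoint_b"]

    def answer_as_alternate(self, call_id, endpoint_b):
        # Publish just the endpoint learned from the answer, nothing else.
        self.bus.publish("back-updates", {
            "sender": self.node_id,
            "call_id": call_id,
            "endpoint_b": endpoint_b,
        })


bus = Bus()
primary, alternate = Node("a", bus), Node("b", bus)
primary.owned["call-1"] = {"endpoint_b": None}
alternate.answer_as_alternate("call-1", "198.51.100.7:40000")
```

Because the message carries only the endpoint and the owner is the only writer of its own call state, the update races mentioned above are sidestepped rather than resolved.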

Any pointers, screams or oy-veys are welcome :-)

guss77 added 3 commits January 6, 2020 22:41
So each media can have different network interfaces
ports owned by another server are not bound locally, we just know what they are
Currently only missing endpoint addresses, but I need more
@guss77
Author

guss77 commented Jan 7, 2020

I was mistaken about the alternates not updating Redis - they do. So I'm trying to implement "don't ignore foreign updates, try to integrate them" for primaries. The issues I'm having are:
a. It seems that a lot of the changes in the alternate write-back are due to json_restore_call not generating a 100% faithful call structure. For example, my alternate logs show notifications such as Opus doesn't support a ptime of 32000 ms; using 60 ms instead during foreign call loads, when ptime is actually set to "0" in the original JSON write, and then in the writeback it is set to "20" (?). So I need to be very careful with what I choose to update.
b. It is not easy to update things: the call data management is done throughout the SDP rewrite logic, so without duplicating various pieces of data access code from throughout call.c, I doubt I can do all the needed updates, and even then - I'm missing a lot of know-how about call.c internals to make it safe. I'll concentrate on getting my use-case working, and we'll see from there.

@rfuchs
Member

rfuchs commented Jan 7, 2020

For the underlying RTP sockets, I see two distinct cases:

  1. No advertised address is used and change of endpoint address from one instance to another is handled by the load balancer through a changed network route (e.g. a host route). A naive approach would be to store the address into Redis and have other instances simply open the port on that address while ignoring their own interface configuration. This would require the net.ipv4.ip_nonlocal_bind sysctl to be enabled (there's an IPv6 equivalent too) as well as an external mechanism to bring the address up locally when needed.
  2. Advertised address is used and change of endpoint address from one instance to another is handled by the load balancer through a changed NAT destination. In this case, the port is bound to whatever interface address rtpengine is configured and it's the advertised address that's stored into Redis instead. Care must be taken that there is no overlap in address/port ranges, so this probably doesn't scale indefinitely.
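
For the first case, a quick way to check whether non-local binds are allowed on a Linux host is to read the sysctl through /proc. A small sketch; the `proc_root` parameter exists only so the function can be exercised against a fake directory:

```python
from pathlib import Path


def nonlocal_bind_enabled(family="ipv4", proc_root="/proc/sys/net"):
    """Return True/False for the ip_nonlocal_bind sysctl, or None if unreadable.

    Reads /proc/sys/net/<family>/ip_nonlocal_bind, the file behind
    net.ipv4.ip_nonlocal_bind (and its IPv6 counterpart) on Linux.
    """
    path = Path(proc_root) / family / "ip_nonlocal_bind"
    try:
        return path.read_text().strip() == "1"
    except OSError:
        return None
```

A `None` result means the file could not be read, for example on a non-Linux host, and should not be treated as "disabled".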

In both cases, the logical interface name should still be honoured, so that multiple interfaces can be supported, even though in the first case it will likely be of little consequence.

For the Redis part of the puzzle, I can't provide much insight since this sort of active/active failover is contributed code and we're not using rtpengine in this mode ourselves. I would assume that the distinction between "foreign" and owned call must remain so that standby instances don't act on these calls (e.g. doing timeouts or writing to Redis) by default. I would try to avoid sending signalling to instances that don't own the call if at all possible as that would complicate things considerably. Instead, receiving signalling (or RTP for that matter) should switch a call from "foreign" to "owned," giving that instance ownership of that call, presuming that whichever instance owned the call previously is not there any more. This switching of ownership could also happen explicitly instead, e.g. through some sort of command to rtpengine "take over all calls owned by instance X." This would require giving each instance a unique identifier and storing the owner ID for each call into Redis.
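
The two ownership-switch variants described above could be modelled like this. It is only a sketch: `CallTable` and the instance IDs are hypothetical, and it leaves out the part where the new owner ID is written back into Redis:

```python
class CallTable:
    """Hypothetical per-instance view of calls mirrored from Redis."""

    def __init__(self, instance_id):
        self.instance_id = instance_id
        self.owner_of = {}  # call_id -> owning instance ID

    def is_foreign(self, call_id):
        return self.owner_of[call_id] != self.instance_id

    def on_signalling(self, call_id):
        """Implicit takeover: receiving signalling makes the call ours."""
        self.owner_of[call_id] = self.instance_id

    def take_over(self, failed_instance_id):
        """Explicit takeover: claim every call owned by a failed instance."""
        claimed = [c for c, o in self.owner_of.items() if o == failed_instance_id]
        for call_id in claimed:
            self.owner_of[call_id] = self.instance_id
        return claimed


table = CallTable("node-b")
table.owner_of.update({"call-1": "node-a", "call-2": "node-a", "call-3": "node-c"})
claimed = table.take_over("node-a")
```

Either variant presumes the previous owner is really gone; with a stateless load balancer that presumption does not hold, which is the core difference from the approach in this PR.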

@guss77
Author

guss77 commented Jan 7, 2020

  1. No advertised address ... This would require the net.ipv4.ip_nonlocal_bind sysctl to be enabled

Very interesting - I wasn't aware of that capability.

I would try to avoid sending signalling to instances that don't own the call if at all possible as that would complicate things considerably.

Yea, I'm looking at that complication now :-/ . One option we were looking at is "just" writing a session-aware signaling load balancer, to make sure all signaling works well, but we concluded that writing a reliable load balancer for auto scaling is harder than utilizing the existing Redis active/active support. This assessment may have been incorrect.

This would require giving each instance a unique identifier and storing the owner ID for each call into Redis.

AFAIK, each instance already has a unique ID: it adds it to the SDPs it replies with. I think storing it in Redis would be a good idea, but I didn't find where it is stored in memory.

@rfuchs
Member

rfuchs commented Jan 7, 2020

AFAIK, each instance already has a unique ID: it adds it to the SDPs it replies with. I think storing it in Redis would be a good idea, but I didn't find where it is stored in memory.

That's static const str instance_id in sdp.c. It's generated randomly during startup. You can use this, but then it should be made somewhat more constant across restarts.

@guss77
Author

guss77 commented Jan 9, 2020

That's static const str instance_id in sdp.c. It's generated randomly during startup. You can use this, but then it should be made somewhat more constant across restarts.

Constant across restarts does not help with my "immutable replaceable instances" scenario, but I guess I can make this a hash of the configured interfaces' hardware IDs?
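
Something along these lines, as a sketch of the idea only. The function name, the choice of SHA-256 and the 16-character truncation are arbitrary assumptions, and `uuid.getnode()` is used just to get one local MAC address for the demo:

```python
import hashlib
import uuid


def instance_id_from_hardware(hardware_ids):
    """Derive a stable ID from interface hardware IDs (e.g. MAC addresses).

    Sorting makes the result independent of enumeration order.
    """
    digest = hashlib.sha256()
    for hw_id in sorted(h.lower() for h in hardware_ids):
        digest.update(hw_id.encode("ascii"))
        digest.update(b"\0")
    return digest.hexdigest()[:16]


# uuid.getnode() returns one MAC address as a 48-bit integer; it may fall back
# to a random value if no hardware address can be determined.
local_mac = f"{uuid.getnode():012x}"
print(instance_id_from_hardware([local_mac]))
```

Such an ID stays the same across process restarts on one machine and changes when the instance is replaced, which is the behavior I want for replaceable instances.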

@guss77 guss77 force-pushed the feature/zero-knowledge-scaling branch from 9aa77c5 to e767e59 Compare January 29, 2020 14:06
otherwise, lets try to avoid parsing "" as "0.0.0.0:0", as that doesnt make sense.
by having codec.c create the encoded format in the same way it will later parse it, then just use it in redis.c
we need better resolution to properly ignore high speed redis updates being sent on the shared channel (now that updates are sent both ways)
refactor the handling of the redis "set" event so that if we answer for a foreign call and send an update, we can also ignore our update. In the future we may be able to completely drop the json_restore_call() path and use the more flexible redis-json implementation for all reading
@guss77
Author

guss77 commented Feb 16, 2020

So the last push (and I'll fix the merge conflicts in a bit) works for me: in a cluster of rtpengine instances that share a Redis database, all servers keep an up-to-date copy of the call data structure for all calls, and when a non-owner of a call gets a command, it can answer it truthfully, and if any new data needs to be generated, it is propagated correctly back to the owner so the owner can handle RTP streams with full knowledge of what answers were provided by alternates.

There are still things that can be improved: even after 14abc07 there are some problems with payload descriptor encoding/decoding; I want to rebuild the JSON structure so it is hierarchical like the actual call structure and use numeric encoding for numbers instead of a set of string arrays; a couple of places where I had to copy code from call.c can be refactored into reusable functions.

But that being said, I believe this feature is ready to merge (after I fix the conflicts)

@rfuchs
Member

rfuchs commented Feb 24, 2020

Well this has grown into a rather large PR. Are you able to refactor/squash/rebase this into commits that are easier to review?

@guss77
Author

guss77 commented Mar 18, 2020

Sorry for the late reply - I was sick at home for a couple of weeks and then covid19 happened :-/

I'll rebase the commits into something more manageable in the next few days.

@guss77
Author

guss77 commented Mar 18, 2020

Also, @rfuchs , we encountered another edge case that breaks this work, that I would appreciate your feedback on:

Looking at call_interfaces.c:975 - this gets triggered if we get a new offer command for an existing call. From the comments it looks like this behavior is related to the Redis-based clustering code (the original multi-database hard-coded configuration mode) and is about some kind of recovery from a networking failure between the SIP entity and RTPEngine. In our use case this gets triggered when the SIP entity is handling a media re-INVITE - it starts a new media negotiation on the same call ID.

This is obviously not what the original code was intended to handle, but I've added some code in the last commit to have other servers honor an RTPEngine that "takes ownership" of a call like that.

My question is - shouldn't we do this always (why only in cluster mode)? If a new offer is received for an existing call, why not always destroy the old call and start a new call?

@rfuchs
Member

rfuchs commented Mar 18, 2020

Well, no, because rtpengine keeps call state intact across re-invites, and handles changes in the media descriptions gracefully (or at least attempts to). This includes things like local ports allocated, packet counters, protocols used, codecs, crypto info, etc.

I can't really comment on this particular code (nor anything else from the active/active mechanism) as it's been contributed by @smititelu and @inf265 and so I don't know why in this case the call is destroyed and re-created.

@guss77
Author

guss77 commented Mar 19, 2020

In that case, I believe the correct approach would be to remove the code that has special handling of "offer for a known foreign call" added by @lbalaceanu - with the code in this PR, the reasoning in the comment is no longer needed: the "fallback server" will just return the same reply the original server sent and there is no loss of data and no need to restart anything.

I'm going to do that and run some tests before pushing the change here.
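
The behavior I'm after for a retried offer is roughly the following. This is a deliberately simplified sketch: in the PR the alternate rebuilds the reply from the shared call structure, while here a hypothetical `OfferCache` just remembers the reply keyed by call ID, from-tag and offer body:

```python
class OfferCache:
    """Hypothetical cache of replies, shared between nodes via Redis."""

    def __init__(self):
        self.replies = {}  # (call_id, from_tag, sdp) -> reply

    def handle_offer(self, call_id, from_tag, sdp, build_reply):
        key = (call_id, from_tag, sdp)
        if key not in self.replies:
            # First time this exact offer is seen: generate and remember it.
            self.replies[key] = build_reply(sdp)
        # A retried offer gets the original reply; nothing is restarted.
        return self.replies[key]
```

A re-INVITE with a changed SDP produces a different key and therefore goes through normal processing instead of being treated as a retry.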

guss77 added 2 commits March 22, 2020 12:07
…n existing call id"

This reverts commit 9b9a165. After removing "call restart in case of repeat offer", this is no longer needed.
Reverts commit e043670. If we get a second offer, and it is a retry, new redis data allows us to reply with the original server reply, so no need to restart the session.
@lbalaceanu
Contributor

This is obviously not what the original code was intended to handle, but I've added some code in the last commit to have other servers honor an RTPEngine that "takes ownership" of a call like that.

Hi @guss77 , @rfuchs ,

I haven't been able to browse through the actual commits proposed by this pull request, but as far as I remember the original code you mention is related to this scenario: the primary rtpengine (owner) receives an offer from a SIP proxy, but its response doesn't reach this proxy in time. The proxy then sends the same offer to another rtpengine (alternate). Obviously there must be a way to reconcile the 2 rtpengines.

The code in the original Redis fixed keyspace setup might seem at times cumbersome, but this is because over time it got tested and repaired and obviously different corner cases have been addressed. I consider that this progress should be kept and new features come as extensions or additions to working code.

Thank you

@guss77
Author

guss77 commented Jul 1, 2020

The code in the original Redis fixed keyspace setup might seem at times cumbersome, but this is because over time it got tested and repaired and obviously different corner cases have been addressed. I consider that this progress should be kept and new features come as extensions or additions to working code.

Hi @lbalaceanu - sorry it took me so long to respond. I've been dealing with other things that took attention away from this project.

Regarding your comments, to the best of my knowledge and ability, all the new behavior I introduced up till now isn't supposed to change the original behavior - though this is based only on my reading of the code and guesses as to the intent of the original author, as the system architecture for which the "Redis fixed keyspace setup" was built is not documented anywhere and I have no way to actually test that the original behavior is intact.

I would love to be able to reproduce such a setup to allow me to test such assumptions. Are you the original author and/or do you have a relevant setup that we can discuss?

3 participants