Syncd Comparison Logic
This document briefly describes the basic concept of the SONiC syncd comparison logic, a compact code flow, and other details needed to understand the high-level logic.
Syncd is the main process in the SONiC infrastructure that talks directly to the ASIC. It is compiled against the vendor SAI library and forwards SAI commands from the orchestration agent (via the sairedis library) to the vendor SAI library.
Syncd also introduces the concepts of Real ID (RID) and Virtual ID (VID): a RID is the SAI object ID returned by the vendor SAI library, and a VID is the SAI object ID used by Syncd and the orchestration agent. Both RIDs and VIDs are unique, and they map one to one. The mappings are stored in REDIS database number 1 under the “VIDTORID” and “RIDTOVID” entries. When a new object ID is created by SAI, Syncd creates a new virtual object ID for it. Similarly, when a RID object is removed, the corresponding VID is also removed.
Syncd operates on both RID and VID values; the orchestration agent is only aware of VID values.
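The one-to-one VID/RID relationship can be pictured as a pair of dictionaries that are always updated together. The following is only a minimal sketch: the class name `VidRidMap`, its methods, and the counter-based VID allocation are hypothetical and are not taken from the sairedis or syncd sources.

```python
class VidRidMap:
    """Hypothetical in-memory stand-in for the VIDTORID / RIDTOVID entries."""

    def __init__(self):
        self.vid_to_rid = {}
        self.rid_to_vid = {}
        self._next_vid = 1  # assumption: VIDs come from a simple counter

    def on_create(self, rid):
        """Allocate a new VID for a RID returned by the vendor SAI."""
        if rid in self.rid_to_vid:
            raise ValueError(f"RID {rid:#x} is already mapped")
        vid = self._next_vid
        self._next_vid += 1
        # Both directions are written together so the mapping stays 1 to 1.
        self.vid_to_rid[vid] = rid
        self.rid_to_vid[rid] = vid
        return vid

    def on_remove(self, rid):
        """Drop both directions of the mapping when the RID object is removed."""
        vid = self.rid_to_vid.pop(rid)
        del self.vid_to_rid[vid]
        return vid


mapping = VidRidMap()
vid_a = mapping.on_create(0x1000)
vid_b = mapping.on_create(0x2000)
mapping.on_remove(0x1000)
```

After the removal only the second object remains in both dictionaries, which mirrors the rule that removing a RID object also removes its VID.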
Syncd and the orchestration agent (orchagent) can be restarted independently. When Syncd is restarted, RID values may change, so Syncd reads the REDIS database and recreates/configures the SAI objects. After executing all needed functions, new VID to RID maps are created (old VIDs are not changed).
Similarly, when the orchestration agent restarts, it should create a new ASIC view and call sairedis APIs to configure the ASIC. Internally, sairedis forwards all API calls to Syncd using the REDIS database as a proxy. Each of the quad API calls (create/remove/set/get) also modifies the ASIC SAI view, which is stored in REDIS database number 1. The ASIC view contains only VIDs. RID values are stored only in the VIDTORID and RIDTOVID maps.
Because we want to keep switch data plane interruption to a minimum, sairedis introduces the concepts of a current ASIC view and a temporary ASIC view. The general idea is as follows: when the orchestration agent restarts, it puts sairedis in INIT_VIEW mode (using a custom attribute), which tells sairedis and syncd to perform all ASIC operations only in memory. Each create/set API call in INIT_VIEW mode is recorded only in memory, as part of the temporary ASIC view; no actual objects are created, except for the switch object. The GET API works as expected: it is forwarded to the actual ASIC and returns the ASIC’s responses. GET usage is kept to a minimum, and it will not work on newly created objects.
After creating the temporary view, the orchestration agent tells sairedis to APPLY_VIEW. This triggers syncd to execute the comparison logic on the current view and the temporary view. The result of the comparison logic is an ordered set of SAI API calls that are executed to move the ASIC state from the current view to the temporary view. After that, the current ASIC view is removed from the REDIS database, and the temporary view becomes the new current view. New VID/RID maps are also generated. The comparison logic tries to make minimal changes to the current ASIC state, to keep data plane interruption to a minimum. If the configuration didn’t change, it is also possible that no actual changes to the ASIC are required.
INIT_VIEW and APPLY_VIEW commands are synchronous, and they block until they get a valid response from syncd.
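The INIT_VIEW / APPLY_VIEW flow described above can be modelled roughly as follows. This is a toy sketch and not the sairedis API: `ToySyncd` and its methods are hypothetical, and `apply_view` uses a naive comparison that matches objects by identical keys, whereas the real comparison logic has to match objects whose VIDs differ between the two views.

```python
class ToySyncd:
    """Hypothetical model of the current/temporary view modes."""

    def __init__(self):
        self.current_view = {}      # object key -> attributes, mirrors the ASIC
        self.temporary_view = None  # exists only between INIT_VIEW and APPLY_VIEW
        self.asic_calls = []        # calls that would reach the vendor SAI

    def init_view(self):
        self.temporary_view = {}

    def create(self, key, attrs):
        if self.temporary_view is not None:
            # INIT_VIEW mode: record in memory only, nothing reaches the ASIC.
            self.temporary_view[key] = dict(attrs)
        else:
            self.current_view[key] = dict(attrs)
            self.asic_calls.append(("create", key))

    def apply_view(self):
        if self.temporary_view is None:
            raise RuntimeError("APPLY_VIEW requires a prior INIT_VIEW")
        cur, tmp = self.current_view, self.temporary_view
        # Naive stand-in for the comparison logic: compare by key only.
        ops = [("remove", k) for k in cur if k not in tmp]
        ops += [("set", k) for k in tmp if k in cur and cur[k] != tmp[k]]
        ops += [("create", k) for k in tmp if k not in cur]
        self.asic_calls.extend(ops)
        # The temporary view becomes the new current view.
        self.current_view, self.temporary_view = tmp, None
        return ops


syncd = ToySyncd()
syncd.create("vlan10", {"vlan_id": 10})  # normal mode: goes to the ASIC
syncd.create("vlan20", {"vlan_id": 20})
syncd.init_view()                        # orchagent restarted
syncd.create("vlan10", {"vlan_id": 10})  # recorded in the temporary view only
syncd.create("vlan30", {"vlan_id": 30})
calls_before_apply = len(syncd.asic_calls)
ops = syncd.apply_view()
```

In this run only the two normal-mode creates reach the ASIC before `apply_view`, and the apply step produces one remove (`vlan20`) and one create (`vlan30`) while leaving the unchanged `vlan10` alone.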
Because the SAI design uses the concept of objects (with attributes) for configuring the switch, and each object uses the same set of quad APIs (create/remove/set/get), we can generate a dependency graph and use it to help the comparison logic find out which objects were changed or removed. The SAI headers contain a metadata description of each object, and an automated system generates ANSI C metadata sources from them. This metadata is used in the comparison logic.
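To illustrate how a dependency graph built from OID attributes can be used, the sketch below computes reference counts and derives an order in which objects can be safely removed (an object can only be removed once nothing references it). The function names and the simplified view format (object name mapped to the list of objects its OID attributes point at) are assumptions made for this example, not structures from syncd.

```python
def reference_counts(objects):
    """Count how many times each object is referenced by OID attributes.

    objects: dict mapping an object id to the list of object ids it references.
    """
    counts = {oid: 0 for oid in objects}
    for refs in objects.values():
        for target in refs:
            counts[target] += 1
    return counts


def removal_order(objects):
    """Repeatedly remove objects whose reference count is zero."""
    remaining = {oid: list(refs) for oid, refs in objects.items()}
    order = []
    while remaining:
        counts = reference_counts(remaining)
        free = sorted(oid for oid, n in counts.items() if n == 0)
        if not free:
            # Cannot happen for a graph without loops.
            raise ValueError("dependency loop detected")
        for oid in free:
            order.append(oid)
            del remaining[oid]
    return order


# Hypothetical view: a route uses a next hop, which uses a router interface.
view = {
    "port": [],
    "vr": [],
    "rif": ["port", "vr"],
    "nh": ["rif"],
    "route": ["nh", "vr"],
}
counts = reference_counts(view)
order = removal_order(view)
```

Here the route is the only object nobody references, so it is removed first, followed by the next hop, the router interface, and finally the port and virtual router.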
The general idea of the comparison logic is very simple: take the dependency graph of the current ASIC view, take the dependency graph of the temporary ASIC view, compare them, and generate SAI commands to move from the current view to the temporary view.
The implementation of the comparison logic is not trivial, and there are some corner cases that need to be handled. A very detailed description of the comparison logic is provided in the syncd_applyview.cpp source file.
High-level description of the comparison logic: when syncd receives an APPLY_VIEW notification from the orchestration agent (via sairedis), the following steps take place:
- RID/VID mappings are read from REDIS.
- The current and temporary views are read from REDIS, and two instances of the AsicView class are created, containing all objects present in those views. An object in an AsicView can be in one of 4 states: not processed, matched, removed, or final. After the current and temporary views are read, all objects are in the not processed state.
- The match OIDs operation is performed. Some objects may exist in both the temporary and current views without having changed during the orchestration agent restart (for example, port OIDs obtained when doing a GET of the switch port list). The VIDs of those objects are the same in the temporary view and the current view, which implies that their RID values are also the same. The state of those objects is moved to “matched”, and they are a good starting point for the comparison logic.
- The temporary view is populated with existing objects. Since the orchestration agent may not query for some existing objects after a restart (default trap group, full port list, etc.), we need to transfer those existing objects to the temporary view. Imagine this scenario: after switch create, default VLAN 1 exists on the switch, but the orchestration agent didn’t query for the default VLAN ID. The comparison logic should not remove this VLAN during the transition, so we transfer this VLAN object to the temporary view. Only objects that were not explicitly removed are moved to the temporary view.
- The apply view transition is performed (main logic):
  - The switch object is checked. The current implementation supports only 2 scenarios: either no switch is present in either view, or exactly 1 switch object exists in each view.
  - Port objects are checked. When performing init/apply view, syncd expects that no changes were made to SAI PORT objects, so all ports have the same RID value in the current view and the temporary view. This logic expects that no ports were removed or created while the temporary view was being built. There is room for improvement here.
  - Process object for view transition is performed on:
    - Non-route_entry objects
    - Default route_entry objects
    - Non-default route_entry objects
    Since some vendor SAI implementations have limitations on which routes must be created first (default routes), we make sure that default routes are processed before any other routes.
  - Process object for view transition works as follows:
    - The attributes of the temporary object are processed, and if any of them is an OID attribute, process object for view transition is executed recursively on that OID. The reason is that when a single object (in the not processed state) is processed, all of its attributes are checked, and every OID attribute that contains a valid OID value must be processed first and must end up in either the final or the matched state, so that the current object only operates on previously processed objects. Since we know that the ASIC view is a graph without loops (SAI metadata sanity check), this recursion always ends at some object that doesn’t contain OID attributes. If the processed object is a non-object-id object (like route_entry, fdb_entry, etc.), then in addition to the attributes, all OIDs in the XXX_entry structure are also checked to make sure they were processed first.
    - The logic tries to find the current best match object for the temporary object being processed. Based on the object type, the object type count in both views, and the attribute values, it tries to find the best match for the temporary object. There may also be no best match. For example:
      - If the temporary object is a next hop, and there is only 1 next hop object in the current view, the current best match is that object.
      - If there is no next hop object in the current view, there is no current best match.
      - If there is more than 1 next hop object in the current view, the logic looks at the attributes of each candidate object, for example:
        - If the temporary object and the candidate object have the same attribute but with a different value (for a create-only attribute), the candidate is not taken.
        - If both objects have the same attribute value, the candidate is added to the list of candidates. At the end only 1 candidate is returned: the one with the most attribute values matching the temporary object being processed.
        - If more than 1 candidate object has the same number of matching attributes, the first object with the same reference count as the temporary object is used.
        - If none of the previous conditions is met, a random candidate object is selected.
    - If no candidate object was found, we create a new object with the attributes of the temporary object being processed, and a new “create” action is added to the ASIC execute action list.
    - A dry run of the object set transition is performed if a candidate object was found. Since we have a candidate object and all CREATE_ONLY attributes are the same, either the candidate object has the same attribute values as the temporary object, or the differing attributes are CREATE_AND_SET and can be updated from the current state to the temporary state. This is a dry run, so no ASIC execute actions are generated, since this logic may fail in the middle if for some reason we can’t bring the current object to the temporary object’s state:
      - For each attribute of the temporary object, we check whether the candidate object contains this attribute and whether the value is the same, or the value is different and can be updated.
      - If the attribute is not found on the candidate object, we check whether the attribute is CREATE_AND_SET; if it is, the object can be updated with this attribute.
      - All attributes on the candidate object that are not present on the temporary object are checked (the candidate object has more attributes than the temporary object), and the logic tries to bring those attributes to their default values, if the attribute has a default value. All enumeration and integer default values are supported, and some OID default values are supported. There is room for improvement here.
    - If the set transition failed (for example, the candidate object could not be updated to match the temporary object for some reason), the current object is removed and a new object is created based on the temporary object’s attributes.
    - The object transition is performed, which generates SET actions on the ASIC execute actions list.
    - The state of both the temporary object and the candidate object is moved to FINAL.
  - All not processed objects are removed. Since the order of removal matters, we can’t remove objects whose reference count is greater than zero. So first all objects with a reference count of zero are removed, and the process repeats until no objects with a zero reference count exist.
- Object status is checked for the temporary view and the current view. All objects must be in the FINAL state. If some objects were not processed, this indicates that some logic is missing.
- The number of OIDs and the total number of objects in the current view and the temporary view are checked; they must be the same after the transition.
- The list of SAI operations previously generated by the apply view transition is executed on the real ASIC to transition the current view to the temporary view. The list contains only 3 types of operations: create, remove, or set. If a new object is created, its RID is mapped to the existing VID that was created in INIT_VIEW mode. If an object is removed, its RID and VID mapping is removed.
- The Redis database is updated. The current and temporary views are removed from the REDIS database, and the temporary view from memory is written to REDIS as the current view. The RID/VID mappings are updated in the REDIS database.
All of these steps are synchronous, and they block until all data is processed.
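The best-match selection rules listed in the steps above can be sketched as a single function. This is a simplified illustration, not the code from syncd_applyview.cpp: the function name `find_best_match`, the dictionary layout of objects, and the `create_only` set are hypothetical, and where the description says a random candidate is selected this sketch takes the first one so the result is reproducible.

```python
def find_best_match(temp_obj, candidates, create_only):
    """Pick the current-view object that best matches a temporary object.

    temp_obj / candidates: dicts with "attrs" (name -> value) and "refcount".
    create_only: set of attribute names that cannot be changed after create.
    Returns the chosen candidate, or None when there is no best match.
    """
    if not candidates:
        return None
    if len(candidates) == 1:
        # Only one object of this type in the current view: take it.
        return candidates[0]

    scored = []
    for cand in candidates:
        # A create-only attribute with a different value rules the candidate out.
        conflict = any(
            name in cand["attrs"] and cand["attrs"][name] != value
            for name, value in temp_obj["attrs"].items()
            if name in create_only
        )
        if conflict:
            continue
        matching = sum(
            1
            for name, value in temp_obj["attrs"].items()
            if name in cand["attrs"] and cand["attrs"][name] == value
        )
        scored.append((matching, cand))

    if not scored:
        return None
    best = max(score for score, _ in scored)
    top = [cand for score, cand in scored if score == best]
    if len(top) == 1:
        return top[0]
    # Tie: prefer the first candidate with the same reference count.
    for cand in top:
        if cand["refcount"] == temp_obj["refcount"]:
            return cand
    # Assumption: deterministic stand-in for "random candidate".
    return top[0]


temp_nh = {"attrs": {"TYPE": "IP", "IP": "10.0.0.1", "RIF": "rif1"}, "refcount": 2}
nh_a = {"attrs": {"TYPE": "IP", "IP": "10.0.0.2", "RIF": "rif1"}, "refcount": 2}
nh_b = {"attrs": {"TYPE": "IP", "IP": "10.0.0.1", "RIF": "rif1"}, "refcount": 1}
chosen = find_best_match(temp_nh, [nh_a, nh_b], {"TYPE", "IP", "RIF"})
```

In this example `nh_a` is rejected because its create-only `IP` attribute differs, so `nh_b` is chosen even though its reference count does not match.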
What is the behavior of sai_bridge_api->create_bridge_port during INIT_VIEW and APPLY_VIEW? Can I get the same m_bridge_port_id as before warm restart? If not, how can I match restored FDB to bridge port?
All create APIs except create_switch operate only in memory; no actual object is created in init_view mode. I’m not sure why you want to match restored FDBs, since when you send apply_view, the comparison logic will do its best to match existing FDBs and bridge ports. As for the bridge port ID, you will get a new VID for it, and you should not rely on getting the same VID each time. If orchagent startup is deterministic, you will probably get the same VID for each created object, but this is not guaranteed.
For FDB learned from ASIC, I can store and restore in orchagent. During the restore, should I call sai_fdb_api->create_fdb_entry to make it part of comparison logic input data?
Yes, if you learned those FDB entries from SAI notifications, you should call create_fdb_entry explicitly, since after a restart or in init view mode, syncd has no knowledge of any FDB events.
“When Syncd is restarted, RID values may change, so Syncd reads REDIS database, and recreates/configures SAI objects. After executing all needed functions, new VID to RID maps are recreated (old VID are not changed).” I thought that when Syncd is restarted, the RID values (originally returned by the SDK) are restored by SAI/SDK (for warm reboot) and do not change. Can someone clarify why RID values "may" change?
RID (the real ID received from the vendor SAI) may change. This means that syncd assumes those values may be different after a restart, for example if you create a next hop, etc.
Consider this scenario: you create 4 objects with simplified IDs 1 2 3 4, and later you delete object 3, so now you have 1 2 4. After a syncd restart and recreation of the objects you may get 1 2 3 for the same objects. Of course you could also get 5 6 7; this doesn’t matter to syncd. The only values we assume will stay the same are port IDs.
It is a good assumption that RID values may change, since a vendor may change something from one SAI version to the next (e.g. add or remove some objects), and we don’t want to deal with the resulting corner cases. In addition, the SAI specification doesn’t say anything about the schema of OID values or about keeping them the same across restarts.
Since it’s up to the vendor, it could even be implemented using rand() plus a uniqueness check, but I think most vendors will try to keep RID values the same across restarts on the same platform, since a deterministic model makes it easier to track down bugs. Some vendors also encode extra information inside the OID itself; for example, a VLAN ID could be placed in the lower or higher bits of the OID, since it takes only 12 bits. This is totally up to the vendor.
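The 1 2 3 4 scenario from this answer can be replayed with a toy allocator. This is only an illustration under stated assumptions: `ToyVendorSai` is a hypothetical vendor library that hands out sequential RIDs, which is just one of many possible allocation schemes, and the VID names are made up.

```python
import itertools


class ToyVendorSai:
    """Hypothetical vendor SAI that hands out RIDs 1, 2, 3, ... after each start."""

    def __init__(self):
        self._ids = itertools.count(1)

    def create(self):
        return next(self._ids)


def recreate(vids, sai):
    """Rebuild the VID -> RID map by recreating every object after a cold restart."""
    return {vid: sai.create() for vid in vids}


sai = ToyVendorSai()
vid_to_rid = {vid: sai.create() for vid in ["vidA", "vidB", "vidC", "vidD"]}
del vid_to_rid["vidC"]            # object 3 is deleted: RIDs in use are 1 2 4
before = dict(vid_to_rid)

# Cold syncd restart: a fresh vendor SAI instance allocates RIDs from scratch.
after = recreate(vid_to_rid, ToyVendorSai())
```

The VIDs that orchagent holds are identical before and after the restart, while the RID behind `vidD` changes from 4 to 3, which is why only the VID/RID map needs to be regenerated.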
“When orchestration agent restarts, it should create new ASIC view, and call sairedis apis to configure ASIC.” My understanding was that the AsicDB would retain the configuration, as an orchagent restart should not affect the Redis DB. Why would any reconfiguration of the DB be required?
When the orchestration agent restarts, AppDB should stay constant, but AsicDB can change. For example, if an orchagent update fixed a bug, or introduced an optimization that creates fewer next hop groups than the previous version, then compilation from AppDB (REDIS DB number 0) to AsicDB (REDIS DB number 1) will result in a different AsicView than before the restart. Once orchagent has compiled the new AsicView, the syncd comparison logic kicks in to move the old AsicView to the new AsicView, while AppDB remains unchanged.
The way we modeled vendor SAI, the RID is the index to the rest of the SAI+SDK data for us, so the RID is an essential part of the syncd state. If state is preserved upon warm reboot and there is no replay of configuration from the DB, I don't see a case where the RID would change.
Yes, it is correct that during warm restart RID values should not change; if they change, this will lead to a syncd crash/stop. As for RID values changing during a cold syncd restart, our code is written so that RID values can change, except for PORT values, which are currently used as the anchor for all other operations that bring the state back after a cold start.
My understanding is that with planned warm restart/reboot, the role played by the view comparison logic is for mapping the new Virtual OID allocated by orchagent to real OID from libsai/SDK. Usually no set/del operation should be generated against libsai upon applyView request.
Upon apply view, set operations can be generated, and del operations can be generated as well, such as “remove vlan members”.
Why wasn’t view comparison logic put in orchagent? VIDs are consumed by orchagent only. Having view comparison logic in orchagent has a few benefits:
- Simplify the interface between orchagent and syncd.
- Make syncd a shim layer for integrating libsai/SDK only, and stabilize the whole syncd logic.
- Make orchagent unplanned warm restart possible, orchagent doesn’t need to worry about interim virtual OID transition that happens in syncd anymore.
We could put the comparison logic in orchagent, but it would cause a lot of trouble:
- Orchagent would need to be aware of both VID and RID, which syncd already handles.
- Right now (by design) orchagent is asynchronous for SET/CREATE/REMOVE operations. When an object is created, sairedis generates a new VID; at this point syncd may not be running, or may be restarting, so the RID for that object is not known at creation time (we assume every create succeeds). Because this is asynchronous, we can combine multiple operations into a bulk request to reduce the number of interactions between orchagent, redisdb and syncd. If each create call had to be synchronous to retrieve its RID, that would not be possible. Of course, the GET API is synchronous, but its use should be kept to a minimum.
- By design we want to restart orchagent and syncd independently. If orchagent kept RID values, then after a syncd restart those RID values could change or be out of date. Since orchagent uses only VIDs, there is no problem: after a restart, syncd just recreates the VID/RID mapping and orchagent doesn’t even notice any change.
Responding to each of the proposed benefits in turn:
- Even if we put the CL (comparison logic) in orchagent, we would still need to transfer calls to syncd, so the interface would stay the same.
- This is not so simple: after a syncd restart we need to recreate the ASIC state, and orchagent at that point doesn’t even know that syncd restarted, so it would need to keep track of that and basically take over the syncd restart logic (hard reinit). What do you mean by stabilizing the logic? Is it not stable at this point?
- I think Guohan said that we don’t plan to support unplanned warm restart. We do support unplanned orchagent/syncd cold restart, which is why the comparison logic is in place, and warm restart is always a user-planned action, so I’m not sure where an unplanned warm restart scenario would be used. Also, at this point orchagent doesn’t need to worry about RID values after a restart: only one set of VID values is required, and orchagent interacts with syncd via sairedis as if it were a normal SAI, except for the 2 actions “init view” and “apply view”. In other words, it is possible to compile orchagent against any vendor SAI and it would work without syncd.
How does the view comparison logic deal with the duplication of leaf objects (objects that don’t contain OID attributes)? Example objects are: ACL counter, LAG, tunnel map, overlay and underlay router interface, buffer profile, policer, next hop group. Note that in the current SONiC implementation these objects are created with the same non-OID attributes. How does the view comparison logic distinguish two or more objects created at the same or different places? There might be more objects that fall into this category. The object dependency graph is a DAG, but how do you prevent multiple instances of the same object from being confused with each other?
For leaf *_entry objects like route_entry and fdb_entry, this is quite simple: since those are structs, we compare each of the specific fields. For example, for fdb_entry there would probably be only 1 entry with a specific MAC address, and for route_entry we can distinguish by prefix and vr_id (take a look at the findCurrentBestMatchForRouteEntry function). There will be no duplicated route_entry or fdb_entry, since a duplicate would point to the same object. For non-*_entry objects like vlan_member, lag_member, acl_entry, etc., we match the attributes of the old and new objects; if all attributes are the same, we compare the reference counts in each view, and if those are the same too, we make a guess. Something similar happens for root objects like wred, scheduler, virtual router, etc.
For the configurations we tested so far, we only had to guess for buffer_profile and buffer_pool. Of course there is room left in the code for each object type to be handled separately to improve CL, but currently everything runs on generic logic.
Yes, the graph is a DAG, and this is enforced by SAI/meta/saisanitycheck.c. If you “make” the meta directory in SAI, a saidepgraph.svg file is generated (if you have the ‘dot’ tool installed), and you can actually see what this graph looks like 😊 (attached). It is generated automatically from SAI metadata.
For a more detailed description of what is happening there, check out the source code at https://github.com/Azure/sonic-sairedis/blob/master/syncd/syncd_applyview.cpp#L2185 and https://github.com/Azure/sonic-sairedis/blob/master/syncd/syncd_applyview.cpp#L2235
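As a rough illustration of why *_entry objects are easy to match, the sketch below looks up a temporary-view route in the current view using its struct fields. It is not the findCurrentBestMatchForRouteEntry implementation: the function name `find_route_match`, the tuple representation of a route entry, and the `matched_vids` table that translates an already processed temporary-view virtual router VID to its current-view VID are all assumptions made for this example.

```python
def find_route_match(temp_route, current_routes, matched_vids):
    """Find the current-view route entry that corresponds to a temporary one.

    temp_route: (vr_vid, prefix) tuple taken from the temporary view.
    current_routes: set of (vr_vid, prefix) tuples present in the current view.
    matched_vids: temporary-view VID -> current-view VID for processed objects.
    """
    vr_vid, prefix = temp_route
    current_vr = matched_vids.get(vr_vid)
    if current_vr is None:
        # The virtual router has no counterpart, so the route cannot match.
        return None
    key = (current_vr, prefix)
    # The struct fields identify the entry uniquely, so no guessing is needed.
    return key if key in current_routes else None


current = {("vr_cur", "0.0.0.0/0"), ("vr_cur", "10.0.0.0/24")}
matched = {"vr_tmp": "vr_cur"}

hit = find_route_match(("vr_tmp", "10.0.0.0/24"), current, matched)
miss = find_route_match(("vr_tmp", "192.168.0.0/16"), current, matched)
```

The first lookup finds the existing entry, while the second returns `None`, which in the flow described earlier would lead to a new “create” action for that route.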
Why is Virtual OID needed at all?
- I assume virtual OID was designed to support warm reboot. With today’s implementation, all applications and orchagent will perform the state restore and reconciliation. What is missing is just the restore of OID for objects created from orchagent, which could be easily restored via saving the mapping between attributes and real object OID.
- With real OID, all the metadata validation may be done as well. Asynchronous object create might be a reason, but only “create/remove" operations for objects of object_id type would need to be synchronous; “set" of object-id objects and operations for all other objects like route and neighbor may stay asynchronous. Anyway, in the long term we want to have an end-to-end feedback path, which requires synchronous calls to the ASIC.
- Without virtual OID, a whole bunch of redundant logic processing could be eliminated. But I believe you must have some concrete reasoning behind the design of virtual OID; could you share more information on that?
Virtual OID is needed because by design we allow syncd and orchagent to be restarted independently. Imagine this scenario: you create 2 objects and keep their VIDs, then you restart syncd, and syncd recreates those 2 objects. Then you remove 1 object and restart syncd again; syncd recreates the state, and orchagent keeps 1 object with the same VID, but in syncd this object can have a different RID than before the restart. This is needed since we plan to update syncd and orchagent independently when we want to upgrade or fix an issue.
- No, virtual OID was not designed to support warm reboot; on warm restart no RID should change at all. The reason for VID is as described above. If we had only 1 process, with syncd and orchagent combined, then no VID would be needed at all.
- Guohan was talking about this: we want to have feedback on each API, and this is independent of the VID/RID map. Even with synchronous APIs we would (probably) still want to restart syncd independently and would still need to recreate state; otherwise we would have to move all that logic from syncd to orchagent, notify orchagent that syncd was restarted and that state recreation is required, and create a new set of RID values after recreating the state (for cold start).
- Yes, this could be avoided if orchagent and syncd were combined into 1 process; the reasons are described in answers 1 and 3.
PS. Some time ago I had the idea of making SAI and orchagent modules of syncd, so that we would have only 1 process combining syncd and orchagent. When updating, we would only need to unload the old module (a *.so) and load the new one, and that’s it. Since everything would be in the same address space, there would be no need for all the proxy/sairedis machinery or for RID/VID values, but there was a question about future compatibility of the interface.