Scalability of path discovery #42

Merged
merged 19 commits into from
Jul 8, 2024
64 changes: 64 additions & 0 deletions draft-dekater-scion-controlplane.md
@@ -127,6 +127,18 @@ informative:
  RFC6996:
  RFC9217:
  RFC9473:
  BollRio-2000:
    title: The diameter of a scale-free random graph
    target: https://kam.mff.cuni.cz/~ksemweb/clanky/BollobasR-scale_free_random.pdf
    author:
      -
        ins: B. Bollobás
        name: Béla Bollobás
      -
        ins: O. Riordan
        name: Oliver Riordan



--- abstract

@@ -1297,6 +1309,58 @@ In comparison to these time scales, clock offsets in the order of minutes are im
Each administrator of a SCION control service is responsible for maintaining sufficient clock accuracy. No particular method is assumed by this specification.


## Path Discovery Time and Scalability

The path discovery mechanism balances the number of discovered paths and the time it takes to discover them against the resource overhead of the discovery.

The resource costs for path discovery are communication overhead, processing and storage. Communication is transmitting the PCBs and occasionally obtaining the required PKI material. Processing cost is validating the signatures of the AS entries, signing new AS entries, and, to a lesser extent, evaluating the beaconing policies. Storage is both the temporary storage of PCBs before the next propagation interval, and the storage of complete discovered path segments.
All of these depend on the number and length of the discovered path segments, that is, on the total number of AS entries of the discovered path segments.

Interesting metrics for scalability and speed of path discovery are the time until all discoverable path segments have been discovered after a "cold boot", and the time until a new link is usable.
Generally, the time until a specific PCB is built depends on its length and the propagation interval.
At each AS, the PCB will be processed and propagated at the subsequent propagation event. As propagation events are not synchronized between different ASes, a PCB arrives at a random point in time during the interval and is buffered before potentially being propagated.
With a propagation interval T, the mean time until the PCB is propagated in one AS therefore is T / 2 and the mean total time for the propagation steps of a PCB of length L is L * T / 2 (with a variance of L * T^2 / 12).
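
The mean and variance above can be checked with a small Monte Carlo sketch. It assumes, as the text does, that the wait at each AS is independent and uniformly distributed over one propagation interval; the function names, sample count and seed are illustrative and not part of the specification.

```python
import random
import statistics


def simulate_pcb_delay(length, interval, rng):
    """Total buffering delay of one PCB that crosses `length` ASes.

    Assumption: propagation events are not synchronized between ASes,
    so the wait at each AS is uniform on [0, interval).
    """
    return sum(rng.uniform(0.0, interval) for _ in range(length))


def estimate_delay_stats(length, interval, samples=100_000, seed=1):
    """Return (mean, variance) of the simulated total delay."""
    rng = random.Random(seed)
    delays = [simulate_pcb_delay(length, interval, rng) for _ in range(samples)]
    return statistics.fmean(delays), statistics.pvariance(delays)


if __name__ == "__main__":
    L, T = 10, 5.0  # PCB length and propagation interval in seconds
    mean, var = estimate_delay_stats(L, T)
    print(f"simulated mean     {mean:.2f} s   closed form {L * T / 2:.2f} s")
    print(f"simulated variance {var:.2f} s^2 closed form {L * T**2 / 12:.2f} s^2")
```

The simulated values should land close to L * T / 2 and L * T^2 / 12, since each uniform wait contributes a mean of T / 2 and a variance of T^2 / 12.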

Note that link removal is not part of path discovery in SCION. For scheduled removal of links, operators let path segments expire. On link failures, end points route around the failed link by switching to different paths in the data plane.

For more specific observations, we distinguish between intra- and inter-ISD beaconing.
As will become apparent, the inter-ISD beaconing results in excessive overhead with very large numbers of participating core ASes. The ideal topology for SCION is to keep the inter-ISD core network to a moderate size, to benefit from the divide-and-conquer partitioning of ASes into ISDs and the efficiency of the intra-ISD beaconing.

### Intra-ISD Beaconing
In the intra-ISD beaconing, PCBs are propagated top-down, along parent-child links, from core to leaf ASes. Each AS discovers path segments from itself to the core ASes of its ISD.

Typically, this directed, acyclic graph is narrow at the top, widens towards the leaves, and is relatively shallow; intermediate provider ASes have a large number of children, while they only have a small number of parents. The chain of intermediate providers from a leaf AS to a core AS is typically not long (e.g. local, regional, national provider, then core).

Each AS potentially receives PCBs for all down paths between the core and itself. While the number of distinct provider chains to the core is typically moderate, the multiplicity of links between provider ASes has a multiplicative effect on the number of PCBs. Once this number grows above the limit value of 50, ASes trim the set of PCBs propagated. As the choice is among different ways to transit the local AS, operators are well equipped to choose among this set of PCBs.
Ultimately, the number of PCBs received by an AS per propagation interval remains bounded by 50 for each parent link of an AS, and at most 50 PCBs per child link are propagated. The length of these PCBs, and thus the number of AS entries to be processed and stored, is expected to be moderate and not grow considerably with network size. The total resource overhead for beacon propagation is easily manageable even for highly connected ASes.

To illustrate this with some numbers, an AS with a rather large number of 100 parent links receives at most 5000 PCBs during a propagation interval. Assuming a generous average length of 10 AS entries for these PCBs, this corresponds to 50000 AS entries. Due to the variable-length fields in AS entries, the sizes for storage and transmission cannot be predicted exactly; we assume an average of 250 bytes per AS entry. At the shortest, and thus chattiest, allowed propagation period of 5 seconds, this corresponds to a total bandwidth of very roughly 2.5 MB/s and to 10000 signature verifications per second.
If the same AS has 1000 child links, the propagation of the beacons will require signing one new AS entry for each of the propagated PCBs for each link (at most 50 per link), that is at most 50000 signatures per propagation event.
The total bandwidth for the propagation of these PCBs for all 1000 child links would, again very roughly, be around 25MB/s.
All of these are manageable with even modest consumer hardware.
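
The arithmetic behind these intra-ISD figures can be reproduced with the following sketch. All parameter values are the ones quoted in the text (50 PCBs per link, 10 AS entries per PCB, 250 bytes per AS entry, 5 second interval); the function and key names are hypothetical, and the extra AS entry appended on propagation is ignored to match the rough figures above.

```python
def intra_isd_load(parent_links, child_links, pcbs_per_link=50,
                   avg_pcb_len=10, bytes_per_entry=250, interval_s=5):
    """Rough upper bounds on per-AS beaconing load in intra-ISD beaconing."""
    received_pcbs = parent_links * pcbs_per_link      # per propagation interval
    received_entries = received_pcbs * avg_pcb_len    # one signature per AS entry
    sent_pcbs = child_links * pcbs_per_link           # one new signature per sent PCB
    return {
        "received_pcbs": received_pcbs,
        "received_entries": received_entries,
        "rx_bytes_per_s": received_entries * bytes_per_entry / interval_s,
        "verifications_per_s": received_entries / interval_s,
        "signatures_per_event": sent_pcbs,
        "tx_bytes_per_s": sent_pcbs * avg_pcb_len * bytes_per_entry / interval_s,
    }


if __name__ == "__main__":
    for name, value in intra_isd_load(parent_links=100, child_links=1000).items():
        print(f"{name}: {value:,.0f}")
```

Changing `interval_s` or `avg_pcb_len` shows how the load scales linearly with the total number of AS entries handled per second.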

On a cold start of the network, path segments to each AS are discovered after a number of propagation steps proportional to the longest path. As mentioned, the longest path is typically not long. With a 5 second propagation period and a generous longest path of length 10, all path segments are discovered after 25 seconds on average.

When a new parent-child link is added to the network, the parent AS will propagate the available PCBs in the next propagation event. If the AS on the child side of the new link is a leaf AS, path discovery is thus complete after one single propagation interval. Otherwise, child ASes at distance D below the new link learn of the new link after D further propagation steps.

### Inter-ISD Beaconing
In the inter-ISD core beaconing, PCBs are propagated omnidirectionally along core links. Each AS discovers path segments from itself to any other core AS.
The number of distinct paths through the core network is typically very large. To keep the overhead manageable, at most 5 path segments to every destination AS are discovered, and the propagation frequency is slower than in the intra-ISD beaconing (at least 60 seconds between propagation events).

Without making strong assumptions on the topology of the core network, we can assume that shortest paths through real-world, Internet-like networks are relatively short; for example, the Barabási-Albert random graph model predicts a diameter of log(N)/log(log(N)) for a network with N nodes {{BollRio-2000}}. The average distance scales in the same way.
We cannot assume that the selected PCBs are strictly shortest paths through the network, but it's reasonable to assume that they will not be very much longer than the shortest paths either.
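
To give a feeling for how slowly this diameter estimate grows, the sketch below evaluates log(N)/log(log(N)) for a few network sizes. It assumes natural logarithms and omits any constant factor, so the values only indicate the growth trend and are not predictions of actual PCB lengths; the function name is illustrative.

```python
import math


def ba_diameter_estimate(n):
    """Asymptotic estimate log(N) / log(log(N)), constant factors omitted."""
    if n < 3:
        raise ValueError("estimate is only meaningful for n >= 3")
    return math.log(n) / math.log(math.log(n))


if __name__ == "__main__":
    for n in (100, 1_000, 10_000, 1_000_000):
        print(f"N = {n:>9,}: estimate {ba_diameter_estimate(n):.2f}")
```

Going from a thousand to a million nodes increases the estimate by less than a factor of two, which is why PCB lengths in the core are expected to stay short.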

With N the number of participating core ASes, an AS receives up to 5 * N PCBs per propagation interval per core link interface.
For highly connected ASes, the number of PCBs received thus becomes rather large. In a network of 1000 ASes, a highly connected AS with 300 core links receives up to 1.5 million PCBs per propagation interval.
Assuming an average PCB length of 6 and the shortest propagation interval of 60 seconds, this corresponds to roughly 150 thousand signature validations per second. This throughput can be achieved on a single core of a present day small server or desktop machine.
In terms of bandwidth, this corresponds to very roughly 38MB/s.
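
The core beaconing figures follow from the same kind of back-of-the-envelope calculation. The sketch below uses the values quoted in the text (5 segments per destination, 1000 core ASes, 300 core links, average PCB length 6, 60 second interval) and carries over the 250 bytes per AS entry assumed in the intra-ISD example; the function and key names are hypothetical.

```python
def core_beaconing_load(n_core_ases, core_links, segments_per_dest=5,
                        avg_pcb_len=6, bytes_per_entry=250, interval_s=60):
    """Rough upper bounds on the load of one core AS in inter-ISD beaconing."""
    # Up to segments_per_dest PCBs per destination AS on every core link.
    received_pcbs = segments_per_dest * n_core_ases * core_links
    received_entries = received_pcbs * avg_pcb_len  # one signature per AS entry
    return {
        "received_pcbs": received_pcbs,
        "verifications_per_s": received_entries / interval_s,
        "rx_bytes_per_s": received_entries * bytes_per_entry / interval_s,
    }


if __name__ == "__main__":
    load = core_beaconing_load(n_core_ases=1000, core_links=300)
    print(f"PCBs per interval:  {load['received_pcbs']:,}")
    print(f"verifications / s:  {load['verifications_per_s']:,.0f}")
    print(f"bandwidth (MB / s): {load['rx_bytes_per_s'] / 1e6:.1f}")
```

The number of received PCBs grows linearly with N, while the number of signature verifications additionally scales with the average PCB length.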
Member

Maybe here we could summarize by saying that the overall message complexity for an AS is linear in the number of core ASes N.

Contributor Author

But it's not, it's N times the path length. That's the whole buildup of this section:

  • [Resource costs] depend on the number and length of the discovered path segments, that is, on the total number of AS entries of the discovered path segments.

  • Then we say that in core network, PCBs are roughly log(N) long.

  • With N the number of participating core ASes, an AS receives up to 5 * N PCBs per propagation interval per core link interface.

Member

@nicorusti nicorusti Jul 7, 2024

Thanks for the clarification, as far as I understand then the message complexity in terms of number of signature validations per AS can be approximated with O(N*log(N)), while the amount of propagated PCBs per AS is O(N), correct?
If you agree, I still think it might be more understandable to directly mention it



On a cold start of the network, full connectivity is obtained after a number of propagation steps corresponding to the diameter of the network. Assuming a network diameter of 6, this corresponds to roughly 3 minutes on average.

When a new link is added to the network, it will be available to connect two ASes at distances D1 and D2 from the link, respectively, after a mean time of (D1+D2)*T/2.
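
Both core beaconing time estimates are applications of the same rule: each propagation step takes T / 2 on average. A minimal sketch, with illustrative function names and with D1 = 2 and D2 = 3 chosen purely as example distances:

```python
def cold_start_mean_time(diameter, interval_s):
    """Mean time until full connectivity after a cold start, in seconds."""
    return diameter * interval_s / 2


def new_link_mean_time(d1, d2, interval_s):
    """Mean time until a new link connects ASes at distances d1 and d2 from it."""
    return (d1 + d2) * interval_s / 2


if __name__ == "__main__":
    T = 60  # shortest core propagation interval in seconds
    print(f"cold start, diameter 6: {cold_start_mean_time(6, T):.0f} s")
    print(f"new link, D1=2, D2=3:   {new_link_mean_time(2, 3, T):.0f} s")
```

With a diameter of 6 and T = 60 seconds, the cold start estimate is 180 seconds, matching the 3 minutes given above.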


# Registration of Path Segments {#path-segment-reg}

**Path registration** is the process where an AS transforms selected PCBs into path segments, and adds these segments to the relevant path databases, thus making them available to other ASes.