testing: improved stream synthesis and cores use in router benchmark #4457

jiceatscion · 2023-12-13T18:02:45Z

The main goal isn't to improve the pkt/s number that comes out of the benchmark, but to make it more stable from
run to run on the same machine.

Changes:

Use only 4 cores; 3 for the router and one for brload. That's a configuration available on many machines.
Update the flowID of packets on the fly (just before sending) as opposed to pre-constructing a different packet for each stream. (As a result, the test cases are simplified: they no longer need to provide one packet per flowID.)
In passing, deleted the dumping of /proc/cpuinfo in the log. That has served its purpose.
In the test harness, add simple core selection and partitioning: avoid cpus with unreliable performance (such as hyperthreads). Prefer cpus from unshared L2 caches or all from the same L2 cache.
Added monitoring of packet drops, to verify that the router was loaded to capacity.
Update expectations.


Changes:
* Update the flowID of packets (and a couple of checksums) on the fly (just before sending) as opposed to pre-constructing a different packet for each stream.
* As a result, the test cases are simplified: they no longer need to provide one packet per flowID.
* In passing, deleted the dumping of /proc/cpuinfo in the log. That has served its purpose.
* In the test harness, left a suggestion of core partitioning in the form of comments. Not sure how to apply that in general. This is what provides the best results on my machine.

Also enable core-partitioning to see how it behaves on the CI system.

In passing: fixed a couple of function names (camel -> snake).

Was expressed in pkts instead of %.

Also: fixed the broken prom query string.

Added L2 cache criteria. The test harness will prefer configuration where either cores share no L2 cache or all cores share the same one.
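The criterion stated above can be sketched as a small predicate. This is only an illustration: the function name `l2_layout_ok` and the `l2_of` mapping (cpu id to L2 cache id) are hypothetical and not taken from the harness.

```python
def l2_layout_ok(cpus, l2_of):
    """Return True if the chosen cpus share no L2 cache, or all share one.

    cpus:  list of cpu ids under consideration (hypothetical input shape).
    l2_of: dict mapping cpu id -> L2 cache id (hypothetical input shape).
    """
    ids = [l2_of[c] for c in cpus]
    distinct = len(set(ids))
    # Either every cpu has its own L2, or there is at most one L2 overall.
    return distinct == len(ids) or distinct <= 1
```

A mixed layout, where some cpus share an L2 and others do not, is the case this check rejects.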

…variants of lscpu (hopefully).

This is done as an attempt at detecting live migration.

… install. Three files were missing, most notably the pkgs.txt files which are typically the ones modified when adding a dependency. As a result new packages weren't getting installed.

This proved useless as a means of spotting migrations. The information doesn't even change from instance to instance.

After fixing an arithmetic error, it seems we are not able to reliably cause the router to drop more than 1% of the packets (and sometimes not even that). Until we figure out how to reach higher levels, make sure that other PRs aren't blocked.

matzf

Reviewed 8 of 8 files at r1, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @jiceatscion)

acceptance/router_benchmark/test.py line 95 at r1 (raw file):

    best = {cpus[0] for cpus in cores.values() if len(cpus) == 1}
    chosen = {}

Python gotcha: best is a set as intended, but chosen is a dict. Same bug a few lines further down.

chosen = set().
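For readers unfamiliar with the gotcha, a minimal illustration follows. The sample `cores` mapping is made up for the example; only the `best` comprehension is taken from the quoted code.

```python
# Hypothetical sample: core id -> list of cpu ids on that core.
cores = {0: [0], 1: [2, 3]}

# A non-empty brace comprehension without key: value pairs builds a set.
best = {cpus[0] for cpus in cores.values() if len(cpus) == 1}

# Empty braces are always a dict; an empty set needs set().
empty_braces = {}
empty_set = set()

kinds = (type(best).__name__, type(empty_braces).__name__, type(empty_set).__name__)
```

With `chosen = {}`, later calls such as `chosen.add(cpu)` would raise `AttributeError`, since dicts have no `add` method.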

acceptance/router_benchmark/test.py line 144 at r1 (raw file):

        # In the first iteration, A picks only first choice cpus and B supplements the harvest
        # with second choice. If we still need more all first and second choice have been
        # exhausted and subsequent iterations pick whatever's left.

Nit: coding out the third category explicitly would seem to make this much more straightforward (no while loop and no quality counter needed).
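One possible shape for that, as a sketch only: the three tiers are built up front and concatenated. The names `rank_cpus` and `pick_cpus` are hypothetical, and the meaning of "second choice" (one cpu per multi-threaded core) is an assumption, not something the quoted code confirms.

```python
def rank_cpus(cores):
    """Order cpu ids as first choice, second choice, then whatever is left.

    cores maps a core id to the list of cpu ids (hardware threads) on it.
    Assumed tiers: cpus alone on their core, then one cpu per shared core,
    then the remaining sibling threads.
    """
    first, second, third = [], [], []
    for _core, cpus in sorted(cores.items()):
        if not cpus:
            continue
        if len(cpus) == 1:
            first.append(cpus[0])
        else:
            second.append(cpus[0])
            third.extend(cpus[1:])
    return first + second + third


def pick_cpus(cores, needed):
    """Take the best `needed` cpus; returns fewer if not enough exist."""
    return rank_cpus(cores)[:needed]
```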

acceptance/router_benchmark/test.py line 151 at r1 (raw file):

                break
            if len(cpus) == 1:
                report[quality] += 1

KeyError: 0
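Assuming `report` starts out as a plain empty dict (the quoted lines do not show its initialization), `report[quality] += 1` fails on the first increment. A sketch of one way around it, using `collections.Counter`; the function name and the shape of `picks` are hypothetical.

```python
from collections import Counter


def count_by_quality(picks):
    """Count picked cpus per quality level.

    picks: iterable of (quality, cpu) pairs (hypothetical shape).
    """
    report = Counter()  # missing keys read as 0, so += 1 never raises
    for quality, _cpu in picks:
        report[quality] += 1
    return report
```

`collections.defaultdict(int)` or `report[quality] = report.get(quality, 0) + 1` would work equally well.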

acceptance/router_benchmark/test.py line 228 at r1 (raw file):

            cpu = c["cpu"]
            core = c["core"]
            if all_cpus is None or len(all_cpus) == 0:

Already checked above. Perhaps you wanted to check that cpu and core are usable? Also use .get for these?
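A sketch of what the `.get`-based check might look like. Only the `"cpu"` and `"core"` keys come from the quoted code; the function name and the decision to skip incomplete entries are assumptions.

```python
def usable_entries(entries):
    """Yield (cpu, core) pairs, skipping entries that lack either field.

    entries: list of dicts with "cpu" and "core" keys, as in the quoted code.
    """
    for c in entries:
        cpu = c.get("cpu")
        core = c.get("core")
        # Compare against None explicitly: cpu 0 and core 0 are valid ids.
        if cpu is None or core is None:
            continue
        yield cpu, core
```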

acceptance/router_benchmark/test.py line 243 at r1 (raw file):

                caches[l2] = [cpu]
            else:
                caches[l2].append(cpu)

Shortcut with setdefault. Same below for cores.

Suggestion:

            caches.setdefault(l2, []).append(cpu)
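To show that the one-liner behaves the same as the if/else it replaces, here is a small standalone sketch. The function name `group_by_key` and the sample keys are made up for illustration.

```python
def group_by_key(pairs):
    """Group cpu ids by a key such as an L2 cache id or a core id."""
    groups = {}
    for key, cpu in pairs:
        # Creates the list on first sight of key, then appends either way.
        groups.setdefault(key, []).append(cpu)
    return groups
```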

acceptance/router_benchmark/test.py line 390 at r1 (raw file):

               "--network", "container:prometheus",
               "--name", "router",
               "--cpuset-cpus", f"{','.join(map(str, self.router_cpus))}",

Drop the f"{}", this is already a string. Some more of these below.
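A quick check that the wrapper adds nothing; the `router_cpus` value here is a hypothetical sample.

```python
router_cpus = [2, 4, 6]  # hypothetical sample value

# The f-string only re-wraps a value that str.join already returns as str.
wrapped = f"{','.join(map(str, router_cpus))}"
plain = ",".join(map(str, router_cpus))
```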

acceptance/router_benchmark/test.py line 548 at r1 (raw file):

        # Log and check the saturation...
        # If this is used as a CI test. Make sure that the saturation is within the expected
        # ballpark (we expect at least 4% packet dropped due to queue overflow).

Nit: "4%" but the code has 1%. Maybe say "we expect a certain loss rate ..."?
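For reference, a sketch of a loss-rate check expressed in percent with a named threshold, so that the comment and the code cannot drift apart. The function names, the default threshold, and the choice of denominator (dropped plus forwarded) are assumptions; the harness may compute this differently.

```python
def drop_percent(dropped, forwarded):
    """Percentage of packets dropped out of all packets handled."""
    total = dropped + forwarded
    if total == 0:
        return 0.0
    return 100.0 * dropped / total


def is_saturated(dropped, forwarded, min_percent=1.0):
    """True if the loss rate reaches the expected minimum (assumed 1%)."""
    return drop_percent(dropped, forwarded) >= min_percent
```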

jiceatscion

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @matzf)

acceptance/router_benchmark/test.py line 95 at r1 (raw file):