Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testing: improved stream synthesis and cores use in router benchmark #4457

Merged
merged 20 commits into from
Dec 20, 2023

Conversation

jiceatscion
Copy link
Contributor

@jiceatscion jiceatscion commented Dec 13, 2023

The main goal isn't to improve the pkt/s number that comes out of the benchmark, but to make it more stable from
run to run on the same machine.

Changes:

  • Use only 4 cores; 3 for the router and one for brload. That's a configuration available on many machines.
  • Update the flowID of packets on the fly (just before sending) as opposed to pre-constructing a different packet for each stream. (As a result, the test cases are simplified, they no-longer need to provide one packet per flowID).
  • In passing, deleted the dumping of /proc/cpuinfo in the log. That has served its purpose.
  • In the test harness, add simple core selection and partitioning: avoid cpus with unreliable performance (such as hyperthreads). Prefer cpus from unshared L2 caches or all from the same L2 cache.
  • Added monitoring of packet drops, to verify that the router was loaded to capacity.
  • Update expectations.

This change is Reviewable

Changes:
* Update the flowID of packets (and a couple of checksums) on the fly (just before
  sending) as opposed to pre-constructing a different packet for each stream.
* As a result, the test cases are simplified, they no-longer need to provide one
  packet per flowID.
* In passing, deleted the dumping of /proc/cpuinfo in the log. That has served its purpose.
* In the test harness, left in the form of comments, a suggestion of core partitioning. Not
  sure how to apply that in general. This is what provides the best results on my machine.
Also enable core-partitioning to see how it behaves on the CI system.
In passing: fixed a couple of function names (camel-> snake)
Also: fixed the broken prom query string.
@jiceatscion jiceatscion changed the title testing: improved stream synthesis in router benchmark testing: improved stream synthesis and cores use in router benchmark Dec 14, 2023
Added L2 cache criteria. The test harness will prefer configuration where either cores share no L2 cache or all cores share the same one.
This is done as an attempt at detecting live migration.
… install.

Three files were missing, most notably the pkgs.txt files which are typically the ones
modified when adding a dependency. As a result new packages weren't getting installed.
This proved useless as a means of spotting migrations. The information doesn't
even change from instance to instance.
After fixing an arithmetic error, it seems we are not able to reliably cause the
router to drop more than 1% of the packets (and then sometimes no even that).
Until we figure out how to reach higher levels, making sure that other PRs
aren't blocked.
Copy link
Contributor

@matzf matzf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 8 files at r1, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @jiceatscion)


acceptance/router_benchmark/test.py line 95 at r1 (raw file):

    best = {cpus[0] for cpus in cores.values() if len(cpus) == 1}
    chosen = {}

Python gotcha: best is a set as intended, but chosen is a dict. Same bug a few lines further down.

chosen = set().


acceptance/router_benchmark/test.py line 144 at r1 (raw file):

        # In the first iteration, A picks only first choice cpus and B supplements the harvest
        # with second choice. If we still need more all first and second choice have been
        # exhausted and subsequent iterations pick whatever's left.

Nit: coding out the third category explicitly would seem to make this much more straight forward (no while loop and no quality counter needed).


acceptance/router_benchmark/test.py line 151 at r1 (raw file):

                break
            if len(cpus) == 1:
                report[quality] += 1

KeyError: 0


acceptance/router_benchmark/test.py line 228 at r1 (raw file):

            cpu = c["cpu"]
            core = c["core"]
            if all_cpus is None or len(all_cpus) == 0:

Already checked above. Perhaps you wanted to check that cpu and core are usable? Also use .get for these?


acceptance/router_benchmark/test.py line 243 at r1 (raw file):

                caches[l2] = [cpu]
            else:
                caches[l2].append(cpu)

Shortcut with setdefault. Same below for cores.

Suggestion:

            caches.setdefault(l2, []).append(cpu)

acceptance/router_benchmark/test.py line 390 at r1 (raw file):

               "--network", "container:prometheus",
               "--name", "router",
               "--cpuset-cpus", f"{','.join(map(str, self.router_cpus))}",

Drop the f"{}", this is already a string. Some more of these below.


acceptance/router_benchmark/test.py line 548 at r1 (raw file):

        # Log and check the saturation...
        # If this is used as a CI test. Make sure that the saturation is within the expected
        # ballpark (we expect at least 4% packet dropped due to queue overflow).

Nit: "4%" but the code has 1%. Maybe say "we expect a certain loss rate ..."?

Copy link
Contributor Author

@jiceatscion jiceatscion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @matzf)


acceptance/router_benchmark/test.py line 95 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Python gotcha: best is a set as intended, but chosen is a dict. Same bug a few lines further down.

chosen = set().

F*g python. This one gets me at least once a year. And every time I notice the ambiguity I think python cleverly makes them indistinguishable. Then learn otherwise, then forget. Anyway... fixed.


acceptance/router_benchmark/test.py line 144 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit: coding out the third category explicitly would seem to make this much more straight forward (no while loop and no quality counter needed).

Huh, actually that function was refactored one too many times and it's broken. It happens to no-longer be exercised on the CI or on my machine because the other two return better solutions. Fixing and re-testing. Regarding the style, I couldn't do it quite like you suggested, but you might like the rewrite.
Done


acceptance/router_benchmark/test.py line 151 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

KeyError: 0

Done.


acceptance/router_benchmark/test.py line 228 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Already checked above. Perhaps you wanted to check that cpu and core are usable? Also use .get for these?

Right. Started it and... cookies?


acceptance/router_benchmark/test.py line 243 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Shortcut with setdefault. Same below for cores.

Better, I created a defaultdict.


acceptance/router_benchmark/test.py line 390 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Drop the f"{}", this is already a string. Some more of these below.

Oops.
Done


acceptance/router_benchmark/test.py line 548 at r1 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit: "4%" but the code has 1%. Maybe say "we expect a certain loss rate ..."?

Yeah. It started as 4% but proved a bit too ambitious after the arithmetic error got fixed. Updated the comment to remove the duplicate number.

Copy link
Contributor

@matzf matzf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @jiceatscion)

After switching to non-overbooked aws instances, performance has become much better
and repeatable.
Copy link
Contributor

@matzf matzf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @jiceatscion)

@jiceatscion jiceatscion merged commit 6b4627c into scionproto:master Dec 20, 2023
4 checks passed
@jiceatscion jiceatscion deleted the benchmark_streams branch December 20, 2023 14:05
juagargi pushed a commit to netsec-ethz/scion that referenced this pull request Mar 8, 2024
…cionproto#4457)

* testing: improved stream synthesis in router benchmark

Changes:
* Update the flowID of packets (and a couple of checksums) on the fly (just before
  sending) as opposed to pre-constructing a different packet for each stream.
* As a result, the test cases are simplified, they no-longer need to provide one
  packet per flowID.
* In passing, deleted the dumping of /proc/cpuinfo in the log. That has served its purpose.
* In the test harness, left in the form of comments, a suggestion of core partitioning. Not
  sure how to apply that in general. This is what provides the best results on my machine.

* testing: simplify the checksum updates out of the blaster.

Also enable core-partitioning to see how it behaves on the CI system.

* testing: Add check for proper saturation in the router benchmark.

In passing: fixed a couple of function names (camel-> snake)

* testing: fix expectation for droppage in the router benchmark

Was expressed in pkts instead of %.

* testing: cosmetic change to the output of expected drop ratio.

* testing: linter friendship

* testing: include a cpu selection in the router benchmark

Also: fixed the broken prom query string.

* testing: please the linter.

* testing: log lscpus on the CI system.

* testing: fix bad string formating.

* testing: Improve core selection in the router benchmark.

Added L2 cache criteria. The test harness will prefer configuration where either cores share no L2 cache or all cores share the same one.

* testing: make the router benchmark test harness compatible with more variants
of lscpu (hopefully).

* testing: log the CPU serial number in the router benchmark.

This is done as an attempt at detecting live migration.

* build: added dmidecode as a required package.

* build: Complete the list of files that affect the list of packages to install.

Three files were missing, most notably the pkgs.txt files which are typically the ones
modified when adding a dependency. As a result new packages weren't getting installed.

* build: Delete use of dmidecode.

This proved useless as a means of spotting migrations. The information doesn't
even change from instance to instance.

* build: fix trailing blank line.

* router: reduce the benchmarks expectations for saturation.

After fixing an arithmetic error, it seems we are not able to reliably cause the
router to drop more than 1% of the packets (and then sometimes no even that).
Until we figure out how to reach higher levels, making sure that other PRs
aren't blocked.

* testing: fixed bugs in router_benchmark core selection code.

* testing: update the expectations of the router benchmark.

After switching to non-overbooked aws instances, performance has become much better
and repeatable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants