testing: improved stream synthesis and cores use in router benchmark #4457
Conversation
Changes:
* Update the flowID of packets (and a couple of checksums) on the fly (just before sending), as opposed to pre-constructing a different packet for each stream (see the sketch below).
* As a result, the test cases are simplified; they no longer need to provide one packet per flowID.
* In passing, deleted the dumping of /proc/cpuinfo in the log. That has served its purpose.
* In the test harness, left in the form of comments, a suggestion of core partitioning. Not sure how to apply that in general. This is what provides the best results on my machine.
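For context, the on-the-fly approach boils down to patching a few bytes of one pre-built template packet for each stream, rather than keeping one packet per flowID. A minimal sketch of the idea, assuming a raw frame held in a bytearray; the offset, the helper names, and the send loop are illustrative, not the harness's actual code (in SCION the 20-bit flowID shares a 32-bit word with the version and traffic-class fields):

```python
import struct

def set_flow_id(frame: bytearray, flow_id: int, offset: int) -> None:
    """Patch the 20-bit flowID in place, preserving the version and traffic-class
    bits that live in the same 32-bit word of the SCION common header."""
    word, = struct.unpack_from("!I", frame, offset)
    word = (word & ~0xFFFFF) | (flow_id & 0xFFFFF)
    struct.pack_into("!I", frame, offset, word)

# One pre-built template frame serves every stream; only the flowID (and, in the
# real harness, the affected checksums) is refreshed just before sending.
template = bytearray(64)                      # stand-in for a pre-constructed packet
for flow_id in range(4):
    set_flow_id(template, flow_id, offset=0)  # offset of the common header is illustrative
    # update_checksums(template)              # hypothetical helper
    # send(template)
```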
Also enable core-partitioning to see how it behaves on the CI system.
In passing: fixed a couple of function names (camelCase -> snake_case).
Was expressed in pkts instead of %.
Also: fixed the broken prom query string.
Added an L2 cache criterion (sketched below): the test harness will prefer configurations where either the cores share no L2 cache or all cores share the same one.
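Roughly, that preference can be phrased as: a candidate set of CPUs is acceptable when the L2 caches they map to are either all identical or all distinct. A sketch with made-up names (not the harness's actual helper), assuming a cpu-to-L2-cache mapping derived from lscpu:

```python
def l2_friendly(cpus: list[int], l2_of: dict[int, int]) -> bool:
    """True if the given CPUs all share one L2 cache, or share none at all."""
    caches = [l2_of[c] for c in cpus]
    return len(set(caches)) in (1, len(caches))

# Example: CPUs 0 and 1 share L2 cache 0, CPU 2 has its own.
l2_of = {0: 0, 1: 0, 2: 1}
assert l2_friendly([0, 1], l2_of)         # one shared cache
assert l2_friendly([0, 2], l2_of)         # no shared cache
assert not l2_friendly([0, 1, 2], l2_of)  # mixed: 0 and 1 share, 2 doesn't
```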
Make the router benchmark test harness compatible with more variants of lscpu (hopefully).
This is done as an attempt at detecting live migration.
Complete the list of files that affect the list of packages to install. Three files were missing, most notably the pkgs.txt files, which are typically the ones modified when adding a dependency. As a result, new packages weren't getting installed.
This proved useless as a means of spotting migrations. The information doesn't even change from instance to instance.
After fixing an arithmetic error, it seems we are not able to reliably cause the router to drop more than 1% of the packets (and sometimes not even that). Until we figure out how to reach higher levels, this makes sure that other PRs aren't blocked.
Reviewed 8 of 8 files at r1, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @jiceatscion)
acceptance/router_benchmark/test.py
line 95 at r1 (raw file):
    best = {cpus[0] for cpus in cores.values() if len(cpus) == 1}
    chosen = {}

Python gotcha: `best` is a set as intended, but `chosen` is a dict. Same bug a few lines further down. Suggestion: `chosen = set()`.
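(For the record, the gotcha is that empty braces always denote a dict in Python; an empty set can only be written as `set()`:)

```python
chosen = {}              # a dict -- '{}' is never a set
best = {1, 2, 3}         # a set  -- non-empty braces with bare elements
print(type(chosen))      # <class 'dict'>
print(type(set()))       # <class 'set'>, the only spelling of an empty set
```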
acceptance/router_benchmark/test.py
line 144 at r1 (raw file):
    # In the first iteration, A picks only first choice cpus and B supplements the harvest
    # with second choice. If we still need more all first and second choice have been
    # exhausted and subsequent iterations pick whatever's left.

Nit: coding out the third category explicitly would seem to make this much more straightforward (no while loop and no quality counter needed).
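One possible shape of that suggestion (hypothetical names, not the code the PR actually adopted): enumerate the preference tiers explicitly and take CPUs tier by tier until enough have been chosen.

```python
def pick_cpus(first: list[int], second: list[int], rest: list[int], want: int) -> list[int]:
    """Pick up to `want` CPUs, exhausting each preference tier before the next."""
    chosen: list[int] = []
    for tier in (first, second, rest):
        chosen += tier[: want - len(chosen)]
        if len(chosen) == want:
            break
    return chosen
```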
acceptance/router_benchmark/test.py
line 151 at r1 (raw file):
        break
    if len(cpus) == 1:
        report[quality] += 1
KeyError: 0
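A common fix for that kind of tally (a sketch, not necessarily what the PR ended up with): use `collections.Counter` or a `defaultdict(int)` so a quality seen for the first time starts at zero instead of raising.

```python
from collections import Counter

report = Counter()            # missing keys implicitly count as 0
for quality in (0, 1, 1, 2):
    report[quality] += 1      # no KeyError on the first occurrence of a quality
print(report)                 # Counter({1: 2, 0: 1, 2: 1})
```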
acceptance/router_benchmark/test.py
line 228 at r1 (raw file):
cpu = c["cpu"] core = c["core"] if all_cpus is None or len(all_cpus) == 0:
Already checked above. Perhaps you wanted to check that cpu
and core
are usable? Also use .get
for these?
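A sketch of what the `.get` variant could look like, assuming `c` is one CPU entry of the parsed lscpu output (only the `cpu`/`core` field names come from the snippet under review; the surrounding loop and skip logic are illustrative):

```python
lscpu_entries = [{"cpu": 0, "core": 0}, {"cpu": 1}]   # stand-in for parsed lscpu rows

usable = []
for c in lscpu_entries:
    cpu = c.get("cpu")        # .get returns None instead of raising on a missing field
    core = c.get("core")
    if cpu is None or core is None:
        continue              # not a usable CPU description; skip it
    usable.append((cpu, core))
```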
acceptance/router_benchmark/test.py
line 243 at r1 (raw file):
        caches[l2] = [cpu]
    else:
        caches[l2].append(cpu)

Shortcut with `setdefault`. Same below for `cores`.
Suggestion:
    caches.setdefault(l2, []).append(cpu)
acceptance/router_benchmark/test.py
line 390 at r1 (raw file):
"--network", "container:prometheus", "--name", "router", "--cpuset-cpus", f"{','.join(map(str, self.router_cpus))}",
Drop the f"{}"
, this is already a string. Some more of these below.
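(The point being that `','.join(...)` already yields a plain str, so the f-string wrapper merely rebuilds the identical string:)

```python
cpus = [0, 2, 4]
assert ','.join(map(str, cpus)) == f"{','.join(map(str, cpus))}" == "0,2,4"
```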
acceptance/router_benchmark/test.py
line 548 at r1 (raw file):
    # Log and check the saturation...
    # If this is used as a CI test. Make sure that the saturation is within the expected
    # ballpark (we expect at least 4% packet dropped due to queue overflow).
Nit: "4%" but the code has 1%. Maybe say "we expect a certain loss rate ..."?
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @matzf)
acceptance/router_benchmark/test.py
line 95 at r1 (raw file):
Previously, matzf (Matthias Frei) wrote…
Python gotcha: `best` is a set as intended, but `chosen` is a dict. Same bug a few lines further down. Suggestion: `chosen = set()`.
F*g python. This one gets me at least once a year. And every time I notice the ambiguity, I think Python cleverly makes them indistinguishable; then I learn otherwise, then forget. Anyway... fixed.
acceptance/router_benchmark/test.py
line 144 at r1 (raw file):
Previously, matzf (Matthias Frei) wrote…
Nit: coding out the third category explicitly would seem to make this much more straightforward (no while loop and no quality counter needed).
Huh, actually that function was refactored one too many times and it's broken. It happens to no longer be exercised on the CI or on my machine, because the other two return better solutions. Fixing and re-testing. Regarding the style, I couldn't do it quite like you suggested, but you might like the rewrite.
Done
acceptance/router_benchmark/test.py
line 151 at r1 (raw file):
Previously, matzf (Matthias Frei) wrote…
KeyError: 0
Done.
acceptance/router_benchmark/test.py
line 228 at r1 (raw file):
Previously, matzf (Matthias Frei) wrote…
Already checked above. Perhaps you wanted to check that `cpu` and `core` are usable? Also use `.get` for these?
Right. Started it and... cookies?
acceptance/router_benchmark/test.py
line 243 at r1 (raw file):
Previously, matzf (Matthias Frei) wrote…
Shortcut with `setdefault`. Same below for `cores`.
Better, I created a defaultdict.
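(For illustration, the `defaultdict` flavour of that grouping; the field names and the `lscpu_entries` iterable are assumptions, as above:)

```python
from collections import defaultdict

lscpu_entries = [                  # stand-in for parsed lscpu rows; field names assumed
    {"cpu": 0, "core": 0, "l2": 0},
    {"cpu": 1, "core": 0, "l2": 0},
    {"cpu": 2, "core": 1, "l2": 1},
]

caches = defaultdict(list)         # L2 cache id -> CPUs sharing that cache
cores = defaultdict(list)          # core id     -> hyperthread siblings on that core
for c in lscpu_entries:
    caches[c["l2"]].append(c["cpu"])
    cores[c["core"]].append(c["cpu"])
```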
acceptance/router_benchmark/test.py
line 390 at r1 (raw file):
Previously, matzf (Matthias Frei) wrote…
Drop the `f"{}"`, this is already a string. Some more of these below.
Oops.
Done
acceptance/router_benchmark/test.py
line 548 at r1 (raw file):
Previously, matzf (Matthias Frei) wrote…
Nit: "4%" but the code has 1%. Maybe say "we expect a certain loss rate ..."?
Yeah. It started as 4% but proved a bit too ambitious after the arithmetic error got fixed. Updated the comment to remove the duplicate number.
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @jiceatscion)
After switching to non-overbooked AWS instances, performance has become much better and more repeatable.
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @jiceatscion)
testing: improved stream synthesis and cores use in router benchmark (scionproto#4457)

* testing: improved stream synthesis in router benchmark
  Changes:
  * Update the flowID of packets (and a couple of checksums) on the fly (just before sending) as opposed to pre-constructing a different packet for each stream.
  * As a result, the test cases are simplified; they no longer need to provide one packet per flowID.
  * In passing, deleted the dumping of /proc/cpuinfo in the log. That has served its purpose.
  * In the test harness, left in the form of comments, a suggestion of core partitioning. Not sure how to apply that in general. This is what provides the best results on my machine.
* testing: simplify the checksum updates out of the blaster. Also enable core-partitioning to see how it behaves on the CI system.
* testing: add check for proper saturation in the router benchmark. In passing: fixed a couple of function names (camelCase -> snake_case).
* testing: fix expectation for droppage in the router benchmark. Was expressed in pkts instead of %.
* testing: cosmetic change to the output of expected drop ratio.
* testing: linter friendship.
* testing: include a cpu selection in the router benchmark. Also: fixed the broken prom query string.
* testing: please the linter.
* testing: log lscpus on the CI system.
* testing: fix bad string formatting.
* testing: improve core selection in the router benchmark. Added an L2 cache criterion: the test harness will prefer configurations where either the cores share no L2 cache or all cores share the same one.
* testing: make the router benchmark test harness compatible with more variants of lscpu (hopefully).
* testing: log the CPU serial number in the router benchmark. This is done as an attempt at detecting live migration.
* build: added dmidecode as a required package.
* build: complete the list of files that affect the list of packages to install. Three files were missing, most notably the pkgs.txt files, which are typically the ones modified when adding a dependency. As a result, new packages weren't getting installed.
* build: delete use of dmidecode. This proved useless as a means of spotting migrations. The information doesn't even change from instance to instance.
* build: fix trailing blank line.
* router: reduce the benchmark's expectations for saturation. After fixing an arithmetic error, it seems we are not able to reliably cause the router to drop more than 1% of the packets (and sometimes not even that). Until we figure out how to reach higher levels, this makes sure that other PRs aren't blocked.
* testing: fixed bugs in router_benchmark core selection code.
* testing: update the expectations of the router benchmark. After switching to non-overbooked AWS instances, performance has become much better and more repeatable.
The main goal isn't to improve the pkt/s number that comes out of the benchmark, but to make it more stable from run to run on the same machine.