testing: add router benchmark program and use it as integration test #4437

jiceatscion · 2023-11-06T18:36:55Z

Summary of changes:

Enhance the end2end_integration programs as follows:
- end2end_integration has a slightly finer-grain role-based pair selection capability.
- end2end_integration outputs an extra result line that provides precise start/stop timestamps for the whole test.
process_metrics exposes the number of cores used by Go.
Add end2endblast. Similar to end2end but plays a game other than pingpong: packetblast. It keeps logging and tracing to a minimum. The server side responds to very few of the pings (1/256). Just enough to prove that the circuit works.
Add an integration test called router_benchmark:
- relies on an especially crafted topology: router_bm.topo
- uses end2end_integration with command end2endblast over that topology as a load test for the router
- extracts and logs relevant performance metrics after the test
- compares performances with a set of expectations if executed by CI.

As a benchmark program, this is only a first approximation. Extending the end2end suite provided a quick solution but isn't ideal: it uses too many cores at once due to requiring 5 routers actively running for one of the tests. More work is in progress to replace this with some leaner setup where only one router is actually running.

Fixes: #4410
Fixes: #4411

This change is

Miscellaneous facts: * This load test can be used to benchmark the router. * The test orchestrator is a simplified variant of the end2end_integration test. * The leg-work is done by the end2end binary, which has been extended to play other games than ping-pong. * The test always exercises all available pairs. For the purpose of benchmarking, it relies on a bespoke topology: router_bm.topo, which has just enough to exercise all types of forwarding.

…outine closure. Just let it be captured.

For now the test is not accomplishing much. It runs the benchmark and collects "some" metrics. The metrics are just printed out in raw form. A future version will collect upcoming (separate PR) metrics that are actually relevant to a router benchmark, and compare them with some expectations.

I needed most of the features in end2end_integration so there was no benefit in having a separate executable. It turned out impossible to create a custom topology that would produce only one type of traffic at one router while exercising all possible role pairs. One still needs to chose role pairs to obtain that. Also I couldn't do it with only the existing subsets. I needed to add noncore#nonlocalcore and a slightly extended topology.

Also suppress most pongs in the packetflood game, in order to allow some routers to have only IN or only OUT traffic.

Mainly: * Added some more flexibility in the end2end_integration subset selection. * Make use of samein router_benchmark test. * Suppressed some more tracing under the traces=false option. * Cache the path to the target during during the packetflood game.

Changes: * added flowID to ping messages in the router benchmark program. * fix ping response suppression. Do not suppress one in 256 responses, but allow one in 256. * fixed typoe in log. * fix metric labeling. The proper interface ID for each metric had been lost in a refactoring.

Changes: * Make server methods take a pointer receiver so pongs count isn't reset for each call. * Formatting and so on.

Mainly: * Add mertic to expose number of cores used by Go. * Clarify running/runnable metrics. * Collect the correct performance data from the test run. * Fixed the router_bm topo so that there actually are two ISDs.

Have the test driver supply the best metrics time-window to the test harness.

Rely on wait_connectivity rather than the passage of time.

This is in the faint hope of evading a merge conflict.

...Or may be I had already rebased on top of the trunk's fix. Don't know.

Had to skip the autogenerated docs in rules_openapi/tools as it seems we have no control on those and mdlint doesn't like them anymore.

Checks are only performed during CI. Expectations have been extrapolated from one initial CI run.

That was the problem.

Just so that I can easily see what the CI environment is like.

Delete edundant log. Move core count cpature and output to the end.

Five seconds gave too much variability. It seems that pkt rate and cpu rate don't align very well or that cpu rate is very volatile. That caused pkts/cpu-sec to have large random spikes. With 10s the spikes are gone.

Since switching to a 10s window the average observed performance may have been reduced a bit.

Summary: * In the router, prevent the Go runtime from trying to use more cores than the specified number of processor threads. Reason: * Be consistent with the default where there are as many processor threads as there are cores. * In non-default cases, we want to be able to support multiple routers on the same host. When doing that, Go's assumption that all the cores are fair game distorts the performance as all router instances compete for the same cores and so incurr far more preemption than they would in a normal deployment. * In the test harness, configure the number of cores allowed to each router based on what's available on the host running the test. * In the end2end binary, terminate the pong drain task quickly instead of waiting up to 10s. This allows to report the end-time of the test accurately, thereby removing a lot of variability in the results (before, the metrics capture window could overlap with the end of the active phase of the test). * Because the test goes faster, increase the number of packets to 1.5M, so we are sure we have a reliable 10s metrics capture window in the middle.

matzf

Reviewed 2 of 14 files at r1.
Reviewable status: 2 of 15 files reviewed, 6 unresolved discussions (waiting on @jiceatscion)

.buildkite/pipeline_lib.sh line 20 at r2 (raw file):

        if [[ "$test" =~ router_benchmark ]]; then
            args="--test_arg=--ci"
	fi

There should already be an env variable CI which could be used instead of this flag. If I understand the plumbum API correctly, we could even set the flag implicitly based on an envvar using cli.Flag(...., envname="CI").

acceptance/router_benchmark/BUILD.bazel line 8 at r2 (raw file):

    args = [
        "--executable=end2end_integration:$(location //tools/end2end_integration)",
    ],

I'm actually a bit confused how this can even run the ./bin/end2end as a subcommand binary. How is this even built and how come this is on the appropriate path to run here? Usually bazel is such a stickler about this...

acceptance/router_benchmark/test.py line 3 at r2 (raw file):

#!/usr/bin/env python3

# Copyright 2020 Anapaya Systems

Nit copy pasted copyright

pkg/snet/packet_conn.go line 118 at r2 (raw file):

	// FlowID: arbitrary value to distribute connections over router CPUs.
	// Initialize to 1 if backward compatible (no-distribution) behavior is desired.
	FlowID uint32

Could we deal with this Flow ID change in a separate PR, so that this gets the appropriate visibility?

IMO, PacketConn this is not the right API level to expose this. The PacketConn is sort of a "raw" connection. A user of this API can be expected to set the FlowID directly in the Packet.

I would add this to the snet.Conn instead, setting it when creating a packet in scionConnWriter.WriteTo. We can e.g. set this based on a hash of src/dst ISD, AS, Host address and Port (see e.g. https://www.rfc-editor.org/rfc/rfc6437#appendix-A).
I don't think this requires a backwards compatibility layer. From an application perspective, with using a constant FlowID, we would have possibly had an ordering guarantee between packets sent from different snet.Conns. I doubt that this would even have worked in the first place (dispatcher / sending host might not guarantee the send order already). (Ab-)Using this in an application would probably have been tricky enough that I'm sure someone would have noticed that it's a bad idea.

tools/end2end/main.go line 17 at r2 (raw file):

// This is a general purpose client/server code for end2end tests. It plays
// ping-pong with some variantions depending on command line arguments.

I'm not sure about generalizing this end2end thing. It feels to me like this makes things much more complicated than they need to be.

Wouldn't a separate "blaster" be much shorter and simpler than this? We only want run one test at a time, and so we don't seem to benefit much from the end2end_integration harness. The server side becomes trivial if we don't send any replies and just report received packets out of band.

Note that we could also use the end2end_integration harness with a different command than end2end -- it just assumes that some flags are supported -- so this could still be used with a simpler "blaster" test.

I don't know, what do you think?

tools/integration/integrationlib/common.go line 150 at r2 (raw file):

		}
		log.Info("Retrying...")
		time.Sleep(integration.RetryTimeout)

Nit. This is now printing "Retrying..." after the last attempt.

Observed numbers have about double after improving test.

jiceatscion

Reviewable status: 2 of 15 files reviewed, 2 unresolved discussions (waiting on @matzf)

.buildkite/pipeline_lib.sh line 20 at r2 (raw file):

Previously, matzf (Matthias Frei) wrote…

There should already be an env variable CI which could be used instead of this flag. If I understand the plumbum API correctly, we could even set the flag implicitly based on an envvar using cli.Flag(...., envname="CI").

Done.
I wasn't aware. Works like a charm.

acceptance/router_benchmark/BUILD.bazel line 8 at r2 (raw file):

Previously, matzf (Matthias Frei) wrote…

I'm actually a bit confused how this can even run the ./bin/end2end as a subcommand binary. How is this even built and how come this is on the appropriate path to run here? Usually bazel is such a stickler about this...

I honestly don't know. I didn't invent this; I cargo-culted it from trc_update/BUILD.bazel. Apparently the value of "--executable" is actually a mapping: "end2end_integration" -> path_of(tools/end2end_integration). The rest is the business of common/base.py (which builds the map) and /test.py that peruses it. Bazel is content because the executable file is listed as a data dependency of the test.

acceptance/router_benchmark/test.py line 3 at r2 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit copy pasted copyright

Done

pkg/snet/packet_conn.go line 118 at r2 (raw file):

Previously, matzf (Matthias Frei) wrote…

Could we deal with this Flow ID change in a separate PR, so that this gets the appropriate visibility?

IMO, PacketConn this is not the right API level to expose this. The PacketConn is sort of a "raw" connection. A user of this API can be expected to set the FlowID directly in the Packet.

I would add this to the snet.Conn instead, setting it when creating a packet in scionConnWriter.WriteTo. We can e.g. set this based on a hash of src/dst ISD, AS, Host address and Port (see e.g. https://www.rfc-editor.org/rfc/rfc6437#appendix-A).
I don't think this requires a backwards compatibility layer. From an application perspective, with using a constant FlowID, we would have possibly had an ordering guarantee between packets sent from different snet.Conns. I doubt that this would even have worked in the first place (dispatcher / sending host might not guarantee the send order already). (Ab-)Using this in an application would probably have been tricky enough that I'm sure someone would have noticed that it's a bad idea.

I suspected you'd object to such an ad-hoc API change. I just did a couple of runs with and without... the performance difference is actually unknown because I just discovered while deleting the code that I had forgotten to use the passed flowID value when serializing! So, I guess we can live without that until we had time to add the feature properly.

tools/end2end/main.go line 17 at r2 (raw file):

Previously, matzf (Matthias Frei) wrote…

I'm not sure about generalizing this end2end thing. It feels to me like this makes things much more complicated than they need to be.

Wouldn't a separate "blaster" be much shorter and simpler than this? We only want run one test at a time, and so we don't seem to benefit much from the end2end_integration harness. The server side becomes trivial if we don't send any replies and just report received packets out of band.

Note that we could also use the end2end_integration harness with a different command than end2end -- it just assumes that some flags are supported -- so this could still be used with a simpler "blaster" test.

I don't know, what do you think?

A separate blaster may be a "simpler" solution to execute, but not to code, because end2end is already written. I was about to write a separate tool when I realized that the changes I needed to make to end2end to satisfy my needs where very simple. In the long run I agree with you. We'll eventually make a specialized tool. Not because it's easier but because it can test fewer components at a time (i.e. one), which will give more accurate readings. For now, I'm happy that I could meter all 5 traffic types in two runs with one topology. I find it pretty convenient to get some early results. So, if it's ok with you, I'd prefer to keep this for now and do another round of improvements in a later PR.

tools/integration/integrationlib/common.go line 150 at r2 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit. This is now printing "Retrying..." after the last attempt.

Done

Summary: * Stop abusing the num_processors config variable and add a num_cores config for that purpose. (This gives us the flexibility of configuring processors and cores independently. * Adjust the core allocation scheme in the test harness. Gives better results and better balance in presence of other than 12 cores. * Document the new config variable.

matzf

Reviewed 1 of 14 files at r1, 5 of 9 files at r5, all commit messages.
Reviewable status: 7 of 16 files reviewed, 4 unresolved discussions (waiting on @jiceatscion)

acceptance/router_benchmark/BUILD.bazel line 8 at r2 (raw file):

Previously, jiceatscion wrote…

I honestly don't know. I didn't invent this; I cargo-culted it from trc_update/BUILD.bazel. Apparently the value of "--executable" is actually a mapping: "end2end_integration" -> path_of(tools/end2end_integration). The rest is the business of common/base.py (which builds the map) and /test.py that peruses it. Bazel is content because the executable file is listed as a data dependency of the test.

Yup, I saw that it's the same in trc_update. I do understand the end2end_integration, but I'm confused about end2end which is invoked by end2end_integration. I think this is escapes the bazel sandbox somehow, which probably means that it only works because we build into bin/ before.

doc/manuals/router.rst line 182 at r5 (raw file):

      The send buffer size in bytes. 0 means use system default.

   .. option:: router.num_cores = <int> (Default: ``GOMAXPROCS``)

I would prefer not to create a redundant option to control this. The existing GOMAXPROCS already does what we need. I think it's a great idea to document better that this (and other GO...) environment variables apply.

In the test setup, adding an environment variable is just as easy as editing the toml, it can be set in the docker-compose file.
(Btw, the compose file also has some options to limit cpu usage. AFAIU, setting the cpuset option would result in an affinity mask which the go runtime does take into account for the default GOMAXPROC value. The go runtime does not, however, take into account CPU quotas with cpus value.)

pkg/snet/dispatcher.go line 59 at r5 (raw file):

	ia addr.IA,
	registration *net.UDPAddr,
	svc addr.SVC) (PacketConn, uint16, error) {

Nit: maybe revert this file and snet/packet_conn.go entirely to avoid any git blame and or merge issues.

pkg/snet/packet_conn.go line 118 at r2 (raw file):

Previously, jiceatscion wrote…

I suspected you'd object to such an ad-hoc API change. I just did a couple of runs with and without... the performance difference is actually unknown because I just discovered while deleting the code that I had forgotten to use the passed flowID value when serializing! So, I guess we can live without that until we had time to add the feature properly.

Ok!

tools/end2end/main.go line 17 at r2 (raw file):

A separate blaster may be a "simpler" solution to execute, but not to code, because end2end is already written.

The end2end tool grew by about 40% lines of code to include the packetflood. My gut feeling is that with a copy-pasted and stream-lined version of end2end that only does the "packetflood", we would end up with not much more code in total.
I don't like that this now conflates two mostly separate code paths just to share some of the boilerplate, but I think I don't feel strongly enough about this end2end test to object. Especially, as you already have plans to replace this. @scionproto/scion-core-team opinions?

topology/router_bm.topo line 1 at r5 (raw file):

--- # Bespoke Topology for router benchmarking:

Nit. For other tests that expect specific topology, we put this .topo file into the testdata, i.e. acceptance/router_benchmark/testdata/. Probably a good way to keep things tidy.

jiceatscion

Reviewable status: 7 of 16 files reviewed, 2 unresolved discussions (waiting on @matzf)

acceptance/router_benchmark/BUILD.bazel line 8 at r2 (raw file):

Previously, matzf (Matthias Frei) wrote…

Yup, I saw that it's the same in trc_update. I do understand the end2end_integration, but I'm confused about end2end which is invoked by end2end_integration. I think this is escapes the bazel sandbox somehow, which probably means that it only works because we build into bin/ before.

mmm, end2end_integration might need to have bin/end2end as a data dependency too. It seems to be enough. Although I don't know precisely how bazel manages it.

doc/manuals/router.rst line 182 at r5 (raw file):

Previously, matzf (Matthias Frei) wrote…

I would prefer not to create a redundant option to control this. The existing GOMAXPROCS already does what we need. I think it's a great idea to document better that this (and other GO...) environment variables apply.

In the test setup, adding an environment variable is just as easy as editing the toml, it can be set in the docker-compose file.
(Btw, the compose file also has some options to limit cpu usage. AFAIU, setting the cpuset option would result in an affinity mask which the go runtime does take into account for the default GOMAXPROC value. The go runtime does not, however, take into account CPU quotas with cpus value.)

Well, yes it occured to me that we could use the env var directly. I just loathe env vars. They pop in and out of existence with no dicernable origin and no scope. I thought I was doing us a favor by adding an explicit config... but ok. I used the env var instead.

pkg/snet/dispatcher.go line 59 at r5 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit: maybe revert this file and snet/packet_conn.go entirely to avoid any git blame and or merge issues.

done

tools/end2end/main.go line 17 at r2 (raw file):

Previously, matzf (Matthias Frei) wrote…

A separate blaster may be a "simpler" solution to execute, but not to code, because end2end is already written.

The end2end tool grew by about 40% lines of code to include the packetflood. My gut feeling is that with a copy-pasted and stream-lined version of end2end that only does the "packetflood", we would end up with not much more code in total.
I don't like that this now conflates two mostly separate code paths just to share some of the boilerplate, but I think I don't feel strongly enough about this end2end test to object. Especially, as you already have plans to replace this. @scionproto/scion-core-team opinions?

Yeah, in terms of code quantity it ended-up about the same. It was just an easier (seeming) path to follow. Ideally I would just ditch this and replace it with the packet blaster approach, but something tells me it isn't going to be a 2 days job. The nitty-gritty is rather messy. I really don't want to drag this PR on forever. The test itself may not be great but the pattern is established and produces result. So I'd like to submit it. However I have no problem moving the packetflood game into its own binary. So...stay tuned for that one.

topology/router_bm.topo line 1 at r5 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit. For other tests that expect specific topology, we put this .topo file into the testdata, i.e. acceptance/router_benchmark/testdata/. Probably a good way to keep things tidy.

router_benchmark is like end2end_integration. You can tun it by hand outside the test framework. That's why I put the topo file in the same location. But yeah, I can move the file. Done

Specifically: * Use the GOMAXPROCS environment variable instead of a setting config field for the routers. * Move the router_bm.topo file to router_benchmark/test_data/ * Full restore of some unchanged files. * Add an explicit dependency of end2end_integration to the end2end binary.

jiceatscion

Reviewable status: 4 of 17 files reviewed, 2 unresolved discussions (waiting on @matzf)

doc/manuals/router.rst line 182 at r5 (raw file):

Previously, jiceatscion wrote…

Well, yes it occured to me that we could use the env var directly. I just loathe env vars. They pop in and out of existence with no dicernable origin and no scope. I thought I was doing us a favor by adding an explicit config... but ok. I used the env var instead.

done

Moved it to a sibling binary named end2endblast. As a result end2end* do not take a "game" argument any more.

jiceatscion

Reviewable status: 4 of 21 files reviewed, 1 unresolved discussion (waiting on @matzf)

tools/end2end/main.go line 17 at r2 (raw file):

Previously, jiceatscion wrote…

Yeah, in terms of code quantity it ended-up about the same. It was just an easier (seeming) path to follow. Ideally I would just ditch this and replace it with the packet blaster approach, but something tells me it isn't going to be a 2 days job. The nitty-gritty is rather messy. I really don't want to drag this PR on forever. The test itself may not be great but the pattern is established and produces result. So I'd like to submit it. However I have no problem moving the packetflood game into its own binary. So...stay tuned for that one.

done

matzf

Reviewed 1 of 14 files at r1, 7 of 14 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: 13 of 21 files reviewed, 5 unresolved discussions (waiting on @jiceatscion)

acceptance/router_benchmark/BUILD.bazel line 8 at r2 (raw file):

Previously, jiceatscion wrote…

mmm, end2end_integration might need to have bin/end2end as a data dependency too. It seems to be enough. Although I don't know precisely how bazel manages it.

For the record; as you found out, the end2end is baked into the tester docker image, that's how it ends up in the right place.

doc/manuals/router.rst line 182 at r5 (raw file):

Previously, jiceatscion wrote…

done

Ah, I see. I appreciate that you want to improve the configuration, sorry for rudely shooting it down!
As we cannot get rid of the environment variable even if we wanted, also exposing this as a configuration option means that the settings can conflict and one takes precedence, which, IMO, is even nastier than the environment variable alone. Hope this makes some sense.

doc/manuals/router.rst line 214 at r7 (raw file):

      The number of kernel threads that go creates depends on the number of usable cores, which is
      controlled by the environment variable ``GOMAXPROCS``. See :ref:`GOMAXPROCS <gomaxprocs>`.

Nit: use the :env: role, this directly creates a reference and the right formatting. Then we don't need the separate anchor _gomaxprocs.

Suggestion:

controlled by the environment variable :env:`GOMAXPROCS`.

tools/end2end/main.go line 17 at r2 (raw file):

Previously, jiceatscion wrote…

done

Ok, got it. Seems good to me.

tools/end2end/main.go line 360 at r7 (raw file):

}

func (c *client) ping(ctx context.Context, n int, path snet.Path, logIfOk bool) error {

Nit: remove the logIfOk (true here in end2end, false in end2endblast).

tools/end2endblast/main.go line 117 at r7 (raw file):

	flag.Var(timeout, "timeout", "The timeout for each attempt")
	flag.BoolVar(&epic, "epic", false, "Enable EPIC")
	flag.BoolVar(&traces, "traces", true, "Enable Jaeger traces")

Nit: as far as I understand, the traces as they are implemented here never make much sense for the blaster. I think it would make sense to remove this flag here and drop all the related code. In the end2end, on the other hand, we could revert to always enable this.

tools/end2endblast/main.go line 306 at r7 (raw file):

	defer c.sdConn.Close()
	c.errorPaths = make(map[snet.PathFingerprint]struct{})
	pong_out := make(chan int)

Nit: pongOut (also pingResult, pongResult, ...)

tools/end2endblast/main.go line 359 at r7 (raw file):

			return true
		}
		c.path = p // struct fields cannot be assigned with :=

Nit: in contrast to end2end, this does not switch to alternative paths on errors. We could clean this up by moving this getRemote call out, before the first call of the blindPing.
We can also remove everything related to errorPaths

Namely: * Doc formating * Simplify end2endblast now that its function is separate from that of end2end. * Restore endend almost completely to its original form for the same reason.

jiceatscion

Reviewable status: 13 of 21 files reviewed, 1 unresolved discussion (waiting on @matzf)

doc/manuals/router.rst line 182 at r5 (raw file):

Previously, matzf (Matthias Frei) wrote…

Ah, I see. I appreciate that you want to improve the configuration, sorry for rudely shooting it down!
As we cannot get rid of the environment variable even if we wanted, also exposing this as a configuration option means that the settings can conflict and one takes precedence, which, IMO, is even nastier than the environment variable alone. Hope this makes some sense.

Well, yeah. For the config approach to be an actual improvement, we would need to also suppress the effect of the env var (which we can, btw: call GOMAXPROC(NumCPU()) by default). But that's added mess for too little benefit. I do find that there are too many sources of configuration in general; not just for this one thing, so if it gets addressed it better be as a separate more comprehensive effort.

tools/end2end/main.go line 360 at r7 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit: remove the logIfOk (true here in end2end, false in end2endblast).

done

tools/end2endblast/main.go line 117 at r7 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit: as far as I understand, the traces as they are implemented here never make much sense for the blaster. I think it would make sense to remove this flag here and drop all the related code. In the end2end, on the other hand, we could revert to always enable this.

Done. Might as well now.

tools/end2endblast/main.go line 306 at r7 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit: pongOut (also pingResult, pongResult, ...)

Done. Sorry for letting snakes in the camel pen.

tools/end2endblast/main.go line 359 at r7 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit: in contrast to end2end, this does not switch to alternative paths on errors. We could clean this up by moving this getRemote call out, before the first call of the blindPing.
We can also remove everything related to errorPaths

Done... and then some.

jiceatscion

Reviewable status: 12 of 21 files reviewed, all discussions resolved (waiting on @matzf)

doc/manuals/router.rst line 214 at r7 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit: use the :env: role, this directly creates a reference and the right formatting. Then we don't need the separate anchor _gomaxprocs.

done

That is: * Fix description of end2end.go. * Delete NilCtx(), no-longer used.

matzf

Reviewed 1 of 14 files at r1, 1 of 14 files at r6, 9 of 9 files at r8, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @jiceatscion)

matzf

Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @jiceatscion)

…cionproto#4437) * Enhance the end2end_integration programs as follows: * end2end_integration has a slightly finer-grain role-based pair selection capability. * end2end_integration outputs an extra result line that provides precise start/stop timestamps for the whole test. * process_metrics exposes the number of cores used by Go. * Add end2endblast. Similar to end2end but plays a game other than pingpong: packetblast. It keeps logging and tracing to a minimum. The server side responds to very few of the pings (1/256). Just enough to prove that the circuit works. * Add an integration test called router_benchmark: * relies on an especially crafted topology: router_bm.topo * uses end2end_integration with command end2endblast over that topology as a load test for the router * extracts and logs relevant performance metrics after the test * compares performances with a set of expectations if executed by CI. As a benchmark program, this is only a first approximation. Extending the end2end suite provided a quick solution but isn't ideal: it uses too many cores at once due to requiring 5 routers actively running for one of the tests. More work is in progress to replace this with some leaner setup where only one router is actually running.

jiceatscion added 15 commits November 6, 2023 19:09

router: first draft of a router load test.

0a8d2b8

router: add missing handlepanic in go routine.

25e19a0

router: simplification. No need to pass local variable channel to gor…

e3579b8

…outine closure. Just let it be captured.

router: Use subsets to collect performance for specific traffic types.

044da3d

Also suppress most pongs in the packetflood game, in order to allow some routers to have only IN or only OUT traffic.

router: fix pong handling in packetflood game, plus linter-pleasing.

2ab0cec

Changes: * Make server methods take a pointer receiver so pongs count isn't reset for each call. * Formatting and so on.

router: Improved performance test and needed metrics

067577c

Mainly: * Add mertic to expose number of cores used by Go. * Clarify running/runnable metrics. * Collect the correct performance data from the test run. * Fixed the router_bm topo so that there actually are two ISDs.

router: improve benchmarking metrics accuracy.

95de5df

Have the test driver supply the best metrics time-window to the test harness.

router: make router perf test more reliable.

f3d76b5

Rely on wait_connectivity rather than the passage of time.

router: restore bug which is fied in trunk.

7628689

This is in the faint hope of evading a merge conflict.

router: Unfixing that bug was the wrong move.

eb01da4

...Or may be I had already rebased on top of the trunk's fix. Don't know.

jiceatscion requested a review from a team as a code owner November 6, 2023 18:36

jiceatscion added 13 commits November 7, 2023 10:33

router: fix formating.

363519a

Had to skip the autogenerated docs in rules_openapi/tools as it seems we have no control on those and mdlint doesn't like them anymore.

router: catch-up with the docker-compose update.

a07abf1

router: Add expectations and checks to the router_benchmark test.

37f1096

Checks are only performed during CI. Expectations have been extrapolated from one initial CI run.

router: fix formating and cmd line argument.

ab970ab

router: pass the --ci flag only to the router_benchmark test for now.

6a1aaf5

router: strange quote misshap in pipeline_lib.sh

d655f57

router: add missing fi.

d40ff53

That was the problem.

router: Make the benchmark test log the number of cores.

7c521c9

Just so that I can easily see what the CI environment is like.

router: benchmark output improvements.

10012c6

Delete edundant log. Move core count cpature and output to the end.

router: have the benchmark use a 10 second window.

fe480c8

Five seconds gave too much variability. It seems that pkt rate and cpu rate don't align very well or that cpu rate is very volatile. That caused pkts/cpu-sec to have large random spikes. With 10s the spikes are gone.

router: fix formatting.

0dcc190

router: lower expectations on the regression test.

dbcf003

Since switching to a 10s window the average observed performance may have been reduced a bit.

matzf suggested changes Nov 9, 2023

View reviewed changes

jiceatscion added 2 commits November 9, 2023 16:57

router: Increase expectations from benchmark.

b655c50

Observed numbers have about double after improving test.

router: implement reviewers suggestions.

934f992

jiceatscion commented Nov 9, 2023

View reviewed changes

jiceatscion added 2 commits November 10, 2023 16:54

router: restore pipeline_lib.sh to its original state.

180bbc7

matzf suggested changes Nov 14, 2023

View reviewed changes

jiceatscion commented Nov 14, 2023

View reviewed changes

router: split the packetflood game out of the end2end binary.

9004f87

Moved it to a sibling binary named end2endblast. As a result end2end* do not take a "game" argument any more.

jiceatscion commented Nov 15, 2023

View reviewed changes

jiceatscion added 2 commits November 15, 2023 18:37

router: fix ref in doc.

49826fa

router: missing word in doc.

65a4fe5

matzf reviewed Nov 16, 2023

View reviewed changes

router: Implement reviewers suggestions.

eea565f

Namely: * Doc formating * Simplify end2endblast now that its function is separate from that of end2end. * Restore endend almost completely to its original form for the same reason.

jiceatscion commented Nov 16, 2023

View reviewed changes

jiceatscion added 5 commits November 16, 2023 13:28

router: leftovers cleanup.

91b7e22

That is: * Fix description of end2end.go. * Delete NilCtx(), no-longer used.

router: erase more unnecessary changes.

02ce0ca

router: another formating issue.

4a9acf4

router: update copyright notices.

602bc1f

router: increase test timeout to allow for slower CI system.

90394e7

matzf approved these changes Nov 16, 2023

View reviewed changes

Merge branch 'master' into router_benchmark

dba5192

matzf changed the title ~~Add a router benchmark program and also use it as integration test~~ testing: add router benchmark program and use it as integration test Nov 17, 2023

matzf merged commit 2136a63 into scionproto:master Nov 17, 2023
4 checks passed

jiceatscion deleted the router_benchmark branch December 7, 2023 10:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

testing: add router benchmark program and use it as integration test #4437

testing: add router benchmark program and use it as integration test #4437

jiceatscion commented Nov 6, 2023 •

edited

Loading

matzf left a comment

jiceatscion left a comment

matzf left a comment

jiceatscion left a comment

jiceatscion left a comment

jiceatscion left a comment

matzf left a comment

jiceatscion left a comment

jiceatscion left a comment

matzf left a comment

matzf left a comment

testing: add router benchmark program and use it as integration test #4437

testing: add router benchmark program and use it as integration test #4437

Conversation

jiceatscion commented Nov 6, 2023 • edited Loading

matzf left a comment

Choose a reason for hiding this comment

jiceatscion left a comment

Choose a reason for hiding this comment

matzf left a comment

Choose a reason for hiding this comment

jiceatscion left a comment

Choose a reason for hiding this comment

jiceatscion left a comment

Choose a reason for hiding this comment

jiceatscion left a comment

Choose a reason for hiding this comment

matzf left a comment

Choose a reason for hiding this comment

jiceatscion left a comment

Choose a reason for hiding this comment

jiceatscion left a comment

Choose a reason for hiding this comment

matzf left a comment

Choose a reason for hiding this comment

matzf left a comment

Choose a reason for hiding this comment

jiceatscion commented Nov 6, 2023 •

edited

Loading