Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

router: extend packet processing metrics #4422

Merged
merged 29 commits into from
Oct 30, 2023

Conversation

jiceatscion
Copy link
Contributor

@jiceatscion jiceatscion commented Oct 16, 2023

Improve metrics for router packet processing:

  • Classify packets by size bucket: log2(size)
  • Classify packets by traffic type: incoming, outgoing, as-transit-in/out, br-transit.
  • Add process-level on-cpu and sleep time accounting

closes #4409


This change is Reviewable

jiceatscion and others added 11 commits October 4, 2023 19:14
… metrics.

Packet counters are labeled with a "crossing" flag, which denotes
whether a packet is crossing the local AS boundary or not.

Combined the existing separate ingress/egress counters and the existing
interface labels (from which we can tell internal from external interfaces),
this allows to counts packets according to 4 types of processing:

incoming: from another AS to the local AS
outgoing: from the local AS to another AS
local: from the local AS to the local AS
transit: from another AS to another AS

It should be noted that these counters are about a single router,
so a packet that transits through an AS via two routers will be
counted by each router independently: once as incoming and once as
outgoing. Not as transit.
A bit of naming cleanup ensued.
* Added error logging.
* Added preemption count reporting.
The latter provides the time spend in runnable-not-running state
which is what we truly need to do scheduling-aware performance
observations.
@jiceatscion jiceatscion requested a review from matzf as a code owner October 16, 2023 08:38
* Massage (interface=internal/not-internal, crossing=t/f) into a traffic type,
so we don't have to do it with interminably long queries.
* This required updating prometheus to the latest, which is just as well.
@matzf matzf changed the title Router metrics router: extend packet processing metrics Oct 18, 2023
The collected metrics are originally per-threads. Collecting them means
knowing what threads exist. The procfs API isn't smart about it and generates
work for the GC. To prevent it, we stat /proc/pid/tasks ourselves and check if
the count of files has changed. Go never kills threads; only adds them. So
any set of changes is a change of count.

Added a metric to get an idea of how often we actually have to pay the price:
process_metrics_tasklist_updates_total
Copy link
Contributor

@matzf matzf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 15 files at r2.
Reviewable status: 7 of 16 files reviewed, 6 unresolved discussions (waiting on @jiceatscion)


pkg/private/processmetrics/BUILD.bazel~ line 0 at r4 (raw file):
Accidentally commited backup file? Same for processmetrics.go~. Btw, feel free to add a gitignore pattern for emacs.


pkg/private/processmetrics/processmetrics_linux.go line 43 at r4 (raw file):

	// runningTime should be the same as uTime+sTime reported in a variety of other ways,
	// but when doing calculations, better use the two data from the same source. So, collect them
	// both.

Since go 1.20, the runtime/metrics package has a bunch of CPU metrics. In particular /cpu/classes/idle:cpu-seconds and /cpu/classes/total appear to be the value that we're after. It appears to be tracked by the go runtime and thus is likely not exactly the same value as the one exposed by the OS scheduler.
I believe this data should be exported by prometheus as go_cpu_classes_idle_cpu_seconds_total and go_cpu_classes_total_cpu_seconds_total. For some reason, I couldn't make this happen though yet, I'll have another look. If this works, it seems that we don't even need this custom collector here.


pkg/private/processmetrics/processmetrics_linux.go line 70 at r4 (raw file):

	// FIXME (if we can): AllThreads builds a list and it
	// ends-up on the garbage pile.
	myProcs, err := procfs.AllThreads(c.myPid)

I think it would be safe to get this list of threads during the initialization and then keep using it. As far as I understand, the go runtime spins up threads during the initialization


pkg/private/processmetrics/processmetrics_linux.go line 80 at r4 (raw file):

	// and remove only the unupdated ones, but Go makes map values un-addressible. We'd need
	// to use pointers instead, which would probably make things worse. TBD.
	for pid := range c.lastSchedstats {

Could use clear(c.lastSchedstats) (new in go 1.21).

If garbage is a concern, couldn't we just some over the threads already here, instead of keeping the map and then summing later?


pkg/private/processmetrics/processmetrics_linux.go line 133 at r4 (raw file):

// It is safe to ignore errors from this but prometheus may lack some
// metrics.
func NewProcStatCollector() error {

I'd expect that New... returns something. If this directly registers the collector, perhapsInit.. or Register... would be clearer.


pkg/private/processmetrics/processmetrics_linux.go line 149 at r4 (raw file):

	err = prometheus.Register(c)
	if err != nil {
		log.Error("NewProcStatCollector", "registration failed", err)

IMO, ideally we'd move the logging outside the library in the calling code.

Per reviewer's suggestion:
* Rename Newxyz to Init(), since not returning anything.
* Moved logging to caller side.
* Normalized errors a bit.
Copy link
Contributor Author

@jiceatscion jiceatscion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 7 of 16 files reviewed, 3 unresolved discussions (waiting on @matzf)


pkg/private/processmetrics/BUILD.bazel~ line at r4 (raw file):

Previously, matzf (Matthias Frei) wrote…

Accidentally commited backup file? Same for processmetrics.go~. Btw, feel free to add a gitignore pattern for emacs.

Done.


pkg/private/processmetrics/processmetrics_linux.go line 43 at r4 (raw file):

Previously, matzf (Matthias Frei) wrote…

Since go 1.20, the runtime/metrics package has a bunch of CPU metrics. In particular /cpu/classes/idle:cpu-seconds and /cpu/classes/total appear to be the value that we're after. It appears to be tracked by the go runtime and thus is likely not exactly the same value as the one exposed by the OS scheduler.
I believe this data should be exported by prometheus as go_cpu_classes_idle_cpu_seconds_total and go_cpu_classes_total_cpu_seconds_total. For some reason, I couldn't make this happen though yet, I'll have another look. If this works, it seems that we don't even need this custom collector here.

/cpu/classes/total:cpu-seconds seemed promising, but it is defined as "GOMAXPROCS integrated over the wall-clock duration this process has been executing for." Which means that includes all periods of "Runnable" state. Where the process is kept off the CPU by preemption. Every existing metric I looked at, other than "schedstats" has that problem. The Go doc is even brutally honest about the disconnection with reality: "Compare only with other /cpu/classes metrics". So, I'd rather use something the meaning of which I'm pretty sure is what I need.


pkg/private/processmetrics/processmetrics_linux.go line 70 at r4 (raw file):

Previously, matzf (Matthias Frei) wrote…

I think it would be safe to get this list of threads during the initialization and then keep using it. As far as I understand, the go runtime spins up threads during the initialization

I have found differently. The line-up increases in 2 or 3 increments before stabilizing. The number of thread never goes down, though. That is confirmed by several feature requests since 2016, never addressed. As you see in my latest revision. I now refresh the list only when needed. That's 2 or 3 times total.


pkg/private/processmetrics/processmetrics_linux.go line 80 at r4 (raw file):

Previously, matzf (Matthias Frei) wrote…

Could use clear(c.lastSchedstats) (new in go 1.21).

If garbage is a concern, couldn't we just some over the threads already here, instead of keeping the map and then summing later?

You're right. I was trying to stick to a more general pattern of gathering everything and be free to report what we want later. However, the cost greatly exceeds the benefits. Simplified the map out.
done


pkg/private/processmetrics/processmetrics_linux.go line 133 at r4 (raw file):

Previously, matzf (Matthias Frei) wrote…

I'd expect that New... returns something. If this directly registers the collector, perhapsInit.. or Register... would be clearer.

Done.


pkg/private/processmetrics/processmetrics_linux.go line 149 at r4 (raw file):

Previously, matzf (Matthias Frei) wrote…

IMO, ideally we'd move the logging outside the library in the calling code.

Done.

Copy link
Contributor

@matzf matzf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 15 files at r2, 6 of 7 files at r6, all commit messages.
Reviewable status: 11 of 17 files reviewed, 4 unresolved discussions (waiting on @jiceatscion)

a discussion (no related file):
As discussed offline, it seems clearer to additionally record the "ingress" interface only for output packets, instead of this crossing label. This makes the label "schemas" different for processed/output packets vs input and dropped packets though (but this does seem to make conceptual sense).



pkg/private/processmetrics/processmetrics_linux.go line 43 at r4 (raw file):

Previously, jiceatscion wrote…

/cpu/classes/total:cpu-seconds seemed promising, but it is defined as "GOMAXPROCS integrated over the wall-clock duration this process has been executing for." Which means that includes all periods of "Runnable" state. Where the process is kept off the CPU by preemption. Every existing metric I looked at, other than "schedstats" has that problem. The Go doc is even brutally honest about the disconnection with reality: "Compare only with other /cpu/classes metrics". So, I'd rather use something the meaning of which I'm pretty sure is what I need.

💯


router/dataplane.go line 986 at r6 (raw file):

				writeMsgs[i].Buffers[0] = writeMsgs[i+written+1].Buffers[0]
				writeMsgs[i].Addr = writeMsgs[i+written+1].Addr
			}

This seems to miss the "transport" of the corresponding ingressID slice, so the accounting will be wrong.

This code was "optimized" for least amount of mental overhead, that's why only the IPv4 messages are tracked here. If we now want to track packet metadata, it's probably much better to reorganize this so that:

  • readUpTo fills a list of packet (instead of ipv4.Message and new also ingressIDs)
  • we only keep track of the packets to be written
  • translate packet to ipv4.Message immediately before the WriteBatch call and dont use the ipv4.Message otherwise

router/dataplane.go line 2465 at r6 (raw file):

// packet's size is <128k. Just in case that would change, larger packets are simply put in the
// last class.
const maxSizeClass = 18

The largest supported packet size is private/comnmon.SupportedMTU == 9188, so maxSizeClass can be 14.


router/dataplane.go line 2488 at r6 (raw file):

// Returns the sizeClass of the given size. All classes above
// maxSizeClass - 1 are conflated with maxSizeClass - 1.

Would it make sense to also introduce a minSizeClass or so? Distinguishing the tiny packets probably does not give any benefit, but they make up the bulk of the different size classes. Everything below 128 bytes or so is just "very small".

Copy link
Contributor

@matzf matzf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 15 files at r2, 6 of 7 files at r6, all commit messages.
Reviewable status: 11 of 17 files reviewed, 4 unresolved discussions (waiting on @jiceatscion)

The "crossing" label had dubious semantics with the goal of keeping the same
schema for all packet metrics. Instead, as reviewer suggested, accepted
that only the output packets really deserved a traffic type and gave up
on the same-schema requirements (joining input and output metrics wouldn't
be that meaningful anyway). This greatly simplifies things.

Now, the real traffic type (not just the "crossing" flag) can be computed
by process(), where it's simple, and kept until the packets are output and
the metrics captured.

The traffic type is also richer than what could be derived from the
crossing flag.
Copy link
Contributor Author

@jiceatscion jiceatscion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 10 of 17 files reviewed, 3 unresolved discussions (waiting on @matzf)

a discussion (no related file):

Previously, matzf (Matthias Frei) wrote…

As discussed offline, it seems clearer to additionally record the "ingress" interface only for output packets, instead of this crossing label. This makes the label "schemas" different for processed/output packets vs input and dropped packets though (but this does seem to make conceptual sense).

I have refactored accordingly and then some. There's now a fine grain traffic type label for output packets/bytes metrics only.



router/dataplane.go line 986 at r6 (raw file):

Previously, matzf (Matthias Frei) wrote…

This seems to miss the "transport" of the corresponding ingressID slice, so the accounting will be wrong.

This code was "optimized" for least amount of mental overhead, that's why only the IPv4 messages are tracked here. If we now want to track packet metadata, it's probably much better to reorganize this so that:

  • readUpTo fills a list of packet (instead of ipv4.Message and new also ingressIDs)
  • we only keep track of the packets to be written
  • translate packet to ipv4.Message immediately before the WriteBatch call and dont use the ipv4.Message otherwise

I have fixed the bug. I have yet to implement the refactoring you suggest....tomorrow.


router/dataplane.go line 2465 at r6 (raw file):

Previously, matzf (Matthias Frei) wrote…

The largest supported packet size is private/comnmon.SupportedMTU == 9188, so maxSizeClass can be 14.

If I do that, then I have to keep track of SupportedMTU. Which, unlike the maximum supported by the protocol, could change easily.


router/dataplane.go line 2488 at r6 (raw file):

Previously, matzf (Matthias Frei) wrote…

Would it make sense to also introduce a minSizeClass or so? Distinguishing the tiny packets probably does not give any benefit, but they make up the bulk of the different size classes. Everything below 128 bytes or so is just "very small".

done

…upported by the rest of the code.

Previously the largest size per the spec was supported.
Copy link
Contributor Author

@jiceatscion jiceatscion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 10 of 17 files reviewed, 3 unresolved discussions (waiting on @matzf)


router/dataplane.go line 2465 at r6 (raw file):

Previously, jiceatscion wrote…

If I do that, then I have to keep track of SupportedMTU. Which, unlike the maximum supported by the protocol, could change easily.

Ok. I added an init function to assert for the size. maxSizeClass is now 15 (it is the first not-supported size).
Done.

This is to avoid remove the code that supports it and having to reinvent it
later if we need it, which is likely.
Copy link
Contributor Author

@jiceatscion jiceatscion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 10 of 17 files reviewed, 3 unresolved discussions (waiting on @matzf)


router/dataplane.go line 986 at r6 (raw file):

Previously, jiceatscion wrote…

I have fixed the bug. I have yet to implement the refactoring you suggest....tomorrow.

Done.
I'll let you judge if it's really better than before. As long as we can't pass []Packet to writeBatch, there isn't a whole lot we can do to avoid managing Packets and Messages in parallel (without sacrificing performance).

The cost ends-up being the same: unsent packets yield to repeated
copying to writeMsg. In one case by shifting the writeMsg buffer.
In the other be recreating writeMsg elements from pkts.
The latter is just simpler.
Copy link
Contributor

@matzf matzf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 10 of 17 files reviewed, 7 unresolved discussions (waiting on @jiceatscion)

a discussion (no related file):

Previously, jiceatscion wrote…

I have refactored accordingly and then some. There's now a fine grain traffic type label for output packets/bytes metrics only.

👍



router/dataplane.go line 986 at r6 (raw file):
I think it would help if we get rid of updating the writeMsgs in this loop here and just stupidly "copy" the entire readPkts into writeMsgs everytime. I doubt this will hurt.

// Note: we could use a disposable array for that, but we'd have to redo the copies for the packets that we retry.

Also, can we name it just pkts or so? "read" being this stupid verb without proper participle makes this a bit confusing...


router/dataplane.go line 2465 at r6 (raw file):

Previously, jiceatscion wrote…

Ok. I added an init function to assert for the size. maxSizeClass is now 15 (it is the first not-supported size).
Done.

Nice. There's a (somewhat cryptic) way to make this a compile time check instead:

const _ = uint(1 << (maxSizeClass-1) - bufSize) // compile time assert: ensure that 2 ^ (maxSizeClass-1) >= bufSize

Checking at compile time avoids that this shows up only very late in the refactoring process, after someone has already declared success when things compile.
Not blocking, if you think it's too obscure just leave it as is.


router/dataplane.go line 953 at r7 (raw file):

		}

		// We need to collect stats by traffic type and size class.

This accounting got a bit verbose now. Would it be possible to move this into a separate function? accountWrittenPkts(metrics, pkts[:written]) or so?
The call to returnPacketToPool should perhaps still be here though for clarity.


router/dataplane.go line 958 at r7 (raw file):

		writtenPkts := [ttMax][maxSizeClass]int{}
		writtenBytes := [ttMax][maxSizeClass]int{}
		for i := range writeMsgs[:written] {

Nit Replace writeMsgs here with readPkts in the loop.


router/dataplane.go line 1001 at r7 (raw file):

func readUpTo(c <-chan packet,
	n int, needsBlocking bool, pkts []packet) int {

Nit: styleguide for multiline function declarations is: "If a function declaration uses more than 1 line, each parameter should be declared on a separate line and the first line of the function body should be empty"

This one probably still fits on a single line though :)


router/dataplane.go line 1078 at r7 (raw file):

	pld := p.lastLayer.LayerPayload()
	minResult := processResult{}

No longer useful, please revert to returning processResult{}.


router/dataplane.go line 2424 at r7 (raw file):

}

// interfaceMetrics is the set of metrics that are relevant for one given interface.

We could move all of metrics stuff (~things below this line) into metrics.go. I think the (*Dataplane).initMetrics should stay, but for the rest this would seem fairly appropriate, and we'd avoid further growing this gargantuan dataplane.go thing. What do you think?


router/dataplane.go line 2523 at r7 (raw file):

		return "inTransit"
	case ttOutTransit:
		return "outTransit"

We use snake_case for the metric labels and values.

In passing, apply reviewer's suggestion of replacing the init()
function with a static assertion.
* Improved variables naming
* move metrics updates out of runForwarder
Copy link
Contributor Author

@jiceatscion jiceatscion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 10 of 17 files reviewed, 4 unresolved discussions (waiting on @matzf)


router/dataplane.go line 986 at r6 (raw file):

Previously, matzf (Matthias Frei) wrote…

I think it would help if we get rid of updating the writeMsgs in this loop here and just stupidly "copy" the entire readPkts into writeMsgs everytime. I doubt this will hurt.

// Note: we could use a disposable array for that, but we'd have to redo the copies for the packets that we retry.

Also, can we name it just pkts or so? "read" being this stupid verb without proper participle makes this a bit confusing...

Done
I came to the same conclusion. Did some more variable renaming.


router/dataplane.go line 2465 at r6 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nice. There's a (somewhat cryptic) way to make this a compile time check instead:

const _ = uint(1 << (maxSizeClass-1) - bufSize) // compile time assert: ensure that 2 ^ (maxSizeClass-1) >= bufSize

Checking at compile time avoids that this shows up only very late in the refactoring process, after someone has already declared success when things compile.
Not blocking, if you think it's too obscure just leave it as is.

Done


router/dataplane.go line 953 at r7 (raw file):

Previously, matzf (Matthias Frei) wrote…

This accounting got a bit verbose now. Would it be possible to move this into a separate function? accountWrittenPkts(metrics, pkts[:written]) or so?
The call to returnPacketToPool should perhaps still be here though for clarity.

Done


router/dataplane.go line 958 at r7 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit Replace writeMsgs here with readPkts in the loop.

Done


router/dataplane.go line 1001 at r7 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit: styleguide for multiline function declarations is: "If a function declaration uses more than 1 line, each parameter should be declared on a separate line and the first line of the function body should be empty"

This one probably still fits on a single line though :)

Done
fits one line again now.


router/dataplane.go line 1078 at r7 (raw file):

Previously, matzf (Matthias Frei) wrote…

No longer useful, please revert to returning processResult{}.

Done


router/dataplane.go line 2424 at r7 (raw file):

Previously, matzf (Matthias Frei) wrote…

We could move all of metrics stuff (~things below this line) into metrics.go. I think the (*Dataplane).initMetrics should stay, but for the rest this would seem fairly appropriate, and we'd avoid further growing this gargantuan dataplane.go thing. What do you think?

Done
I agree.


router/dataplane.go line 2523 at r7 (raw file):

Previously, matzf (Matthias Frei) wrote…

We use snake_case for the metric labels and values.

Done

Copy link
Contributor

@matzf matzf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 15 files at r2.
Reviewable status: 11 of 17 files reviewed, 10 unresolved discussions (waiting on @jiceatscion)


router/dataplane.go line 626 at r10 (raw file):

		// For non-broken packets, we defer the ingress-side metrics accounting
		// until we have a chance to figure the traffic type.
		// That's in runProcessor and processPkt.

This comment is no longer accurate.


router/dataplane.go line 927 at r10 (raw file):

	}
	for t := ttOther; t < ttMax; t++ {
		for sc := sizeClass(0); sc < maxSizeClass; sc++ {

for sc := minSizeClass; ...


router/dataplane.go line 930 at r10 (raw file):

			if writtenPkts[t][sc] > 0 {
				metrics[sc].Output[t].OutputPacketsTotal.Add(
					float64(writtenPkts[t][sc]))

Nit: fits on one line


router/dataplane.go line 941 at r10 (raw file):

	cfg *RunConfig, c <-chan packet) {

	fmt.Println("Initialize forwarder for", "interface")

Leftover printf


router/dataplane.go line 959 at r10 (raw file):

	for d.running {
		toWrite += readUpTo(c, cfg.BatchSize-toWrite, toWrite == 0,
			pkts[toWrite:])

Nit: also seems to fit onto a single line now


router/export_test.go line 70 at r10 (raw file):

func InitDPMetrics(dp *DataPlane) {
	dp.initMetrics()

Move this into NewDP? It seems we always call this now


router/metrics.go line 220 at r10 (raw file):

func (sc sizeClass) String() string {
	return strings.Join(
		[]string{strconv.Itoa((1 << sc) >> 1), strconv.Itoa(1<<sc - 1)},

Nit: this is not accurate for minSizeClass which would now show as 32_63 (or so.. off by some), instead of 0_63.
Btw, Sprintf() might be simpler here.


router/metrics.go line 227 at r10 (raw file):

// trafficMetrics groups all the metrics instances that all share the same interface AND
// sizeClass label values (but have different names - i.e. they count different things).
type trafficMetrics struct {

Nit: I think it would help readability if we keep the type definitions of these metrics structs close together:

type interfaceMetrics map[sizeClass]trafficMetrics
struct trafficMetrics { ... }
struct outputMetrics { ... }

... size class, traffic type, ...

router/metrics.go line 235 at r10 (raw file):

	DroppedPacketsBusySlowPath  prometheus.Counter
	ProcessedPackets            prometheus.Counter
	Output                      map[trafficType]outputMetrics

Not sure if it's worth optimizing this, but this could be an array instead of a map, Output [ttMax]outputMetrics.
Same for the size classes, actually, but there it's a bit more awkward because of the min offset.

I think this would allow to turn outputMetrics into an anonymous inline struct definition, which potentially makes this easier to navigate.


router/metrics.go line 285 at r10 (raw file):

// counters are already instantiated for all the relevant interfaces so this
// will not have to be repeated during packet forwarding.
func (d *DataPlane) initMetrics() {

My preference would be to keep this function together with all the otherDataplane methods in the dataplane.go file.

If we slightly amend the interface of initInterfaceMetrics so that it takes id uint16, localIA addr.IA, neighbors map[uint16]addr.IA instead of labels and does the call to interfaceToMetricsLabels internally, this can be the only function that is exposed from this metrics.go file.

Copy link
Contributor

@matzf matzf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 11 of 17 files reviewed, 11 unresolved discussions (waiting on @jiceatscion)


tools/topology/monitoring.py line 140 at r10 (raw file):

            if relabels is not None:
                job_scrape_config['metric_relabel_configs'] = relabels
            scrape_configs.append(job_scrape_config)

Left over?

* deleted InitDPMetrics and moved what it does to NewDP.
* moved initMetrics back to dataplane.go because it is a dataplane method.
* shifted some of the label manipulation in initMetrics into initInterfaceMetrics to minimize dataplane's exposure to metrics mechanics.
* sorted declarations in metrics.go to keep important structs together.
* fixed string representation of size class label values
* implement trafficMetrics.Output as an array instead of a map.
* renamed some init* functions to new*, to better eflect what they do.
* reformated to 100 columns.
Copy link
Contributor Author

@jiceatscion jiceatscion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 11 of 17 files reviewed, 8 unresolved discussions (waiting on @matzf)


router/dataplane.go line 626 at r10 (raw file):

Previously, matzf (Matthias Frei) wrote…

This comment is no longer accurate.

Done


router/dataplane.go line 927 at r10 (raw file):

Previously, matzf (Matthias Frei) wrote…

for sc := minSizeClass; ...

Done


router/dataplane.go line 930 at r10 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit: fits on one line

Done


router/dataplane.go line 941 at r10 (raw file):

Previously, matzf (Matthias Frei) wrote…

Leftover printf

Done


router/dataplane.go line 959 at r10 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit: also seems to fit onto a single line now

Done


router/export_test.go line 70 at r10 (raw file):

Previously, matzf (Matthias Frei) wrote…

Move this into NewDP? It seems we always call this now

Done
(Yeah, looks that way, now)


router/metrics.go line 220 at r10 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit: this is not accurate for minSizeClass which would now show as 32_63 (or so.. off by some), instead of 0_63.
Btw, Sprintf() might be simpler here.

Done
Ah, right. Here you are. Upper bound also promoted to "inf".


router/metrics.go line 227 at r10 (raw file):

Previously, matzf (Matthias Frei) wrote…

Nit: I think it would help readability if we keep the type definitions of these metrics structs close together:

type interfaceMetrics map[sizeClass]trafficMetrics
struct trafficMetrics { ... }
struct outputMetrics { ... }

... size class, traffic type, ...

K. Sorted differently.


router/metrics.go line 235 at r10 (raw file):

Previously, matzf (Matthias Frei) wrote…

Not sure if it's worth optimizing this, but this could be an array instead of a map, Output [ttMax]outputMetrics.
Same for the size classes, actually, but there it's a bit more awkward because of the min offset.

I think this would allow to turn outputMetrics into an anonymous inline struct definition, which potentially makes this easier to navigate.

  1. Done
  2. We'll talk.

router/metrics.go line 285 at r10 (raw file):

Previously, matzf (Matthias Frei) wrote…

My preference would be to keep this function together with all the otherDataplane methods in the dataplane.go file.

If we slightly amend the interface of initInterfaceMetrics so that it takes id uint16, localIA addr.IA, neighbors map[uint16]addr.IA instead of labels and does the call to interfaceToMetricsLabels internally, this can be the only function that is exposed from this metrics.go file.

Done. Good idea. Worked out nicely. I took the opportunity to rename the init(.*)Metrics to new(\1)Metrics since that's what these functions ~are. Unfortunately your wish isn't entirely granted: serviceMetricLabels() is still referenced directly somewhere in dataplane.go


tools/topology/monitoring.py line 140 at r10 (raw file):

Previously, matzf (Matthias Frei) wrote…

Left over?

No. I left it there deliberately as I predict that relabeling will be back soon, just for other reasons. I don't want to throw away support for it.

Copy link
Contributor

@matzf matzf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 4 files at r7, 4 of 4 files at r11, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @jiceatscion)


tools/topology/monitoring.py line 140 at r10 (raw file):

Previously, jiceatscion wrote…

No. I left it there deliberately as I predict that relabeling will be back soon, just for other reasons. I don't want to throw away support for it.

👍

Copy link
Contributor

@matzf matzf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r12, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @jiceatscion)

@matzf matzf merged commit fa43782 into scionproto:master Oct 30, 2023
1 check passed
matzf pushed a commit that referenced this pull request Nov 2, 2023
The interface label value of every metric instance had been lost to a refactoring in #4422.
@jiceatscion jiceatscion deleted the router_metrics branch November 13, 2023 16:06
juagargi pushed a commit to netsec-ethz/scion that referenced this pull request Mar 8, 2024
Improve metrics for router packet processing:

* Classify packets by size bucket: log2(size)
* Classify packets by traffic type: incoming, outgoing, as-transit-in/out, br-transit.
* Add process-level on-cpu and sleep time accounting
juagargi pushed a commit to netsec-ethz/scion that referenced this pull request Mar 8, 2024
The interface label value of every metric instance had been lost to a refactoring in scionproto#4422.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

## Define router performance metrics reproducible in (at least) a given developer env
2 participants