Add Fleet Detection Plugin #6151
7e971cf to cad95b7
Last but not least, could you run a perf test with this plugin enabled? We have a benchmark system in place called router-scale (check the docs in our own Confluence space). It would be great to create a flamegraph with these benchmarks to make sure the plugin doesn't take a lot of resources. Feel free to ask if you need help.
"IntrospectionMode": {
  "description": "Which implementation of GraphQL schema introspection to use, if enabled",
  "oneOf": [
    {
      "description": "Use the new Rust-based implementation.",
      "enum": [
        "new"
      ],
      "type": "string"
    },
    {
      "description": "Use the old JavaScript-based implementation.",
      "enum": [
        "legacy"
      ],
      "type": "string"
    },
    {
      "description": "Use Rust-based and Javascript-based implementations side by side, logging warnings if the implementations disagree.",
      "enum": [
        "both"
      ],
      "type": "string"
    }
  ]
},
Why is this part of this PR?
We kept having CI failures, I think because of changes to the JSON schema for the config. I thought this would fix it, but it doesn't. Any ideas much appreciated!
...uter/src/configuration/snapshots/apollo_router__configuration__tests__schema_generation.snap
87dfa7c to 5e47ad8
Ok @bnjjj, I've run the perf tests, building the router from this branch and also enabling metrics so that we can see these now being emitted. I'll upload the logs. I also have a few further questions to push us forward on this:
25ce977 to 492ab28
@jonathanrainer to make a comparison, could you run the same benchmark on dev? That will give you something to compare with this branch.
// We have to store a reference to the gauge otherwise it will be dropped once the plugin is
// initialised, even though it still has data to emit
freq_gauge: ObservableGauge<u64>,
@BrynCooke Could you confirm that it will handle hot reload properly?
Yes I believe it will, but do test
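To illustrate the concern in the code comment above, here is a minimal std-only sketch of why the plugin keeps the gauge as a field. It assumes, as the comment states, that the gauge stops emitting once its handle is dropped. `Registry`, `GaugeHandle` and `register_and_forget` are hypothetical stand-ins written for this example, not the OpenTelemetry API or the router's actual types.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a metrics registry: tracks the names of live gauges.
#[derive(Clone, Default)]
struct Registry {
    live: Arc<Mutex<Vec<String>>>,
}

/// Hypothetical handle that unregisters its gauge when it is dropped.
struct GaugeHandle {
    name: String,
    registry: Registry,
}

impl Registry {
    fn register(&self, name: &str) -> GaugeHandle {
        self.live.lock().unwrap().push(name.to_string());
        GaugeHandle { name: name.to_string(), registry: self.clone() }
    }

    fn live_gauges(&self) -> Vec<String> {
        self.live.lock().unwrap().clone()
    }
}

impl Drop for GaugeHandle {
    fn drop(&mut self) {
        // Remove this gauge from the registry so it no longer emits.
        self.registry.live.lock().unwrap().retain(|n| n != &self.name);
    }
}

/// Plugin that stores the handle, so the gauge lives exactly as long as the plugin.
struct FleetDetector {
    _freq_gauge: GaugeHandle,
}

impl FleetDetector {
    fn new(registry: &Registry) -> Self {
        FleetDetector { _freq_gauge: registry.register("cpu_freq") }
    }
}

/// Registers a gauge but lets the handle go out of scope at the end of the function.
fn register_and_forget(registry: &Registry) {
    let _handle = registry.register("cpu_freq");
}
```

In this model, `register_and_forget` leaves no live gauge behind, while a `FleetDetector` keeps `cpu_freq` registered until the plugin itself is dropped. On hot reload the old plugin's gauge would disappear with it, which is why it is worth testing that the new plugin instance registers its own.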
492ab28 to c652b17
@bnjjj Ah yes, apologies, I should have thought of that. I have done that below and redone the tests I posted above so that the comparison is easier. From the memory figures, it looks like the plugin does increase memory, but it's not the constant increase we were seeing before. The baseline the test starts from also appears higher on the branch, which presumably isn't plugin related because the plugin won't be running in the early part of the test (I imagine). I have attached the logs again; let me know if you think there's anything we need to worry about.
Dev Files
Branch Files
Thanks @jonathanrainer LGTM
CPU and memory should be gauges. We also talked about activate.
84e23d5 to 4770db7
@BrynCooke I think this might be ready for another review from you?
if quota == "-1" {
    system_cpus
} else {
    quota.parse::<u64>().unwrap() / period.parse::<u64>().unwrap()
Just to make sure, are we feeling confident about these unwraps?
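One possible way to address the question, sketched under assumptions: parse both values fallibly and fall back to the system CPU count if either is malformed or the period is zero. The function name `cpus_from_quota_and_period` is hypothetical, and treating a parse failure the same as "no limit" is a design choice for this example rather than something decided in the PR.

```rust
/// Returns the CPU limit implied by quota/period strings, falling back to
/// `system_cpus` when the quota is "-1" (unlimited) or when either value
/// cannot be parsed. Hypothetical helper, not the code from this PR.
fn cpus_from_quota_and_period(quota: &str, period: &str, system_cpus: u64) -> u64 {
    let quota = quota.trim();
    if quota == "-1" {
        return system_cpus;
    }
    match (quota.parse::<u64>(), period.trim().parse::<u64>()) {
        // Guard against a zero period, which would panic on division.
        (Ok(q), Ok(p)) if p > 0 => q / p,
        _ => system_cpus,
    }
}
```

As in the original snippet, this is integer division, so a quota smaller than the period yields 0; whether that should be rounded up is a separate question.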
} else {
    // If it's not max then divide the two to get an integer answer
    let (a, b) = readings.split_once(' ').unwrap();
    a.parse::<u64>().unwrap() / b.parse::<u64>().unwrap()
Just to make sure, are we feeling confident about these unwraps?
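A panic-free variant of the same idea could look like the sketch below. It assumes `readings` holds two space-separated fields where the first is either `max` or a number, as the snippet's comment suggests. The name `cpus_from_readings` and the fallback to `system_cpus` on any malformed input are assumptions made for illustration.

```rust
/// Parses a "<quota> <period>" reading and returns quota / period, falling back
/// to `system_cpus` when the quota is "max", a field is missing or non-numeric,
/// or the period is zero. Hypothetical helper, not the code from this PR.
fn cpus_from_readings(readings: &str, system_cpus: u64) -> u64 {
    readings
        .trim()
        .split_once(' ')
        .and_then(|(quota, period)| {
            // "max" fails to parse as u64, so it takes the fallback path too.
            let q = quota.parse::<u64>().ok()?;
            let p = period.trim().parse::<u64>().ok()?;
            // checked_div returns None for a zero period instead of panicking.
            q.checked_div(p)
        })
        .unwrap_or(system_cpus)
}
```

Each `unwrap` in the original corresponds to one `?` or `checked_div` here, so unexpected file contents degrade to the system-wide count rather than crashing the plugin at startup.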
Adds an initial plugin, that loads at startup and emits metrics for three simple cases: cpus, cpu_freq and total_memory.
Also improve how we handle refreshing the System object, rather than doing it in either the callback or the async task, contain that within a struct and do it there.
Ensures that gauges will survive hot reload
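The commit message above describes moving the refresh logic into a struct. A minimal sketch of that pattern follows, assuming the goal is to re-read system information at most once per interval no matter how often a gauge asks for it. `RefreshingReader` and its fields are hypothetical names, and the closure stands in for whatever actually refreshes the System object.

```rust
use std::time::{Duration, Instant};

/// Hypothetical wrapper that owns a reader and refreshes its cached value
/// only when at least `min_interval` has passed since the last refresh.
struct RefreshingReader<T> {
    read: Box<dyn FnMut() -> T + Send>,
    cached: T,
    last_refresh: Instant,
    min_interval: Duration,
}

impl<T: Clone + 'static> RefreshingReader<T> {
    fn new(read: impl FnMut() -> T + Send + 'static, min_interval: Duration) -> Self {
        let mut read: Box<dyn FnMut() -> T + Send> = Box::new(read);
        // Take an initial reading so `get` always has a value to return.
        let cached = read();
        RefreshingReader { read, cached, last_refresh: Instant::now(), min_interval }
    }

    /// Returns the cached value, refreshing it first if it is stale.
    fn get(&mut self) -> T {
        if self.last_refresh.elapsed() >= self.min_interval {
            self.cached = (self.read)();
            self.last_refresh = Instant::now();
        }
        self.cached.clone()
    }
}
```

Callers such as gauge callbacks then only call `get`, and the decision about when to refresh lives in one place.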
1106758 to c1ec7c7
1. activate is made non-async. It must never fail and it must complete. async fns can halt execution before completion.
2. The spawned thread and channel for fleet detector is removed. There's no need for these. Gauges will be called periodically for export.
3. Telemetry is converted to PrivatePlugin to allow uniform calls to activate.
@jonathanrainer I've pushed a commit that makes things simpler. Please check that you are happy with the change, and also do some manual testing to ensure that things still work.
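The reasoning behind removing the spawned thread can be sketched with a small std-only model: if gauges are callbacks that the exporter invokes on each export, nothing needs to push values in the background, and `activate` can be a plain synchronous function that only registers callbacks. `Exporter`, `register_gauge` and this `activate` signature are hypothetical and simplified for illustration; they are not the router's or OpenTelemetry's actual API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

type Callback = Box<dyn Fn() -> u64 + Send + Sync>;

/// Hypothetical exporter that pulls values from registered gauge callbacks.
#[derive(Default)]
struct Exporter {
    gauges: Vec<(String, Callback)>,
}

impl Exporter {
    fn register_gauge(&mut self, name: &str, callback: impl Fn() -> u64 + Send + Sync + 'static) {
        self.gauges.push((name.to_string(), Box::new(callback)));
    }

    /// Called periodically by the export loop: each callback is evaluated on demand.
    fn export(&self) -> Vec<(String, u64)> {
        self.gauges.iter().map(|(name, cb)| (name.clone(), cb())).collect()
    }
}

/// Synchronous, infallible activation: it only registers a callback and returns.
fn activate(exporter: &mut Exporter, total_memory: Arc<AtomicU64>) {
    exporter.register_gauge("total_memory", move || total_memory.load(Ordering::Relaxed));
}
```

Because the callback reads the current value at export time, a later change to the shared value shows up in the next export without any thread or channel in the plugin.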
Clap's default behaviour was causing this check to fall over.
Adds an initial plugin that loads at startup and emits metrics for three simple cases: cpus, cpu_freq and total_memory. I'm not sure this is the correct approach, especially as this plugin will expand over time, so I'm willing to take any pointers in that regard.
Tested using the OpenTelemetry Collector with the following router config and OTEL config
And produced the following