Resource monitoring and alerts #176

Merged
merged 9 commits into main from resource-monitoring on Oct 19, 2023

Conversation

hardillb
Contributor

part of FlowFuse/flowfuse#2755

Description

Adds a Node-RED plugin that exposes a Prometheus metrics endpoint.

The nr-launcher then scrapes that endpoint to generate resource usage stats.

Alerts will be generated if the configured thresholds are exceeded.
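
As a rough sketch of the scrape-and-alert flow described above (the endpoint path, metric name, threshold and interval below are illustrative assumptions, not values taken from this PR):

```js
// Hypothetical launcher-side scrape loop. METRICS_URL, the metric name and
// the threshold are assumptions for illustration only.
const METRICS_URL = 'http://localhost:1880/metrics'
const THRESHOLDS = { nodejs_heap_size_used_bytes: 200 * 1024 * 1024 }

// Parse simple (label-free) lines of the Prometheus text exposition format
// into an object of { metricName: numericValue }.
function parseMetrics (body) {
    const values = {}
    for (const line of body.split('\n')) {
        if (!line || line.startsWith('#')) continue
        const [name, value] = line.trim().split(/\s+/)
        values[name] = Number(value)
    }
    return values
}

async function scrapeOnce () {
    const res = await fetch(METRICS_URL) // global fetch requires Node.js 18+
    const metrics = parseMetrics(await res.text())
    for (const [name, limit] of Object.entries(THRESHOLDS)) {
        if (metrics[name] > limit) {
            console.warn(`ALERT: ${name}=${metrics[name]} exceeds ${limit}`)
        }
    }
}

setInterval(() => scrapeOnce().catch(err => console.error('scrape failed:', err.message)), 5000)
```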

Related Issue(s)

FlowFuse/flowfuse#2755

Checklist

  • I have read the contribution guidelines
  • Suitable unit/system level tests have been added and they pass
  • Documentation has been updated
    • Upgrade instructions
    • Configuration details
    • Concepts
  • Changes flowforge.yml?
    • Issue/PR raised on FlowFuse/helm to update ConfigMap Template
    • Issue/PR raised on FlowFuse/CloudProject to update values for Staging/Production

Labels

  • Backport needed? -> add the backport label
  • Includes a DB migration? -> add the area:migration label

@hardillb hardillb added this to the 1.13 milestone Oct 17, 2023
@hardillb hardillb self-assigned this Oct 17, 2023
@hardillb hardillb marked this pull request as ready for review October 18, 2023 15:54
@hardillb hardillb requested a review from knolleary October 18, 2023 15:54
@knolleary knolleary merged commit 97c71cc into main Oct 19, 2023
@knolleary knolleary deleted the resource-monitoring branch October 19, 2023 08:20
@knolleary
Member

Sorry, hit merge a moment too late.

We need tests. sampleBuffer in particular.

        }
    })
} catch (err) {
    response.err = err.message

Member

How do these cases get handled by the sampleBuffer? As far as I can see, there's no filtering of them, so it will try to generate averages of the err property...

Given we know a locked-up Node-RED process will generate errors, this is a scenario we have to handle well.

Contributor Author

If the sample has an error it is not added to the sum, but it still counts towards the sample count, which will skew the avg down.

Will think about that some more
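
A quick illustration of that skew (numbers made up): errored samples contribute nothing to the sum but still count in the divisor, so the average is pulled towards zero.

```js
// Four good samples of cpu=50 plus one errored sample.
const samples = [
    { ts: 1, cpu: 50 }, { ts: 2, cpu: 50 }, { ts: 3, cpu: 50 },
    { ts: 4, cpu: 50 }, { ts: 5, err: 'timeout' }
]
const sum = samples.filter(s => !s.err).reduce((acc, s) => acc + s.cpu, 0) // 200
console.log(sum / samples.length)       // 40 - skewed low by the errored sample
console.log(sum / (samples.length - 1)) // 50 - what discounting skipped samples gives
```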

const result = {}
samples.forEach(sample => {
    for (const [key, value] of Object.entries(sample)) {
        if (key !== 'ts' && key !== 'err') {

Contributor Author

Here we skip samples with errors, which is going to skew the average down. We could remove 1 from the sample count for each skipped sample; given that the current sample averaging period is a lot longer than the unhealthy timeout, this should work out.


avgLastX (x) {
    const samples = this.lastX(x)
    const result = {}

Contributor Author

Suggested change
-    const result = {}
+    const result = {}
+    let skipped=0

Comment on lines +69 to +74
        }
    }
})
for (const [key, value] of Object.entries(result)) {
    result[key] = value/samples.length
}

Contributor Author

Suggested change
-        }
-    }
-})
-for (const [key, value] of Object.entries(result)) {
-    result[key] = value/samples.length
-}
+        } else {
+            skipped++
+        }
+    }
+})
+for (const [key, value] of Object.entries(result)) {
+    result[key] = value/(samples.length-skipped)
+}
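
Putting the two suggestions together, the whole method would look roughly like the sketch below. This is a reconstruction from the snippets in this thread, not the code as merged: the accumulation line and the per-sample err check are assumptions, and skipped is counted per errored sample so the divisor only reflects samples that actually contributed values.

```js
avgLastX (x) {
    const samples = this.lastX(x)
    const result = {}
    let skipped = 0
    samples.forEach(sample => {
        if (sample.err) {
            // Errored samples carry no usable metrics - count them so the
            // divisor below only covers samples that contributed values.
            skipped++
            return
        }
        for (const [key, value] of Object.entries(sample)) {
            if (key !== 'ts' && key !== 'err') {
                result[key] = (result[key] || 0) + value
            }
        }
    })
    const count = samples.length - skipped
    for (const [key, value] of Object.entries(result)) {
        result[key] = count > 0 ? value / count : 0
    }
    return result
}
```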
