Resource monitoring and alerts #176

Merged
merged 9 commits into main from resource-monitoring on Oct 19, 2023

Conversation

hardillb
Contributor

part of FlowFuse/flowfuse#2755

Description

Adds a Node-RED plugin that exposes a Prometheus metrics endpoint.

The nr-launcher then scrapes that endpoint to generate resource usage stats.

Alerts will be generated if the configured thresholds are exceeded.
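
As a rough sketch of the scrape-and-alert flow described above (the endpoint path, metric name, threshold and interval below are illustrative assumptions, not values taken from this PR):

```js
// Hypothetical launcher-side scrape loop. METRICS_URL, the metric name and
// the threshold are assumptions for illustration only.
const METRICS_URL = 'http://localhost:1880/metrics'
const THRESHOLDS = { nodejs_heap_size_used_bytes: 200 * 1024 * 1024 }

// Parse simple (label-free) lines of the Prometheus text exposition format
// into an object of { metricName: numericValue }.
function parseMetrics (body) {
    const values = {}
    for (const line of body.split('\n')) {
        if (!line || line.startsWith('#')) continue
        const [name, value] = line.trim().split(/\s+/)
        values[name] = Number(value)
    }
    return values
}

async function scrapeOnce () {
    const res = await fetch(METRICS_URL) // global fetch requires Node.js 18+
    const metrics = parseMetrics(await res.text())
    for (const [name, limit] of Object.entries(THRESHOLDS)) {
        if (metrics[name] > limit) {
            console.warn(`ALERT: ${name}=${metrics[name]} exceeds ${limit}`)
        }
    }
}

setInterval(() => scrapeOnce().catch(err => console.error('scrape failed:', err.message)), 5000)
```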

Related Issue(s)

FlowFuse/flowfuse#2755

Checklist

  • I have read the contribution guidelines
  • Suitable unit/system level tests have been added and they pass
  • Documentation has been updated
    • Upgrade instructions
    • Configuration details
    • Concepts
  • Changes flowforge.yml?
    • Issue/PR raised on FlowFuse/helm to update ConfigMap Template
    • Issue/PR raised on FlowFuse/CloudProject to update values for Staging/Production

Labels

  • Backport needed? -> add the backport label
  • Includes a DB migration? -> add the area:migration label

@hardillb hardillb added this to the 1.13 milestone Oct 17, 2023
@hardillb hardillb self-assigned this Oct 17, 2023
@hardillb hardillb marked this pull request as ready for review October 18, 2023 15:54
@hardillb hardillb requested a review from knolleary October 18, 2023 15:54
@knolleary knolleary merged commit 97c71cc into main Oct 19, 2023
@knolleary knolleary deleted the resource-monitoring branch October 19, 2023 08:20
@knolleary
Member

Sorry, hit merge a moment too late.

We need tests. sampleBuffer in particular.

        }
    })
} catch (err) {
    response.err = err.message

Member

How do these cases get handled by the sampleBuffer? As far as I can see, there's no filtering of them, so it will try to generate averages of the err property...

Given we know a locked-up Node-RED process will generate errors, this is a scenario we have to handle well.

Contributor Author

If the sample has an error it is not added to the sum, but it still counts towards the sample count, which will skew the avg down.

Will think about that some more
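
A quick illustration of that skew (numbers made up): errored samples contribute nothing to the sum but still count in the divisor, so the average is pulled towards zero.

```js
// Four good samples of cpu=50 plus one errored sample.
const samples = [
    { ts: 1, cpu: 50 }, { ts: 2, cpu: 50 }, { ts: 3, cpu: 50 },
    { ts: 4, cpu: 50 }, { ts: 5, err: 'timeout' }
]
const sum = samples.filter(s => !s.err).reduce((acc, s) => acc + s.cpu, 0) // 200
console.log(sum / samples.length)       // 40 - skewed low by the errored sample
console.log(sum / (samples.length - 1)) // 50 - what discounting skipped samples gives
```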

const result = {}
samples.forEach(sample => {
    for (const [key, value] of Object.entries(sample)) {
        if (key !== 'ts' && key !== 'err') {

Contributor Author

Here we skip samples with errors, which is going to skew the average down. We could remove 1 from the sample count for each skipped sample; given that the current sample averaging period is a lot longer than the unhealthy timeout, this should work out.


avgLastX (x) {
    const samples = this.lastX(x)
    const result = {}

Contributor Author

Suggested change
-    const result = {}
+    const result = {}
+    let skipped=0

Comment on lines +69 to +74
        }
    }
})
for (const [key, value] of Object.entries(result)) {
    result[key] = value/samples.length
}

Contributor Author

Suggested change
-        }
-    }
-})
-for (const [key, value] of Object.entries(result)) {
-    result[key] = value/samples.length
-}
+        } else {
+            skipped++
+        }
+    }
+})
+for (const [key, value] of Object.entries(result)) {
+    result[key] = value/(samples.length-skipped)
+}
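
Putting the two suggestions together, the whole method would look roughly like the sketch below. This is a reconstruction from the snippets in this thread, not the code as merged: the accumulation line and the per-sample err check are assumptions, and skipped is counted per errored sample so the divisor only reflects samples that actually contributed values.

```js
avgLastX (x) {
    const samples = this.lastX(x)
    const result = {}
    let skipped = 0
    samples.forEach(sample => {
        if (sample.err) {
            // Errored samples carry no usable metrics - count them so the
            // divisor below only covers samples that contributed values.
            skipped++
            return
        }
        for (const [key, value] of Object.entries(sample)) {
            if (key !== 'ts' && key !== 'err') {
                result[key] = (result[key] || 0) + value
            }
        }
    })
    const count = samples.length - skipped
    for (const [key, value] of Object.entries(result)) {
        result[key] = count > 0 ? value / count : 0
    }
    return result
}
```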
