Module id fail #5261

mattdurham · 2023-09-21T13:00:28Z

PR Description

This handles the case where a module is added with an invalid configuration (say, a module.file with an invalid config) and the configuration is later fixed. Because the first module failed without running, its ID was added to the registry in NewModule, but since Run was never called the ID was never cleaned up.
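
The failure mode can be reproduced with a minimal sketch. The `registry` and `module` types below are hypothetical stand-ins, not the agent's actual code; they assume only what the description states: an ID is registered when the module is created and released when Run returns.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// registry is a hypothetical stand-in for the module ID registry.
type registry struct {
	ids map[string]struct{}
}

func newRegistry() *registry {
	return &registry{ids: make(map[string]struct{})}
}

// module is a hypothetical module that holds its ID until run returns.
type module struct {
	id  string
	reg *registry
}

// newModule registers the ID immediately, mirroring the behavior described above.
func (r *registry) newModule(id string) (*module, error) {
	if _, ok := r.ids[id]; ok {
		return nil, fmt.Errorf("module ID %q already exists", id)
	}
	r.ids[id] = struct{}{}
	return &module{id: id, reg: r}, nil
}

// run releases the ID when it returns; if run is never called, the ID leaks.
func (m *module) run(ctx context.Context) {
	defer delete(m.reg.ids, m.id)
	<-ctx.Done()
}

// loadConfig simulates a config load that can fail before run is ever called.
func loadConfig(valid bool) error {
	if !valid {
		return errors.New("invalid config")
	}
	return nil
}

func main() {
	reg := newRegistry()

	// First attempt: the ID is registered, then the load fails, so run is never called.
	if _, err := reg.newModule("module.file.example"); err != nil {
		fmt.Println("unexpected:", err)
	}
	fmt.Println("first load:", loadConfig(false))

	// Second attempt with a fixed config: the stale ID blocks re-creation.
	_, err := reg.newModule("module.file.example")
	fmt.Println("second create:", err)
}
```

In this sketch the second `newModule` call fails even though the configuration would now be valid, because nothing ever removed the first registration.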

Which issue(s) this PR fixes

Closes #4702

Notes to the Reviewer

I am not a huge fan of this solution. I tried an approach that cleaned it up automatically, but since the error happens deep in the loader, by the time the error bubbled up to the parent module it was too late.

I also thought about and scaffolded a cleanup lifecycle step, but that felt like too large a change.

PR Checklist

CHANGELOG.md updated
Tests updated

erikbaranowski

LGTM, nice test case to validate this 💯

rfratto

I'm worried about this specific implementation too.

Ideally, the implementation of a component doesn't need to worry about previous iterations of that same component.

Does the module creation error still happen if ComponentNode is responsible for terminating/cleaning created modules on an evaluation failure or when the component shuts down?

mattdurham · 2023-09-21T18:42:05Z

The root cause is at https://github.com/grafana/agent/blob/main/pkg/flow/module.go#L48, which adds the path to the registry/modules and is called from https://github.com/grafana/agent/blob/main/component/module/file/file.go#L64. Update then fails since the file doesn't exist, so it returns nil for the component, and the code at https://github.com/grafana/agent/blob/main/pkg/flow/module.go#L137C1-L137C1 never gets called to remove it.

This may also be an outgrowth of the Run vs. Update debate over what each of them should do.

IMO the cleanest step would be adding a Cleanup lifecycle that is called regardless if Run is called for a component, though that would require always returning a component.

rfratto · 2023-09-21T18:59:30Z

IMO the cleanest step would be adding a Cleanup lifecycle that is called regardless if Run is called for a component, though that would require always returning a component.

Who would be the caller of this?

In my mind, ComponentNode has the most control over doing this transparently to component implementations:

ComponentNode has access to the module controller instance and can remove modules on behalf of a component.
ComponentNode knows when it's constructing a component, and if that construction fails.
ComponentNode knows when a component exits (i.e., when Run exits).

So, ComponentNode could unregister components using the module controller directly in one of these cases:

The initial component construction fails.
The Run method exits.

Neither of those would require access to the constructed component, so it would still allow for component constructors to return nil (which is also important behavior for #4411 to work properly).
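
The two cleanup points listed above can be sketched as follows. This is an illustration only: `componentNode`, `moduleController`, `build`, and `clearModuleIDs` here are simplified, hypothetical types and methods, not the controller's real ones. The point is that neither cleanup path needs the constructed component, so a constructor can still return nil.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidConfig is a hypothetical construction error used by this sketch.
var errInvalidConfig = errors.New("invalid config")

// moduleController is a hypothetical stand-in for the module controller a node can reach.
type moduleController struct {
	ids map[string]struct{}
}

// clearModuleIDs drops every module ID registered through this controller.
func (mc *moduleController) clearModuleIDs() {
	mc.ids = make(map[string]struct{})
}

// componentNode is a hypothetical, simplified node that cleans up on behalf of its component.
type componentNode struct {
	mc *moduleController
}

// build calls the constructor and clears module IDs if construction fails.
// The constructor may return a nil run function; cleanup does not need it.
func (cn *componentNode) build(construct func(*moduleController) (func(), error)) (func(), error) {
	run, err := construct(cn.mc)
	if err != nil {
		cn.mc.clearModuleIDs()
		return nil, err
	}
	return run, nil
}

// run executes the component and clears module IDs once it exits.
func (cn *componentNode) run(run func()) {
	defer cn.mc.clearModuleIDs()
	run()
}

func main() {
	cn := &componentNode{mc: &moduleController{ids: map[string]struct{}{}}}

	// A constructor that registers a module ID and then fails.
	_, err := cn.build(func(mc *moduleController) (func(), error) {
		mc.ids["module.file.example"] = struct{}{}
		return nil, errInvalidConfig
	})
	fmt.Println("build error:", err, "remaining IDs:", len(cn.mc.ids))

	// A constructor that succeeds; its IDs are cleared when run returns.
	run, _ := cn.build(func(mc *moduleController) (func(), error) {
		mc.ids["module.file.example"] = struct{}{}
		return func() {}, nil
	})
	cn.run(run)
	fmt.Println("remaining IDs after run:", len(cn.mc.ids))
}
```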

mattdurham · 2023-09-26T13:21:27Z

Sounds solid, though using the ComponentNode would mean we would need to check for an interface or flag, since it would make sense to move the removal and addition of module IDs to the component node so that only one thing controls the add/remove ID lifecycle, and it happens in one place.

mattdurham · 2023-09-26T13:23:13Z

Though after giving it a second thought, using ComponentNode breaks the moment we add support for loading multiple modules, or at least gets more complicated.

rfratto · 2023-09-26T23:50:40Z

Can you explain what you mean a little bit more? I'm confused about why multiple modules complicate the situation, and I'm not sure I'm following the "we would need to check for an interface or flag" bit either.

Maybe it's worth having a prototype of that approach so we can get on the same page?

mattdurham · 2023-10-02T12:59:18Z

Will post some pseudocode later in the week to go over what I am thinking.

mattdurham · 2023-10-05T14:46:38Z

Updated with a cleaner approach.

pkg/flow/internal/controller/node_component.go

pkg/flow/module.go

rfratto · 2023-10-10T15:57:32Z

pkg/flow/internal/controller/node_component.go

+	// Check for failure on initial loading; if it failed, we need to remove any modules the component may have created.
+	// In the case where the component is not a module loader this is a noop.
+	cn.moduleFailureCheck.Do(func() {
+		if err != nil {
+			cn.moduleController.ClearModuleIDs()
+		}
+	})


There's still a problem here: what if a component fails to be constructed more than once?

Rather than using a sync.Once, this check could be moved down to line 289 (i.e., if cn.reg.Build fails) so you only clear module IDs for components which can't be built, no matter how many times the construction fails.
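
The difference between the two approaches can be shown with a small sketch. The `node` type and its `evaluateOnce`/`evaluateEach` methods are hypothetical and only count how often a cleanup would run; they are not the controller's code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errBuild is a hypothetical build failure used by this sketch.
var errBuild = errors.New("build failed")

// node is a hypothetical, stripped-down component node.
type node struct {
	once    sync.Once
	cleared int // counts how many times module IDs would have been cleared
}

// evaluateOnce mirrors the sync.Once approach: the check runs only on the first call.
func (n *node) evaluateOnce(err error) {
	n.once.Do(func() {
		if err != nil {
			n.cleared++
		}
	})
}

// evaluateEach clears on every failed build, as suggested in the review.
func (n *node) evaluateEach(err error) {
	if err != nil {
		n.cleared++
	}
}

func main() {
	a, b := &node{}, &node{}

	// Simulate a component whose construction fails three times in a row.
	for i := 0; i < 3; i++ {
		a.evaluateOnce(errBuild)
		b.evaluateEach(errBuild)
	}

	fmt.Println("sync.Once cleanups:", a.cleared)
	fmt.Println("per-failure cleanups:", b.cleared)
}
```

With `sync.Once`, only the first evaluation is ever inspected, so a second or third failed construction leaves stale module IDs behind; checking at the point where the build fails covers every failure.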

rfratto · 2023-10-10T18:08:38Z

Given that we found two rounds of bugs in the changes to the controller, can we add more behavioral tests to the controller for the issues we've found? At least these cases:

Module cache must not persist between multiple instances of the same component.
Module cache must not persist when a component is terminated.

If you identify more behaviors around modules that should be verified, please add tests for them too.

The controller is one of our critical pieces of code, so it should be harder to introduce bugs than it currently is and warrants caution in cases where coverage is low.

mattdurham · 2023-10-11T20:21:49Z

After reviewing feedback and trying a few different approaches, I really wanted to avoid touching the controller/node_component/loader code. Instead I moved the ID check to the Run method. This means we have a slight delay before modules are registered/visible, but I haven't found an issue with that yet. I also think that is more representative of the running system.

Duplicate IDs don't happen anymore, but I left the check in. Duplicate registry metrics trigger first in the loader. This also means the removal and insertion of IDs are within the same code block, which seems simpler. Granted, a module loader could now queue multiple IDs of the same sort, but no loader loads multiple IDs at the moment. When we add support for loading multiple IDs, it would fall upon the individual module loader itself to ensure that duplicates aren't loaded, with the registry duplicate check being the catch-all. Ideally this would be a different error, since it's not pointing out the exact problem.
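
A rough sketch of keying registration off Run, under stated assumptions: the `parent` and `module` types, the `count` helper, and the `errDuplicateID` sentinel are hypothetical names for illustration and do not reproduce the code in pkg/flow/module.go.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// errDuplicateID is a hypothetical sentinel for the duplicate check that was left in.
var errDuplicateID = errors.New("duplicate module ID")

// parent is a hypothetical owner of the set of running module IDs.
type parent struct {
	mut sync.Mutex
	ids map[string]struct{}
}

// addModule registers an ID, rejecting duplicates.
func (p *parent) addModule(id string) error {
	p.mut.Lock()
	defer p.mut.Unlock()
	if _, ok := p.ids[id]; ok {
		return errDuplicateID
	}
	p.ids[id] = struct{}{}
	return nil
}

// removeModule releases an ID.
func (p *parent) removeModule(id string) {
	p.mut.Lock()
	defer p.mut.Unlock()
	delete(p.ids, id)
}

// count reports how many IDs are currently registered.
func (p *parent) count() int {
	p.mut.Lock()
	defer p.mut.Unlock()
	return len(p.ids)
}

// module registers its ID only while Run is executing.
type module struct {
	id string
	p  *parent
}

// Run adds the ID on entry and removes it on exit, so both live in one code block.
func (m *module) Run(ctx context.Context) error {
	if err := m.p.addModule(m.id); err != nil {
		return err
	}
	defer m.p.removeModule(m.id)
	<-ctx.Done()
	return nil
}

func main() {
	p := &parent{ids: map[string]struct{}{}}
	m := &module{id: "module.file.example", p: p}

	// Creating the module registers nothing, so a failed load leaves no stale ID.
	fmt.Println("IDs before Run:", p.count())

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // cancel up front so Run returns immediately in this sketch
	fmt.Println("Run error:", m.Run(ctx))
	fmt.Println("IDs after Run:", p.count())
}
```

The trade-off in this shape is the one noted above: an ID is not visible until Run starts, in exchange for never leaving an ID behind when Run is never reached.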

Going to set this as WIP while I poke at it a bit more in the morning.

rfratto

LGTM, some remarks (that don't need to be addressed) and one that needs an answer, but let's get this merged!

rfratto · 2023-10-16T19:29:22Z

component/registry.go

@@ -49,7 +49,7 @@ type Module interface {
 	//
 	// Run blocks until the provided context is canceled. The ID of a module as defined in
 	// ModuleController.NewModule will not be released until Run returns.
-	Run(context.Context)
+	Run(context.Context) error


Just as an aside: this would be considered a breaking change to the API, which reinforces the idea for me that everything should be moved to internal for 1.0 until we're ready to start exposing parts of our code as stable APIs.

Agreed, this is mostly for the tests so we can check specific error conditions.

rfratto · 2023-10-16T19:32:03Z

pkg/flow/module.go

+	if err := c.o.parent.addModule(c); err != nil {
+		return err
+	}
+	defer c.o.parent.removeModule(c)


We should get this merged, but it's standing out as potentially concerning to me that this changes things such that the lifetime of a component and module are now different: a component exists within the controller whenever it's defined in a file, even if it's not running, but a module doesn't exist until it starts running.

I'm not sure if this will introduce any problems, so let's keep an eye for related issues once this is merged.

IMO this feels alright, the component is the parent of module(s), so the component life span should be greater than the modules it controls. Def something to keep an eye on.

pkg/flow/module_fail_test.go

rfratto · 2023-10-16T19:34:18Z

At least these cases:

Module cache must not persist between multiple instances of the same component.
Module cache must not persist when a component is terminated.

I looked through the tests but it wasn't obvious if these cases were being covered. If they're not, can you add tests for them?

mattdurham · 2023-10-16T20:39:22Z

Added additional checks for the two use cases above.

mattdurham added 7 commits September 17, 2023 11:50

push changes

6db453b

Add remove whenever using the component module.

a2c2755

Add additional context and remove dead file.

13c99ca

Add long changelog comment

dfa0b77

merge main

01552ce

fix linter

f8f9930

Remove mutex check

8ceafb2

mattdurham marked this pull request as ready for review September 21, 2023 13:36

mattdurham requested a review from a team as a code owner September 21, 2023 13:36

mattdurham requested review from rfratto and erikbaranowski September 21, 2023 13:36

erikbaranowski approved these changes Sep 21, 2023

rfratto reviewed Sep 21, 2023

mattdurham added 4 commits October 5, 2023 10:15

Add changes to support module id removal.

54b238c

Remove unneeded line

91c0abe

main merge

f7caffe

Fix merge errors.

cb8d986

mattdurham requested a review from rfratto October 5, 2023 14:46

rfratto reviewed Oct 5, 2023

mattdurham added 4 commits October 10, 2023 09:43

Ensure module is checked on first run.

92cb1c1

rename and add comments

f9e89e0

Add manual removal back in and make test closer to actual usage.

6d28509

Merge branch 'main' into module_id_fail

a96a32e

move changelog comment to correct location

6c44b21

rfratto reviewed Oct 10, 2023

A different approach by keying off run instead of build.

db23876

mattdurham changed the title ~~Module id fail~~ WIP: Module id fail Oct 11, 2023

mattdurham added 2 commits October 11, 2023 16:30

Add test for duplicate registration.

a9d509d

Minor changes

e81a2ef

mattdurham changed the title ~~WIP: Module id fail~~ Module id fail Oct 12, 2023

Merge branch 'main' into module_id_fail

e5588be

mattdurham requested review from erikbaranowski and rfratto October 12, 2023 20:18

rfratto approved these changes Oct 16, 2023

mattdurham added 2 commits October 16, 2023 16:37

PR feedback

e69db8c

Merge remote-tracking branch 'origin/module_id_fail' into module_id_fail

acaae50

mattdurham added 2 commits October 16, 2023 17:13

add locks around the reads for tests, its a bit hacky.

77777b0

Merge branch 'main' into module_id_fail

9b9f7ad

mattdurham merged commit 473938a into main Oct 16, 2023
7 checks passed

mattdurham deleted the module_id_fail branch October 16, 2023 21:26

github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 21, 2024

github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024
