-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Module id fail #5261
Module id fail #5261
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice test case to validate this 💯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried about this specific implementation too.
Ideally, the implementation of a component doesn't need to worry about previous iteration of that same component.
Does the module creation error still happen if ComponentNode is responsible for terminating/cleaning created modules on an evaluation failure or when the component shuts down?
The root is at https://github.com/grafana/agent/blob/main/pkg/flow/module.go#L48 , that adds the path to the registry/modules which is called from https://github.com/grafana/agent/blob/main/component/module/file/file.go#L64 then Update fails since the file doesn't exist so it returns nil for the component and the code at https://github.com/grafana/agent/blob/main/pkg/flow/module.go#L137C1-L137C1 never gets called to remove it. This may also be an outgrowth of Run vs Update debate of what we should do in either. IMO the cleanest step would be adding a Cleanup lifecycle that is called regardless if Run is called for a component, though that would require always returning a component. |
Who would be the caller of this? In my mind, ComponentNode has the most control over doing this transparently to component implementations:
So, ComponentNode could unregister components using the module controller directly in one of these cases:
Neither of those would require access to the constructed component, so it would still allow for component constructors to return |
Sounds solid though using the ComponentNode would mean we would need to check for an interface or flag since it would make sense to move the Removal and Addition of module ids to the component node so only one thing is controlling the add/remove id lifecycle and it is one place. |
Though after giving it a second thought using ComponentNode breaks the moment we add multiple modules being loaded. Or at least gets more complicated. |
Can you explain what you mean a little bit more? I'm confused by why multiple modules complicates the situation, and I'm not sure I'm following the "we would need to check for an interface or flag" bit either. Maybe it's worth having a prototype of that approach so we can get on the same page? |
Will post up some psuedocode later in the week to go over what I am thinking. |
Updated with a cleaner approach. |
// Check for failure on initial loading, if not then we need to remove any modules it the component may habe created. | ||
// In the case where the component is not a module loader this is a noop. | ||
cn.moduleFailureCheck.Do(func() { | ||
if err != nil { | ||
cn.moduleController.ClearModuleIDs() | ||
} | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still a problem here: what if a component fails to be constructed more than once?
Rather than using a sync.Once, this check could be moved down to line 289 (i.e., if cn.reg.Build
fails) so you only clear module IDs for components which can't be built, no matter how many times the construction fails.
Given that we found two rounds of bugs in the changes to the controller, can we add more behavioral tests to the controller for the issues we've found? At least these cases:
If you identify more behaviors around modules that should be verified, please add tests for them too. The controller is one of our critical pieces of code, so it should be harder to introduce bugs than it currently is and warrants caution in cases where coverage is low. |
After reviewing feedback and trying a few different approaches, I really wanted to avoid touching the controller/node_component/loader code. Instead I moved the id check to the Run method. This means we have a slight delay on modules being registered/visible but I havent found an issue with that yet. I also think that is more representative of the running system. Duplicate IDs don't happen anymore but left the check in. Duplicate registry metrics trigger first in the loader. This also means the removal and insertion of IDs are within the same code block which seems simpler. Granted a module loader could now queue multiple is ids of the same sort but no loader loads multiple ids at the moment. When we add support for loading multiple ids, that would fall upon the individual module loader itself to ensure that aren't loader. With the registry duplicate being the catch all. Ideally this would be a different error since its not pointing out the exact problem. Going to set this as WIP while I poke at it a bit more in the morning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, some remarks (that don't need to be addressed) and one that needs an answer, but let's get this merged!
@@ -49,7 +49,7 @@ type Module interface { | |||
// | |||
// Run blocks until the provided context is canceled. The ID of a module as defined in | |||
// ModuleController.NewModule will not be released until Run returns. | |||
Run(context.Context) | |||
Run(context.Context) error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as an aside: this would be considered a breaking change to the API, which is re-enforcing the idea for me that everything should be moved to internal for 1.0 until we're ready to start exposing parts of our code as stable APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this is mostly for the tests so we can check specific error conditions.
if err := c.o.parent.addModule(c); err != nil { | ||
return err | ||
} | ||
defer c.o.parent.removeModule(c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should get this merged, but it's standing out as potentially concerning to me that this changes things such that the lifetime of a component and module are now different: a component exists within the controller whenever it's defined in a file, even if it's not running, but a module doesn't exist until it starts running.
I'm not sure if this will introduce any problems, so let's keep an eye for related issues once this is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this feels alright, the component is the parent of module(s), so the component life span should be greater than the modules it controls. Def something to keep an eye on.
I looked through the tests but it wasn't obvious if these cases were being covered. If they're not, can you add tests for them? |
Added additional checks for the two use cases above. |
PR Description
This handles the use case where a module gets added but with an invalid configuration, lets say a module.file with invalid config. Then gets resolved. Since the first module failed without running, the ID was added to the registry on NewModule but since Run was never called it never gets cleaned up.
Which issue(s) this PR fixes
Closes #4702
Notes to the Reviewer
I am not a huge fan of this solution. I tried an approach to automatically clean it up but since the error happens deep in the loader by the time the error bubbled up to the parent module it was to late.
I also thought about and scaffolded a cleanup lifecycle step but that felt to large of a change.
PR Checklist