Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Module id fail #5261

Merged
merged 24 commits into from
Oct 16, 2023
Merged

Module id fail #5261

merged 24 commits into from
Oct 16, 2023

Conversation

mattdurham
Copy link
Collaborator

@mattdurham mattdurham commented Sep 21, 2023

PR Description

This handles the use case where a module gets added but with an invalid configuration, lets say a module.file with invalid config. Then gets resolved. Since the first module failed without running, the ID was added to the registry on NewModule but since Run was never called it never gets cleaned up.

Which issue(s) this PR fixes

Closes #4702

Notes to the Reviewer

I am not a huge fan of this solution. I tried an approach to automatically clean it up but since the error happens deep in the loader by the time the error bubbled up to the parent module it was to late.

I also thought about and scaffolded a cleanup lifecycle step but that felt to large of a change.

PR Checklist

  • CHANGELOG.md updated
  • Tests updated

@mattdurham mattdurham marked this pull request as ready for review September 21, 2023 13:36
@mattdurham mattdurham requested a review from a team as a code owner September 21, 2023 13:36
Copy link
Contributor

@erikbaranowski erikbaranowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice test case to validate this 💯

Copy link
Member

@rfratto rfratto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about this specific implementation too.

Ideally, the implementation of a component doesn't need to worry about previous iteration of that same component.

Does the module creation error still happen if ComponentNode is responsible for terminating/cleaning created modules on an evaluation failure or when the component shuts down?

@mattdurham
Copy link
Collaborator Author

The root is at https://github.com/grafana/agent/blob/main/pkg/flow/module.go#L48 , that adds the path to the registry/modules which is called from https://github.com/grafana/agent/blob/main/component/module/file/file.go#L64 then Update fails since the file doesn't exist so it returns nil for the component and the code at https://github.com/grafana/agent/blob/main/pkg/flow/module.go#L137C1-L137C1 never gets called to remove it.

This may also be an outgrowth of Run vs Update debate of what we should do in either.

IMO the cleanest step would be adding a Cleanup lifecycle that is called regardless if Run is called for a component, though that would require always returning a component.

@rfratto
Copy link
Member

rfratto commented Sep 21, 2023

IMO the cleanest step would be adding a Cleanup lifecycle that is called regardless if Run is called for a component, though that would require always returning a component.

Who would be the caller of this?

In my mind, ComponentNode has the most control over doing this transparently to component implementations:

  1. ComponentNode has access to the module controller instance and can remove modules on behalf of a component.
  2. ComponentNode knows when it's constructing a component, and if that construction fails.
  3. ComponentNode knows when a component exits (i.e., when Run exits).

So, ComponentNode could unregister components using the module controller directly in one of these cases:

  1. The initial component construction fails.
  2. The Run method exits.

Neither of those would require access to the constructed component, so it would still allow for component constructors to return nil (which is also important behavior for #4411 to work properly).

@mattdurham
Copy link
Collaborator Author

Sounds solid though using the ComponentNode would mean we would need to check for an interface or flag since it would make sense to move the Removal and Addition of module ids to the component node so only one thing is controlling the add/remove id lifecycle and it is one place.

@mattdurham
Copy link
Collaborator Author

mattdurham commented Sep 26, 2023

Though after giving it a second thought using ComponentNode breaks the moment we add multiple modules being loaded. Or at least gets more complicated.

@rfratto
Copy link
Member

rfratto commented Sep 26, 2023

Can you explain what you mean a little bit more? I'm confused by why multiple modules complicates the situation, and I'm not sure I'm following the "we would need to check for an interface or flag" bit either.

Maybe it's worth having a prototype of that approach so we can get on the same page?

@mattdurham
Copy link
Collaborator Author

Will post up some psuedocode later in the week to go over what I am thinking.

@mattdurham
Copy link
Collaborator Author

Updated with a cleaner approach.

@mattdurham mattdurham requested a review from rfratto October 5, 2023 14:46
pkg/flow/internal/controller/node_component.go Outdated Show resolved Hide resolved
pkg/flow/internal/controller/node_component.go Outdated Show resolved Hide resolved
pkg/flow/module.go Show resolved Hide resolved
Comment on lines 261 to 267
// Check for failure on initial loading, if not then we need to remove any modules it the component may habe created.
// In the case where the component is not a module loader this is a noop.
cn.moduleFailureCheck.Do(func() {
if err != nil {
cn.moduleController.ClearModuleIDs()
}
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still a problem here: what if a component fails to be constructed more than once?

Rather than using a sync.Once, this check could be moved down to line 289 (i.e., if cn.reg.Build fails) so you only clear module IDs for components which can't be built, no matter how many times the construction fails.

@rfratto
Copy link
Member

rfratto commented Oct 10, 2023

Given that we found two rounds of bugs in the changes to the controller, can we add more behavioral tests to the controller for the issues we've found? At least these cases:

  • Module cache must not persist between multiple instances of the same component.
  • Module cache must not persist when a component is terminated.

If you identify more behaviors around modules that should be verified, please add tests for them too.

The controller is one of our critical pieces of code, so it should be harder to introduce bugs than it currently is and warrants caution in cases where coverage is low.

@mattdurham
Copy link
Collaborator Author

After reviewing feedback and trying a few different approaches, I really wanted to avoid touching the controller/node_component/loader code. Instead I moved the id check to the Run method. This means we have a slight delay on modules being registered/visible but I havent found an issue with that yet. I also think that is more representative of the running system.

Duplicate IDs don't happen anymore but left the check in. Duplicate registry metrics trigger first in the loader. This also means the removal and insertion of IDs are within the same code block which seems simpler. Granted a module loader could now queue multiple is ids of the same sort but no loader loads multiple ids at the moment. When we add support for loading multiple ids, that would fall upon the individual module loader itself to ensure that aren't loader. With the registry duplicate being the catch all. Ideally this would be a different error since its not pointing out the exact problem.

Going to set this as WIP while I poke at it a bit more in the morning.

@mattdurham mattdurham changed the title Module id fail WIP: Module id fail Oct 11, 2023
@mattdurham mattdurham changed the title WIP: Module id fail Module id fail Oct 12, 2023
Copy link
Member

@rfratto rfratto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, some remarks (that don't need to be addressed) and one that needs an answer, but let's get this merged!

@@ -49,7 +49,7 @@ type Module interface {
//
// Run blocks until the provided context is canceled. The ID of a module as defined in
// ModuleController.NewModule will not be released until Run returns.
Run(context.Context)
Run(context.Context) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as an aside: this would be considered a breaking change to the API, which is re-enforcing the idea for me that everything should be moved to internal for 1.0 until we're ready to start exposing parts of our code as stable APIs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this is mostly for the tests so we can check specific error conditions.

Comment on lines +144 to +147
if err := c.o.parent.addModule(c); err != nil {
return err
}
defer c.o.parent.removeModule(c)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should get this merged, but it's standing out as potentially concerning to me that this changes things such that the lifetime of a component and module are now different: a component exists within the controller whenever it's defined in a file, even if it's not running, but a module doesn't exist until it starts running.

I'm not sure if this will introduce any problems, so let's keep an eye for related issues once this is merged.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this feels alright, the component is the parent of module(s), so the component life span should be greater than the modules it controls. Def something to keep an eye on.

pkg/flow/module_fail_test.go Outdated Show resolved Hide resolved
@rfratto
Copy link
Member

rfratto commented Oct 16, 2023

At least these cases:

Module cache must not persist between multiple instances of the same component.
Module cache must not persist when a component is terminated.

I looked through the tests but it wasn't obvious if these cases were being covered. If they're not, can you add tests for them?

@mattdurham
Copy link
Collaborator Author

Added additional checks for the two use cases above.

@mattdurham mattdurham merged commit 473938a into main Oct 16, 2023
7 checks passed
@mattdurham mattdurham deleted the module_id_fail branch October 16, 2023 21:26
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Module ID registration failure results in permanent error for that Module ID until restart
3 participants