feat: autorecover from stuck situations #354

Open
wants to merge 5 commits into base: main

Conversation

gerrnot commented Jul 12, 2024

Since there was no veto on the helm mailing list, and there was one piece of positive feedback from a Rancher representative (see references) among the many likes in various related bug threads, I am submitting this new HIP...

lindhe commented Jul 24, 2024

Hey, I just want to say that you are awesome for driving this HIP! 💪 I think there is a dire need for this. I've seen many people lose trust in Helm because of things getting stuck irrecoverably; it would be so nice to redeem Helm. Also, this will be of great value for all the tools that depend on Helm (like Rancher).

I'll read your document and put down some comments. On a first read, it's just formatting things or typos, I'm guessing. I might contribute something better once I've digested your suggestion.

Tell me if I can help!

lindhe commented Jul 24, 2024

> The --timeout parameter gets a deeper meaning.
> Previously the --timeout parameter only had an effect on the helm process running on the respective client.
> After implementation, the --timeout parameter will be stored in the helm release object (secret) in k8s and
> have an indirect impact on possible parallel processes.

I think it is really clever reusing the --timeout flag for this purpose! Nice.
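For concreteness, a minimal sketch (in Go; the field names are hypothetical, mirroring the LOCKED TILL and SESSION ID columns proposed later in this thread) of the lock state that would be serialized into the release secret:

```go
package autorecover // hypothetical package name

import "time"

// lockInfo sketches the lock state that would live inside the Helm release
// secret alongside the rest of the release state.
type lockInfo struct {
	// LockedTill = operation start + the --timeout value; a parallel helm
	// process can tell from this when a lock has expired and may take over.
	LockedTill time.Time `json:"lockedTill"`
	// SessionID identifies the helm process currently holding the lock.
	SessionID string `json:"sessionId"`
}

// expired reports whether another session may take over the lock.
func (l lockInfo) expired(now time.Time) bool {
	return now.After(l.LockedTill)
}
```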


> `helm ls -a` shows two new columns, regular `helm ls` does NOT show those:

I believe that would not be fully backwards compatible. Other programs that depend on Helm may parse the output from helm ls -a and expect the number of columns to be unchanged.

Not sure how bad that is, though. Every change is a breaking change to someone, and I think this seems like a reasonable change. And parsing the default (table) output is not robust anyway; if they do parse the output, they should probably use -o json or -o yaml instead.
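For illustration, a minimal sketch of a consumer that parses the JSON output instead of the table (field names as emitted by helm's JSON output), so added columns cannot break it:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// listedRelease maps only the fields of `helm ls -o json` this consumer
// cares about; extra table columns have no effect on it.
type listedRelease struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
	Status    string `json:"status"`
}

func main() {
	out, err := exec.Command("helm", "ls", "-a", "-o", "json").Output()
	if err != nil {
		panic(err)
	}
	var releases []listedRelease
	if err := json.Unmarshal(out, &releases); err != nil {
		panic(err)
	}
	for _, r := range releases {
		fmt.Printf("%s/%s: %s\n", r.Namespace, r.Name, r.Status)
	}
}
```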

gerrnot (Author)

I am very open to changing the behaviour regarding -a, since -a traditionally means all releases (in all states), rather than the columns displayed.

Maybe rather tie it to the existing flag --debug or add a new flag? I actually would prefer that; maybe call it --concurrency? It's really just that most users will never need to see it...

@lindhe : What do you think?

Sidenote: There is also the --short flag, but that is false by default, so we cannot use that, and having a long flag could lead to funny situations, e.g. specifying both short and long...

lindhe

I'm indifferent about changing it or not. If this feature can get added with the modification to -a, I think that's a good change. But if that causes any roadblocks, I do not think it's worth it.

I think adding a new flag sounds a bit convoluted. If it cannot be part of -a by default, then it might make sense to tie into --debug like you suggest. But I would like to get some input from Helm maintainers before going too far with the specifics of that implementation. If they agree that it is OK to introduce a new column by default, that's probably the cleanest solution.

lindhe commented Jul 24, 2024

I think this HIP has an additional benefit that should probably be mentioned in the document: it should make multi-user scenarios more robust in general, not just fix the situations where Helm gets stuck. What happens today if two clients run similar commands at the same time? If there is no mechanism to handle that, I believe this HIP would be a great addition to handle those scenarios gracefully.

lindhe commented Jul 24, 2024

I think you are on the right track with your implementation, trying to keep this lock inside the Helm release object. That way, we do not introduce new objects to the Helm ecosystem, which is a great thing.

That said, I just want to mention that there is a Lease object in Kubernetes. If we were looking for alternatives, that might be a sensible thing to use.
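For illustration, a minimal sketch (using client-go; the lock-naming convention is a hypothetical choice) of taking a per-release lock via such a Lease:

```go
package autorecover // hypothetical package name

import (
	"context"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// acquireReleaseLock tries to create a Lease named after the release; if it
// already exists, another session holds the lock and Create fails with an
// AlreadyExists error.
func acquireReleaseLock(ctx context.Context, cs kubernetes.Interface, ns, releaseName, sessionID string, ttl time.Duration) error {
	seconds := int32(ttl.Seconds())
	now := metav1.NewMicroTime(time.Now())
	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "helm-lock-" + releaseName, // hypothetical naming convention
			Namespace: ns,
		},
		Spec: coordinationv1.LeaseSpec{
			HolderIdentity:       &sessionID,
			LeaseDurationSeconds: &seconds,
			AcquireTime:          &now,
		},
	}
	_, err := cs.CoordinationV1().Leases(ns).Create(ctx, lease, metav1.CreateOptions{})
	return err
}
```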

gerrnot (Author) commented Jul 24, 2024

> I think this HIP has an additional benefit that should probably be mentioned in the document: it should make multi-user scenarios more robust in general, not just fix the situations where Helm gets stuck. What happens today if two clients run similar commands at the same time? If there is no mechanism to handle that, I believe this HIP would be a great addition to handle those scenarios gracefully.

There is already a mechanism to detect concurrency, since a certain helm3 version. It is based on the "pending" states in the release object. While implemented with good intentions, this has the consequence that one cannot recover anymore in bad cases. So I would not mention this in addition. Nevertheless, if you feel that additional benefits could be mentioned, feel free to suggest an edit again; I will happily merge it if it increases the chances of the HIP.
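For reference, a minimal sketch (assuming Helm v3's release package) of what that existing mechanism keys on: a release left in one of the pending states blocks further operations, with no way to recover today short of manual intervention:

```go
package main

import (
	"fmt"

	"helm.sh/helm/v3/pkg/release"
)

// isStuck reports whether a release sits in one of the pending states
// (pending-install, pending-upgrade, pending-rollback) that block any new
// operation once the process that set them has died.
func isStuck(rel *release.Release) bool {
	return rel.Info.Status.IsPending()
}

func main() {
	rel := &release.Release{Info: &release.Info{Status: release.StatusPendingUpgrade}}
	fmt.Println("stuck:", isStuck(rel))
}
```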


One more thing to discuss that I realized during the implementation:
Helm always uses the client time, e.g. for the creation of the UPDATED field etc.

If I now introduced the k8s server time, it could lead to weird timestamp deviations between other helm timestamp fields and the ones used for locking. While I am still convinced that server time is better, this would make most sense only if it were refactored within the entire codebase, which I guess is out of scope and would require an independent HIP. What do you think?
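As an aside, a minimal sketch (an illustration, not part of the HIP) of how such a deviation could be observed: the API server assigns creationTimestamp server-side, so comparing it with the client clock around object creation gives a coarse skew-plus-latency estimate:

```go
package autorecover // hypothetical package name

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// estimateClockSkew creates a throwaway ConfigMap and compares the
// server-assigned creationTimestamp (second resolution) with the local
// clock; the result also includes network latency, so it is only coarse.
func estimateClockSkew(ctx context.Context, cs kubernetes.Interface, ns string) (time.Duration, error) {
	probe := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{GenerateName: "clock-probe-"}}
	before := time.Now()
	created, err := cs.CoreV1().ConfigMaps(ns).Create(ctx, probe, metav1.CreateOptions{})
	if err != nil {
		return 0, err
	}
	defer cs.CoreV1().ConfigMaps(ns).Delete(ctx, created.Name, metav1.DeleteOptions{})
	return created.CreationTimestamp.Time.Sub(before), nil
}
```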

kfox1111 commented

Kubernetes has distributed locking primitives, I think? Couldn't one of those be used?

gerrnot (Author) commented Jul 24, 2024

@kfox1111 I assume you mean the Lease object as suggested by @lindhe.

Yes, that could be used; it is just that the release state would then be distributed:

  1. the traditional helm release secret
  2. the lock state in k8s

Since I consider the locked-state an attribute of the release state, the question is whether it is worth storing it somewhere else...

Nevertheless, it could improve backwards compatibility if we see it as something that is honored if present and ignored otherwise.

lindhe commented Jul 30, 2024

> One more thing to discuss that I realized during the implementation: Helm always uses the client time, e.g. for the creation of the UPDATED field etc.
>
> If I now introduced the k8s server time, it could lead to weird timestamp deviations between other helm timestamp fields and the ones used for locking. While I am still convinced that server time is better, this would make most sense only if it were refactored within the entire codebase, which I guess is out of scope and would require an independent HIP. What do you think?

Sounds like a bummer. I agree 1000% that server time is superior. I would go so far as to say that it is the only sensible thing to use. But I also agree that it probably belongs to another PR (unfortunately).

gerrnot (Author) commented Aug 6, 2024

original HIP draft - storing lock in k8s secret

  • A backwards compatible preview version is published as a docker image (platform linux/amd64) for use at one's own risk:
    gernotfeichter/helm-autorecover:0.0.3 (based on v3.16)
  • Clarified why client time is used for LOCKED TILL.

possible change of storing the lock as a native k8s lock

If agreement is reached that such a solution is better overall, I can invest the effort of rewriting the HIP and the draft code.

benefits

  • helm ls could be left as is:
    • no breaking change in the output
    • discussions about which flag should show the new fields LOCKED TILL and SESSION ID become obsolete.
  • standardisation is always good (good library support for working with said object)
  • visibility/accessibility: the Lease object can be inspected by any user with sufficient permissions; not even a helm binary is required for that.

drawbacks

  • The release state is split (one part is stored in a k8s secret, another part, at least temporarily, is in a Lease object).
    • Codewise: additional libraries need to be added, and additional calls to k8s are needed to perform CRUD operations on the Lease object.
    • Userwise: the user needs additional knowledge when debugging concurrency issues.
  • The user under which the helm operations are performed requires CRUD permissions on the Lease object, which is a potential breaking change. We could think of a fallback, though: if there are insufficient privileges for the Lease, log a warning and continue (see the sketch below). But ignoring the lock could introduce further risks/instabilities due to concurrency issues, which was not the goal here. The goal was to keep concurrent-modification protection while also allowing escaping stuck situations.
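A minimal sketch (hypothetical helper, using client-go) of that fallback: on a Forbidden error the lock step degrades to a warning instead of failing the release operation:

```go
package autorecover // hypothetical package name

import (
	"context"
	"log"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryLock attempts to create the lock Lease. On a Forbidden error it logs a
// warning and reports that the operation should proceed without the lock;
// any other error (including AlreadyExists, i.e. a concurrent session) is
// returned to the caller.
func tryLock(ctx context.Context, cs kubernetes.Interface, ns, name string) (locked bool, err error) {
	lease := &coordinationv1.Lease{ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns}}
	if _, err := cs.CoordinationV1().Leases(ns).Create(ctx, lease, metav1.CreateOptions{}); err != nil {
		if apierrors.IsForbidden(err) {
			log.Printf("warning: no permission to manage Leases in %q; proceeding without concurrency protection", ns)
			return false, nil
		}
		return false, err
	}
	return true, nil
}
```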

Based on this, @kfox1111, do you still think the k8s lock (Lease) should be used? I think it might be OK, because I estimate that most people will run the helm client with a user that has namespace admin permissions.

So I'm only asking for a final OK from a maintainer on the Lease solution before starting further work; I personally think the benefits of the Lease slightly outweigh the alternative solution.

gerrnot and others added 5 commits October 15, 2024 17:39
Co-authored-by: Andreas Lindhé <[email protected]>
Signed-off-by: gerrnot <[email protected]>
Co-authored-by: Andreas Lindhé <[email protected]>
Signed-off-by: gerrnot <[email protected]>
Co-authored-by: Andreas Lindhé <[email protected]>
Signed-off-by: gerrnot <[email protected]>