fix(workload-launcher): SocketException handling #350
Conversation
FYI, I've built this image and used it in my production environment, and it successfully fixed the SocketException issues. If anyone has the issue, the image is available here until this PR is merged:
Seems reasonable to me.
@hanslemm, thank you for your contribution. I have some concerns with some of the changes, which I explained inline.
Looking at this issue, I think the bug is that the launcher doesn't recover properly. However, I don't think it should fail because it is unable to reach a dependency.
```diff
 return Failsafe.with(
   RetryPolicy.builder<Any>()
-    .withBackoff(Duration.ofSeconds(20), Duration.ofDays(365))
+    .withBackoff(Duration.ofSeconds(20), Duration.ofMinutes(10))
```
I don't think we should simply change this value.
One of the requirements of the launcher is to make sure we never exceed a given amount of pending workloads. I believe the current logic relies on this retry "forever" bit.
This would effectively allow the launcher to start after 10mins if it isn't able to fetch the list of workloads that were in flight.
Coming back to your initial intent, at the moment, the launcher remains stuck in this state by design.
Hi @gosusnp, thanks for taking the time to review this.
True, I missed the maximum retry policy; the way it was proposed just sets the maximum backoff to 10 minutes but still retries forever (I would need to add a .withMaxDuration(Duration.ofXXX(X))). I also understand your view that this is not necessary and that it should be retried forever.
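For illustration, a minimal sketch of what I mean, assuming Failsafe 3.x; the durations are placeholders and `getWorkloadList` here is just a stand-in, not the real workload-launcher code:

```kotlin
import dev.failsafe.Failsafe
import dev.failsafe.RetryPolicy
import java.net.SocketException
import java.time.Duration

// Hypothetical stand-in for ClaimedProcessor.getWorkloadList(); the real signature differs.
fun getWorkloadList(): List<String> = TODO("call the workload-api")

// Backoff capped at 10 minutes between attempts, plus a total retry budget via withMaxDuration.
val boundedRetry: RetryPolicy<List<String>> = RetryPolicy.builder<List<String>>()
    .handle(SocketException::class.java)                          // retry on SocketException
    .withBackoff(Duration.ofSeconds(20), Duration.ofMinutes(10))  // delay grows from 20s, capped at 10 min
    .withMaxRetries(-1)                                           // no attempt-count limit...
    .withMaxDuration(Duration.ofMinutes(30))                      // ...but give up after 30 minutes total
    .build()

fun fetchWorkloadsOrFail(): List<String> =
    Failsafe.with(boundedRetry).get { getWorkloadList() }         // throws once the retry budget is exhausted
```

Dropping the withMaxDuration line keeps the current retry-forever semantics, only with shorter gaps between attempts.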
Yet, from my perspective, since ClaimedProcessor.getWorkloadList() is not optional for the pod to function but rather sine qua non, I believe it is fair to fail after some threshold, similar to how pipelines fail after X retries. This brings more transparency about the real status of the pod, instead of just logging failed connection attempts with backoffs growing from 20 seconds up to 365 days, and then every 365 days after that. What is the ideal threshold? I am not sure.
I do understand that this forever retry seems relevant to other aspects of the project for reasons that are not 100% clear to me (although I believe nothing should retry forever: after a given limit it should stop and throw an error, for transparency and resource optimization - but this is just my opinion).
Perhaps the ideal would be to create ENV VARs for the backoff, maximum backoff, and maximum retry time limit, so that developers can choose their own thresholds (exposed in the Helm chart).
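Something along these lines, purely as a sketch - the variable names below are hypothetical, nothing like them exists in the chart today:

```kotlin
import java.time.Duration

// Hypothetical environment variables, shown for illustration only.
val initialBackoff: Duration =
    Duration.ofSeconds(System.getenv("LAUNCHER_RETRY_INITIAL_BACKOFF_SECONDS")?.toLong() ?: 20L)
val maxBackoff: Duration =
    Duration.ofMinutes(System.getenv("LAUNCHER_RETRY_MAX_BACKOFF_MINUTES")?.toLong() ?: 10L)
val maxRetryDuration: Duration =
    Duration.ofDays(System.getenv("LAUNCHER_RETRY_MAX_DURATION_DAYS")?.toLong() ?: 365L)
// These would then feed withBackoff(initialBackoff, maxBackoff) and withMaxDuration(maxRetryDuration).
```

The Helm chart would only need to expose these as values on the workload-launcher deployment.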
Furthermore, if we implement the retry mechanism for SocketException alone (without the 10-minute fail-and-throw or an ENV VAR for it), it is, in my view, already a win, with no apparent downsides.
I view the dependency of the workload-launcher on the workload-api the same as an api-server's dependency on a database. In the api-server / database situation, you don't want your liveness to depend on the db, to avoid churning through all the pods if the db were to become unavailable. I believe this is true for the launcher and the workload-api as well.
Now, I am interested in understanding why you are running into this situation more than we do. I suspect we might have a couple of issues; I wonder if one of them is that the launcher starts before the workload-api.
Would you mind sharing some context into what operations you are doing when running into this?
The only time I run into this is when I do a new release.
As you've mentioned, most likely the workload-api is not yet ready by the time the workload-launcher makes the request, and without a retry it cannot recover, forcing me to manually kill the pod.
If the pod is not killed, the pipelines won't run, and they will retry as long as they can (without success, of course).
Killing the pod instantly recovers Airbyte once it is back up.
@gosusnp, can we release this? I already removed your point of concern in the last commit. The current state just retries the SocketException. Thanks.
@hanslemm, sorry for the delay, I haven't forgotten about this; a bit of backlog on my end. What I'd like to understand is the effort of fixing our boot logic compared to extending error handling in a way that could lead to surprises.
So I spent some time trying to understand how we could remain stuck in this loop. Looking at the documentation of Failsafe, I think we might have a few errors in this code that are causing us to not retry as expected.
What I think we need for our expected behavior is for the launcher to keep retrying on these socket errors; a rough sketch follows.
I can spend more time tomorrow to test my theory and fix.
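A sketch of that behavior, assuming Failsafe 3.x - placeholder values only, not the actual fix. One easy-to-miss detail from the Failsafe docs is that a RetryPolicy performs only 3 attempts by default, so without withMaxRetries(-1) it stops retrying almost immediately:

```kotlin
import dev.failsafe.Failsafe
import dev.failsafe.RetryPolicy
import java.net.SocketException
import java.time.Duration

// Retry indefinitely on SocketException, with bounded backoff between attempts.
val retryOnSocketErrors: RetryPolicy<Any> = RetryPolicy.builder<Any>()
    .handle(SocketException::class.java)                         // treat socket errors as retryable
    .withBackoff(Duration.ofSeconds(20), Duration.ofMinutes(5))  // placeholder backoff bounds
    .withMaxRetries(-1)                                          // remove the default 3-attempt limit
    .onRetry { println("getWorkloadList failed with a socket error, retrying") }
    .build()

// Usage sketch; the real code would call ClaimedProcessor.getWorkloadList() inside the lambda.
fun fetchWorkloads(): List<String> =
    Failsafe.with(retryOnSocketErrors).get { listOf("placeholder-workload") }
```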
Thanks @gosusnp.
Small update here: as I started setting up some tests, I found an issue that could leave the
@hanslemm, I merged f75a0fa, which is a mix of making sure we retry on socket exceptions, simplifying some of the nested retry loops (mostly for our own sanity), and making sure those conditions are now tested. Please let us know if you still observe such errors.
We are going to close this. Please open an issue if there are follow-up questions, thanks!
Perfect, thanks guys! I will update my app in the cluster after the Airbyte 1.0 event.
What

These code changes focus on enabling proper handling of SocketException in the workload-launcher, specifically when calling ClaimedProcessor.getWorkloadList(). Before this PR, SocketException errors would go "unnoticed", creating a falsely healthy pod, when the service should either retry or throw an error, since it cannot perform its core functions.

How
1. It retries on SocketException.
2. It limits the maximum retry backoff to 10 minutes instead of 365 days.
3. If the SocketException still occurs after the 10-minute maximum retry window, it throws the error.

With this approach, SocketException can be properly handled and is more likely to be recovered from by the application. If it is not recovered, the thrown error will fail the deployment, which is ideal since without a successful getWorkloadList() call the service is incapable of performing its core functions (creating a falsely healthy pod).

Reasoning
In my deployment, I found that my workload-launcher pod hits a SocketException with Operation not permitted whenever I do an update, which requires a manual deletion of the pod to recover. That is because the SocketException is not retried and, although it is fundamental for the healthy operation of the pod, it also doesn't throw an error - meaning that the pod will hang on as healthy in K8s even though it cannot launch workloads.

Here are the logs that I found in my deployment:
Can this PR be safely reverted and rolled back?