fix app keepalive #42240
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
LGTM. Can we get some kind of test coverage that captures flapping agents not taking down the ICS?
64de56a to 0bfaaa5
@fspmarshall See the table below for backport results.
Fixes an issue where an individual app keepalive failure would kill the gRPC stream over which all apps were being heartbeated.
There is at least one case (when an agent decides to stop serving a given app) where keepalives are expected to fail. Additionally, keepalives are a nice-to-have feature rather than something critical to Teleport's function. Rather than trying to suppress failures only in the expected case (which can be tricky to distinguish from an unexpected failure), this change always stops keepalive attempts for a given app once we hit multiple consecutive failures. This keeps the overall logic simple and has little downside, since a spurious stop gets fixed at the next heartbeat anyway.
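For reviewers who want the idea in isolation, here is a minimal sketch of the per-app consecutive-failure logic described above. It is not the code in this PR: `keepAliveTracker`, `Record`, `Heartbeat`, and the failure limit are hypothetical names and values chosen for illustration, and the real change lives in the stream handling code.

```go
package main

// keepAliveTracker is a hypothetical helper (not part of Teleport) that
// counts consecutive keepalive failures per app, so that one failing app
// never tears down the stream shared by all apps.
type keepAliveTracker struct {
	limit    int             // consecutive failures allowed before stopping (assumed value)
	failures map[string]int  // current consecutive failure count per app
	stopped  map[string]bool // apps whose keepalives are currently stopped
}

func newKeepAliveTracker(limit int) *keepAliveTracker {
	return &keepAliveTracker{
		limit:    limit,
		failures: make(map[string]int),
		stopped:  make(map[string]bool),
	}
}

// Record notes the outcome of one keepalive attempt for app and reports
// whether keepalives for that app should continue. A failure only affects
// the app it belongs to.
func (t *keepAliveTracker) Record(app string, ok bool) bool {
	if t.stopped[app] {
		return false
	}
	if ok {
		// Any success resets the streak, so only consecutive failures count.
		t.failures[app] = 0
		return true
	}
	t.failures[app]++
	if t.failures[app] >= t.limit {
		// Stop keepalives for this app only; no attempt is made to tell
		// expected failures from unexpected ones.
		t.stopped[app] = true
		return false
	}
	return true
}

// Heartbeat re-arms keepalives for app, which is how a spurious stop
// recovers at the next heartbeat.
func (t *keepAliveTracker) Heartbeat(app string) {
	delete(t.stopped, app)
	t.failures[app] = 0
}
```

With a limit of 2, two back-to-back failures for one app stop keepalives for that app alone, other apps keep going, and a later call to `Heartbeat` for the stopped app resumes them.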
changelog: Fixed an issue where removing an app could make Teleport app agents incorrectly report as unhealthy for a short time.