Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix app keepalive #42240

Merged
merged 1 commit into from
Jun 1, 2024
Merged

fix app keepalive #42240

merged 1 commit into from
Jun 1, 2024

Conversation

fspmarshall
Copy link
Contributor

@fspmarshall fspmarshall commented May 31, 2024

Fixes an issue where individual app keepalives failures would kill the GRPC stream over which all apps were being heartbeat.

There is at least one case (when an agent decides to stop serving a given app), where keepalives are expected/supposed to fail. Additionally, keepalives are a nice-to-have feature rather than something critical for teleport function. Rather than try to only suppress failure in the expected case (which can be a bit tricky to distinguish from an unexpected failure), we this change just always stops keep alive attempts for a given app if we hit multiple consecutive failures. This keeps the overall logic simple, and doesn't have much of a downside since a spurious stop will get fixed at the next heartbeat anyway.

changelog: fixed an issue where removing an app could make teleport app agents incorrectly report as unhealthy for a short time.

Copy link

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

@github-actions github-actions bot requested review from tcsc and tigrato May 31, 2024 19:21
Copy link

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

Copy link
Contributor

@rosstimothy rosstimothy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Can we get some kind of test coverage that captures flapping agents not taking down the ICS?

@fspmarshall fspmarshall force-pushed the fspmarshall/fix-app-keepalive branch from 64de56a to 0bfaaa5 Compare May 31, 2024 21:57
@fspmarshall fspmarshall requested a review from rosstimothy May 31, 2024 21:58
@public-teleport-github-review-bot public-teleport-github-review-bot bot removed the request for review from tcsc June 1, 2024 10:24
@fspmarshall fspmarshall added this pull request to the merge queue Jun 1, 2024
Merged via the queue into master with commit 0aa5285 Jun 1, 2024
37 checks passed
@fspmarshall fspmarshall deleted the fspmarshall/fix-app-keepalive branch June 1, 2024 14:05
@public-teleport-github-review-bot

@fspmarshall See the table below for backport results.

Branch Result
branch/v13 Create PR
branch/v14 Create PR
branch/v15 Create PR
branch/v16 Create PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants