Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

heartbeat slightly more frequently on disruption #44715

Merged
merged 1 commit into from
Aug 12, 2024

Conversation

fspmarshall
Copy link
Contributor

This PR updates stream-based heartbeat logic to slightly increase the announce frequency for the first heartbeat after a disruption. It has been noticed during testing that if the auth server fails in a manner that doesn't result in the stream being immediately closed (deadlock, backend i/o timeout, etc), the agent may erroneously believe its most recent heartbeat was successful when it wasn't. This can lead to an agent's backend resource becoming very stale. If other factors (e.g. cpu constraints on agent, auth, or both) are at play, it may even allow the agent's heartbeat state in the backend to briefly expire, rendering it momentarily undialable. This change is intended to mitigate this issue without significantly increasing backend writes on a mass reconnection event.

@fspmarshall fspmarshall added no-changelog Indicates that a PR does not require a changelog entry backport/branch/v14 backport/branch/v15 backport/branch/v16 labels Jul 26, 2024
@github-actions github-actions bot requested review from fheinecke and ryanclark July 26, 2024 22:27
@fspmarshall fspmarshall force-pushed the fspmarshall/disruption-heartbeat-interval branch from be3f9ca to 921cfa2 Compare July 29, 2024 16:02
Comment on lines 30 to 31
// TestLastTick verifies that the LastTick method returns the last tick time as expected. Due to the
// flaky nature of time-based tests runs many cases and passes if >50% of them succeed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use a fake clock instead?

Comment on lines +134 to +118
// TestIntervalResetTo verifies the basic behavior of the interval ResetTo method.
// Since time based tests tend to be flaky, this test passes if it has a >50% success
// rate (i.e. >50% of ResetTo calls seemed to have changed the timer's behavior as expected).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use a fake clock instead?

@fspmarshall fspmarshall force-pushed the fspmarshall/disruption-heartbeat-interval branch from 921cfa2 to 2defa53 Compare August 5, 2024 16:37
@public-teleport-github-review-bot public-teleport-github-review-bot bot removed the request for review from fheinecke August 7, 2024 19:50
@fspmarshall fspmarshall added this pull request to the merge queue Aug 12, 2024
Merged via the queue into master with commit 983da44 Aug 12, 2024
35 checks passed
@fspmarshall fspmarshall deleted the fspmarshall/disruption-heartbeat-interval branch August 12, 2024 23:26
@public-teleport-github-review-bot

@fspmarshall See the table below for backport results.

Branch Result
branch/v14 Create PR
branch/v15 Create PR
branch/v16 Create PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants