-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[teleport-update] PID-based failure detection and rollback #49175
Conversation
g.Go(func() error { | ||
return tickFile(ctx, s.PIDPath, pidC, tickC) | ||
}) | ||
err := s.waitForStablePID(ctx, minRunningIntervalsBeforeStable, maxCrashesBeforeFailure, | ||
initPID, pidC, func(pid int) error { | ||
p, err := os.FindProcess(pid) | ||
if err != nil { | ||
return trace.Wrap(err) | ||
} | ||
return trace.Wrap(p.Signal(syscall.Signal(0))) | ||
}) | ||
cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is the asynchronous approach necessary here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separating the PID file read and monitoring logic into separate gorountines made both parts easier to test. For example, with a single goroutine I'd have to write the PID file before or after the tick, when waitForStablePID could still be reading it. There are definitely other approaches to detecting whether waitForStablePID is finished reading (e.g., hooking verifyPID), but this was easiest for me to reason about.
This PR adds support for detecting various failures scenarios that could leave agents inaccessible during agent auto-updates.
teleport-update
will rollback the agent version immediately in these cases.This is implemented by monitoring
/run/teleport.pid
for changes that indicate different failure modes. For example, if Teleport crashes after a softsystemctl reload
, systemd is unaware, and a stale PID can be detected in/run/teleport.pid
with no running process. Alternatively, if Teleport crashes after a hard restart, the PID file is rapidly created/removed with different PID values that can be detected. Other cases, such as hanging on quit, are covered as well. This catches fatal errors in new versions, as well as client-too-new errors (e.g., v17 agent on v16 cluster).Notably, connection failures, including clients rejected by the server for being outdated, do not trigger a revert.
Compared the previous version of this functionality implemented in
teleport-ent-upgrader
, this new logic detects changes faster (<30s vs. >5m), and rolls the version back instead of rolling it forward. Additionally, this logic detects failures after a soft reload, while the previous logic did not in most cases.This is the eighth in a series of PRs implementing
teleport-update
:Setup Command: #49174
Link Command: #48712
Update Command: #48244
Reloading with rollbacks: #47929
Linking: #47879
Enable Command: #47565
Initial scaffolding PR: #46418
The
teleport-update
binary will be used to enable, disable, and trigger automatic Teleport agent updates. The new auto-updates system manages a local installation of the cluster-specified version of Teleport stored in/var/lib/teleport/versions
.RFD: #47126
Goal (internal): https://github.com/gravitational/cloud/issues/10289
Example: Upgrading to v17 on a v16 cluster, with successful rollback.