[teleport-update] PID-based failure detection and rollback #49175

Merged: 3 commits, Nov 19, 2024

@sclevine sclevine commented Nov 19, 2024

This PR adds support for detecting various failure scenarios that could leave agents inaccessible during agent auto-updates. teleport-update will roll back the agent version immediately in these cases.

This is implemented by monitoring /run/teleport.pid for changes that indicate different failure modes. For example, if Teleport crashes after a soft systemctl reload, systemd is unaware, and a stale PID can be detected in /run/teleport.pid with no running process. Alternatively, if Teleport crashes after a hard restart, the PID file is rapidly created/removed with different PID values that can be detected. Other cases, such as hanging on quit, are covered as well. This catches fatal errors in new versions, as well as client-too-new errors (e.g., v17 agent on v16 cluster).

Notably, connection failures, including clients rejected by the server for being outdated, do not trigger a revert.

Compared to the previous version of this functionality implemented in teleport-ent-upgrader, this new logic detects changes faster (<30s vs. >5m), and rolls the version back instead of rolling it forward. Additionally, this logic detects failures after a soft reload, while the previous logic did not in most cases.

This is the eighth in a series of PRs implementing teleport-update:
Setup Command: #49174
Link Command: #48712
Update Command: #48244
Reloading with rollbacks: #47929
Linking: #47879
Enable Command: #47565
Initial scaffolding PR: #46418

The teleport-update binary will be used to enable, disable, and trigger automatic Teleport agent updates. The new auto-updates system manages a local installation of the cluster-specified version of Teleport stored in /var/lib/teleport/versions.

RFD: #47126
Goal (internal): https://github.com/gravitational/cloud/issues/10289

Example: Upgrading to v17 on a v16 cluster, with successful rollback.

Nov 19 02:44:44 legendary-mite systemd[1]: Starting teleport-update.service - Teleport auto-update service...
Nov 19 02:44:45 legendary-mite teleport-update[595947]: 2024-11-19T02:44:45Z INFO [UPDATER]   Update available. Initiating update. target_version:17.0.1 active_version:16.4.7 agent/updater.go:475
Nov 19 02:45:46 legendary-mite teleport-update[595947]: 2024-11-19T02:45:46Z INFO [UPDATER]   Version already present. version:17.0.1 agent/installer.go:153
Nov 19 02:45:46 legendary-mite teleport-update[595947]: 2024-11-19T02:45:46Z INFO [UPDATER]   Executing new teleport-update binary to update configuration. agent/updater.go:185
Nov 19 02:45:46 legendary-mite teleport-update[596859]: 2024-11-19T02:45:46Z INFO [UPDATER]   Systemd configuration synced. agent/process.go:253
Nov 19 02:45:46 legendary-mite teleport-update[596859]: 2024-11-19T02:45:46Z INFO [UPDATER]   Service enabled. unit:teleport-update.timer agent/process.go:270
Nov 19 02:45:46 legendary-mite teleport-update[595947]: 2024-11-19T02:45:46Z INFO [UPDATER]   Finished executing new teleport-update binary. agent/updater.go:187
Nov 19 02:45:47 legendary-mite teleport-update[595947]: 2024-11-19T02:45:47Z INFO [UPDATER]   Systemd configuration synced. agent/process.go:253
Nov 19 02:45:47 legendary-mite teleport-update[595947]: 2024-11-19T02:45:47Z INFO [UPDATER]   Target version successfully installed. target_version:17.0.1 agent/updater.go:568
Nov 19 02:45:47 legendary-mite teleport-update[595947]: 2024-11-19T02:45:47Z INFO [UPDATER]   Gracefully reloaded. unit:teleport.service agent/process.go:110
Nov 19 02:45:47 legendary-mite teleport-update[595947]: 2024-11-19T02:45:47Z INFO [UPDATER]   Monitoring PID file to detect crashes. unit:teleport.service agent/process.go:113
Nov 19 02:45:51 legendary-mite teleport-update[595947]: 2024-11-19T02:45:51Z WARN [UPDATER]   Detected stale PID. unit:teleport.service pid:597038 agent/process.go:194
Nov 19 02:45:57 legendary-mite teleport-update[595947]: 2024-11-19T02:45:57Z ERRO [UPDATER]   Reverting symlinks due to failed restart. agent/updater.go:578
Nov 19 02:45:57 legendary-mite teleport-update[595947]: 2024-11-19T02:45:57Z INFO [UPDATER]   Systemd configuration synced. agent/process.go:253
Nov 19 02:45:57 legendary-mite teleport-update[595947]: 2024-11-19T02:45:57Z ERRO [UPDATER]   [stderr] Job for teleport.service failed. agent/process.go:362
Nov 19 02:45:57 legendary-mite teleport-update[595947]: 2024-11-19T02:45:57Z ERRO [UPDATER]   [stderr] See "systemctl status teleport.service" and "journalctl -xeu teleport.service" for details. agent/process.go:368
Nov 19 02:45:57 legendary-mite teleport-update[595947]: 2024-11-19T02:45:57Z ERRO [UPDATER]   Error running systemctl. args:[reload teleport.service] code:1 agent/process.go:298
Nov 19 02:45:57 legendary-mite teleport-update[595947]: 2024-11-19T02:45:57Z WARN [UPDATER]   Service ungracefully restarted. Connections potentially dropped. unit:teleport.service agent/process.go:108
Nov 19 02:45:57 legendary-mite teleport-update[595947]: 2024-11-19T02:45:57Z INFO [UPDATER]   Monitoring PID file to detect crashes. unit:teleport.service agent/process.go:113
Nov 19 02:46:11 legendary-mite teleport-update[595947]: 2024-11-19T02:46:11Z WARN [UPDATER]   Teleport updater encountered an error during the update and successfully reverted the installation. agent/updater.go:586
Nov 19 02:46:11 legendary-mite teleport-update[595947]: ERROR: failed to start new version "17.0.1" of Teleport: detected crashing process
Nov 19 02:46:11 legendary-mite systemd[1]: teleport-update.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 02:46:11 legendary-mite systemd[1]: teleport-update.service: Failed with result 'exit-code'.
Nov 19 02:46:11 legendary-mite systemd[1]: Failed to start teleport-update.service - Teleport auto-update service.
^C
ubuntu@legendary-mite:~$ ls -la /usr/local/bin/
total 12
drwxr-xr-x  2 root root 4096 Nov 19 02:45 .
drwxr-xr-x 11 root root 4096 Nov 13 01:41 ..
lrwxrwxrwx  1 root root   53 Nov 19 02:45 fdpass-teleport -> /var/lib/teleport/versions/16.4.7/bin/fdpass-teleport
lrwxrwxrwx  1 root root   42 Nov 19 02:45 tbot -> /var/lib/teleport/versions/16.4.7/bin/tbot
lrwxrwxrwx  1 root root   42 Nov 19 02:45 tctl -> /var/lib/teleport/versions/16.4.7/bin/tctl
lrwxrwxrwx  1 root root   46 Nov 19 02:45 teleport -> /var/lib/teleport/versions/16.4.7/bin/teleport
lrwxrwxrwx  1 root root   65 Nov 17 22:29 teleport-update -> /home/ubuntu/mounts/teleport/
ubuntu@legendary-mite:~$ systemctl status teleport
● teleport.service - Teleport Service
     Loaded: loaded (/usr/lib/systemd/system/teleport.service; enabled; preset: enabled)
     Active: active (running) since Tue 2024-11-19 03:09:20 UTC; 4min 42s ago

@sclevine sclevine added the no-changelog Indicates that a PR does not require a changelog entry label Nov 19, 2024
@github-actions github-actions bot requested review from Joerger and r0mant November 19, 2024 05:40
Comment on lines +131 to +142
g.Go(func() error {
	return tickFile(ctx, s.PIDPath, pidC, tickC)
})
err := s.waitForStablePID(ctx, minRunningIntervalsBeforeStable, maxCrashesBeforeFailure,
	initPID, pidC, func(pid int) error {
		p, err := os.FindProcess(pid)
		if err != nil {
			return trace.Wrap(err)
		}
		return trace.Wrap(p.Signal(syscall.Signal(0)))
	})
cancel()

nit: is the asynchronous approach necessary here?

@sclevine sclevine Nov 19, 2024

Separating the PID file read and monitoring logic into separate goroutines made both parts easier to test. For example, with a single goroutine I'd have to write the PID file before or after the tick, when waitForStablePID could still be reading it. There are definitely other approaches to detecting whether waitForStablePID is finished reading (e.g., hooking verifyPID), but this was easiest for me to reason about.

@sclevine sclevine added this pull request to the merge queue Nov 19, 2024
Merged via the queue into master with commit 4946e28 Nov 19, 2024
40 checks passed
@sclevine sclevine deleted the sclevine/teleport-update-pid branch November 19, 2024 20:13