[teleport-update] PID-based failure detection and rollback #49175

sclevine · 2024-11-19T05:40:11Z

This PR adds support for detecting various failure scenarios that could leave agents inaccessible during agent auto-updates. In these cases, teleport-update immediately rolls back the agent version.

This is implemented by monitoring /run/teleport.pid for changes that indicate different failure modes. For example, if Teleport crashes after a soft systemctl reload, systemd is unaware, and a stale PID can be detected in /run/teleport.pid with no running process. Alternatively, if Teleport crashes after a hard restart, the PID file is rapidly created/removed with different PID values that can be detected. Other cases, such as hanging on quit, are covered as well. This catches fatal errors in new versions, as well as client-too-new errors (e.g., v17 agent on v16 cluster).

Notably, connection failures, including clients rejected by the server for being outdated, do not trigger a revert.

Compared to the previous version of this functionality implemented in teleport-ent-upgrader, this new logic detects changes faster (<30s vs. >5m), and rolls the version back instead of rolling it forward. Additionally, this logic detects failures after a soft reload, while the previous logic did not in most cases.

This is the eighth in a series of PRs implementing teleport-update:
Setup Command: #49174
Link Command: #48712
Update Command: #48244
Reloading with rollbacks: #47929
Linking: #47879
Enable Command: #47565
Initial scaffolding PR: #46418

The teleport-update binary will be used to enable, disable, and trigger automatic Teleport agent updates. The new auto-updates system manages a local installation of the cluster-specified version of Teleport stored in /var/lib/teleport/versions.

RFD: #47126
Goal (internal): https://github.com/gravitational/cloud/issues/10289

Example: Upgrading to v17 on a v16 cluster, with successful rollback.

Nov 19 02:44:44 legendary-mite systemd[1]: Starting teleport-update.service - Teleport auto-update service...
Nov 19 02:44:45 legendary-mite teleport-update[595947]: 2024-11-19T02:44:45Z INFO [UPDATER]   Update available. Initiating update. target_version:17.0.1 active_version:16.4.7 agent/updater.go:475
Nov 19 02:45:46 legendary-mite teleport-update[595947]: 2024-11-19T02:45:46Z INFO [UPDATER]   Version already present. version:17.0.1 agent/installer.go:153
Nov 19 02:45:46 legendary-mite teleport-update[595947]: 2024-11-19T02:45:46Z INFO [UPDATER]   Executing new teleport-update binary to update configuration. agent/updater.go:185
Nov 19 02:45:46 legendary-mite teleport-update[596859]: 2024-11-19T02:45:46Z INFO [UPDATER]   Systemd configuration synced. agent/process.go:253
Nov 19 02:45:46 legendary-mite teleport-update[596859]: 2024-11-19T02:45:46Z INFO [UPDATER]   Service enabled. unit:teleport-update.timer agent/process.go:270
Nov 19 02:45:46 legendary-mite teleport-update[595947]: 2024-11-19T02:45:46Z INFO [UPDATER]   Finished executing new teleport-update binary. agent/updater.go:187
Nov 19 02:45:47 legendary-mite teleport-update[595947]: 2024-11-19T02:45:47Z INFO [UPDATER]   Systemd configuration synced. agent/process.go:253
Nov 19 02:45:47 legendary-mite teleport-update[595947]: 2024-11-19T02:45:47Z INFO [UPDATER]   Target version successfully installed. target_version:17.0.1 agent/updater.go:568
Nov 19 02:45:47 legendary-mite teleport-update[595947]: 2024-11-19T02:45:47Z INFO [UPDATER]   Gracefully reloaded. unit:teleport.service agent/process.go:110
Nov 19 02:45:47 legendary-mite teleport-update[595947]: 2024-11-19T02:45:47Z INFO [UPDATER]   Monitoring PID file to detect crashes. unit:teleport.service agent/process.go:113
Nov 19 02:45:51 legendary-mite teleport-update[595947]: 2024-11-19T02:45:51Z WARN [UPDATER]   Detected stale PID. unit:teleport.service pid:597038 agent/process.go:194
Nov 19 02:45:57 legendary-mite teleport-update[595947]: 2024-11-19T02:45:57Z ERRO [UPDATER]   Reverting symlinks due to failed restart. agent/updater.go:578
Nov 19 02:45:57 legendary-mite teleport-update[595947]: 2024-11-19T02:45:57Z INFO [UPDATER]   Systemd configuration synced. agent/process.go:253
Nov 19 02:45:57 legendary-mite teleport-update[595947]: 2024-11-19T02:45:57Z ERRO [UPDATER]   [stderr] Job for teleport.service failed. agent/process.go:362
Nov 19 02:45:57 legendary-mite teleport-update[595947]: 2024-11-19T02:45:57Z ERRO [UPDATER]   [stderr] See "systemctl status teleport.service" and "journalctl -xeu teleport.service" for details. agent/process.go:368
Nov 19 02:45:57 legendary-mite teleport-update[595947]: 2024-11-19T02:45:57Z ERRO [UPDATER]   Error running systemctl. args:[reload teleport.service] code:1 agent/process.go:298
Nov 19 02:45:57 legendary-mite teleport-update[595947]: 2024-11-19T02:45:57Z WARN [UPDATER]   Service ungracefully restarted. Connections potentially dropped. unit:teleport.service agent/process.go:108
Nov 19 02:45:57 legendary-mite teleport-update[595947]: 2024-11-19T02:45:57Z INFO [UPDATER]   Monitoring PID file to detect crashes. unit:teleport.service agent/process.go:113
Nov 19 02:46:11 legendary-mite teleport-update[595947]: 2024-11-19T02:46:11Z WARN [UPDATER]   Teleport updater encountered an error during the update and successfully reverted the installation. agent/updater.go:586
Nov 19 02:46:11 legendary-mite teleport-update[595947]: ERROR: failed to start new version "17.0.1" of Teleport: detected crashing process
Nov 19 02:46:11 legendary-mite systemd[1]: teleport-update.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 02:46:11 legendary-mite systemd[1]: teleport-update.service: Failed with result 'exit-code'.
Nov 19 02:46:11 legendary-mite systemd[1]: Failed to start teleport-update.service - Teleport auto-update service.
^C
ubuntu@legendary-mite:~$ ls -la /usr/local/bin/
total 12
drwxr-xr-x  2 root root 4096 Nov 19 02:45 .
drwxr-xr-x 11 root root 4096 Nov 13 01:41 ..
lrwxrwxrwx  1 root root   53 Nov 19 02:45 fdpass-teleport -> /var/lib/teleport/versions/16.4.7/bin/fdpass-teleport
lrwxrwxrwx  1 root root   42 Nov 19 02:45 tbot -> /var/lib/teleport/versions/16.4.7/bin/tbot
lrwxrwxrwx  1 root root   42 Nov 19 02:45 tctl -> /var/lib/teleport/versions/16.4.7/bin/tctl
lrwxrwxrwx  1 root root   46 Nov 19 02:45 teleport -> /var/lib/teleport/versions/16.4.7/bin/teleport
lrwxrwxrwx  1 root root   65 Nov 17 22:29 teleport-update -> /home/ubuntu/mounts/teleport/
ubuntu@legendary-mite:~$ systemctl status teleport
● teleport.service - Teleport Service
     Loaded: loaded (/usr/lib/systemd/system/teleport.service; enabled; preset: enabled)
     Active: active (running) since Tue 2024-11-19 03:09:20 UTC; 4min 42s ago

hugoShaka · 2024-11-19T14:25:26Z

lib/autoupdate/agent/process.go

+	g.Go(func() error {
+		return tickFile(ctx, s.PIDPath, pidC, tickC)
+	})
+	err := s.waitForStablePID(ctx, minRunningIntervalsBeforeStable, maxCrashesBeforeFailure,
+		initPID, pidC, func(pid int) error {
+			p, err := os.FindProcess(pid)
+			if err != nil {
+				return trace.Wrap(err)
+			}
+			return trace.Wrap(p.Signal(syscall.Signal(0)))
+		})
+	cancel()


nit: is the asynchronous approach necessary here?

Splitting the PID file read and the monitoring logic into separate goroutines made both parts easier to test. For example, with a single goroutine I'd have to write the PID file before or after the tick, when waitForStablePID could still be reading it. There are definitely other approaches to detecting whether waitForStablePID is finished reading (e.g., hooking verifyPID), but this was easiest for me to reason about.

Extract from other PR

eaedbbb

sclevine added the no-changelog label Nov 19, 2024

sclevine requested review from hugoShaka and vapopov November 19, 2024 05:40

github-actions bot added the size/md label Nov 19, 2024

github-actions bot requested review from Joerger and r0mant November 19, 2024 05:40

sclevine mentioned this pull request Nov 19, 2024

[teleport-update] Add support for systemd process management #49102

Closed

sclevine added 2 commits November 19, 2024 01:00

comments

cae835a

string

6e0bdc8

hugoShaka approved these changes Nov 19, 2024


hugoShaka reviewed Nov 19, 2024


vapopov approved these changes Nov 19, 2024


sclevine added this pull request to the merge queue Nov 19, 2024

Merged via the queue into master with commit 4946e28 Nov 19, 2024
40 checks passed

sclevine deleted the sclevine/teleport-update-pid branch November 19, 2024 20:13
