diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md new file mode 100644 index 0000000000000..ae843482bb225 --- /dev/null +++ b/rfd/0184-agent-auto-updates.md @@ -0,0 +1,1969 @@ +--- +authors: Stephen Levine (stephen.levine@goteleport.com) & Hugo Hervieux (hugo.hervieux@goteleport.com) +state: draft +--- + +# RFD 0184 - Agent Automatic Updates + +## Required Approvers + +* Engineering: @russjones +* Product: @klizhentas || @xinding33 +* Security: Doyensec + +## What + +This RFD proposes a new mechanism for scheduled, automatic updates of Teleport agents. + +Users of Teleport will be able to use the tctl CLI to specify desired versions, update schedules, and a rollout strategy. + +Agents will be updated by a new `teleport-update` binary, built from `tools/teleport-update` in the Teleport repository. + +All agent installations are in-scope for this proposal, including agents installed on Linux servers and Kubernetes. + +The following anti-goals are out-of-scope for this proposal, but will be addressed in future RFDs: +- Signing of agent artifacts (e.g., via TUF) +- Teleport Cloud APIs for updating agents +- Improvements to the local functionality of the Kubernetes agent for better compatibility with FluxCD and ArgoCD +- Support for progressive rollouts of tbot, when not installed on the same system as a Teleport agent + +This RFD proposes a specific implementation of several sections in https://github.com/gravitational/teleport/pull/39217. + +Additionally, this RFD parallels the auto-update functionality for client tools proposed in https://github.com/gravitational/teleport/pull/39805. + +## Why + +1. We want customers to run the latest release of Teleport so that they are secure and have access to the latest + features. +2. We do not want customers to deal with the pain of updating agents installed on their own infrastructure. +3. We want to reduce the operational cost of customers running old agents. + For Cloud customers, this will allow us to support fewer simultaneous cluster versions and reduce support load. + For self-hosted customers, this will reduce support load associated with debugging old versions of Teleport. +4. Providing 99.99% availability for customers requires us to maintain high availability at the agent-level + as well as the cluster-level. + +The current systemd updater does not meet those requirements: +- The use of package managers (apt and yum) to apply updates leads users to accidentally upgrade Teleport. +- The installation process is complex, and users often end up installing the wrong version of Teleport. +- The update process does not provide sufficient safeties to protect against broken agent updates. +- Customers decline to adopt the existing updater because they want more control over when updates occur. +- We do not offer a nice user experience for self-hosted users. This results in a marginal automatic updates + adoption and does not reduce the support cost associated with upgrading self-hosted clusters. + +## How + +The new agent automatic updates system will rely on a separate `teleport-update` binary controlling which Teleport version is +installed. Automatic updates will be implemented incrementally: + +- Phase 1: Introduce a new, self-updating updater binary which does not rely on package managers. Allow tctl to roll out updates to all agents. +- Phase 2: Add the ability for the agent updater to immediately revert a faulty update. +- Phase 3: Introduce the concept of agent update groups and make users chose the order in which groups are updated. +- Phase 4: Add a feedback mechanism for the Teleport inventory to track the agents of each group and their update status. +- Phase 5: Add the canary deployment strategy: a few agents are updated first, if they don't die, the whole group is updated. +- Phase 6: Add the ability to perform slow and incremental version rollouts within an agent update group. +- Phase 7: If needed, backup local agent DB and restore during agent rollbacks. + +The updater will be usable after phase 1 and will gain new capabilities after each phase. +After phase 2, the new updater will have feature-parity with the existing updater script. +The existing auto-updates mechanism will remain unchanged and fully-functional throughout the process. +It will be deprecated in the future. + +Future phases might change as we are working on the implementation and collecting real-world feedback and experience. + +We will introduce two user-facing resources: + +1. The `autoupdate_config` resource, owned by the Teleport user. This resource allows Teleport users to configure: + - Whether automatic updates are enabled, disabled, or temporarily suspended + - The order in which agents should be updated (`dev` before `staging` before `prod`) + - Days and hours when agent updates should start + - Configuration for client auto-updates (e.g., `tsh` and `tctl`), which are out-of-scope for this RFD + + The resource will look like: + ```yaml + kind: autoupdate_config + spec: + # existing field, deprecated + tools_autoupdate: true/false + # new fields + tools: + mode: enabled/disabled + agents: + mode: enabled/disabled/suspended + schedules: + regular: + - name: dev + days: ["Mon", "Tue", "Wed", "Thu"] + start_hour: 0 + alert_after: 4h + canary_count: 5 # added in phase 5 + max_in_flight: 20% # added in phase 6 + - name: prod + days: ["Mon", "Tue", "Wed", "Thu"] + start_hour: 0 + wait_days: 1 # update this group at least 1 day after the previous one + alert_after: 4h + canary_count: 5 # added in phase 5 + max_in_flight: 20% # added in phase 6 + ``` + +2. The `autoupdate_version` resource, with `spec` owned by the Teleport cluster administrator (e.g. Teleport Cloud operators). + ```yaml + kind: autoupdate_version + spec: + tools: + target_version: vX + agents: + start_version: v1 + target_version: v2 + schedule: regular + strategy: halt-on-failure + mode: enabled + ``` + +We will also introduce an internal resource, tracking the agent rollout status. This resource is +owned by Teleport. Users and cluster operators can read its content but cannot create/update/upsert/delete it. +This resource is editable via select RPCs (e.g. start or rollback a group). + +The system will look like: + +```mermaid +flowchart TD + user(fa:fa-user User) + operator(fa:fa-user Operator) + auth[Auth Service] + proxy[Proxy Service] + updater[teleport-updater] + agent[Teleport Agent] + + autoupdate_config@{shape: notch-rect} + autoupdate_version@{shape: notch-rect} + autoupdate_rollout@{shape: notch-rect} + updater_status@{shape: notch-rect, label: "updater.yaml"} + + user -->|defines update schedule| autoupdate_config + operator -->|choses target version| autoupdate_version + autoupdate_config --> auth + autoupdate_version --> auth + auth -->|Describes desired state for each agent group| autoupdate_rollout + autoupdate_rollout --> proxy + proxy -->|Serves update instructions
via /find| updater + updater -->|Writes status| updater_status + updater_status --> agent + agent -->|Reports version and status via HelloMessage and InstanceHeartbeat| auth +``` + +You can find more details about each resource field [in the dedicated resource section](#teleport-resources). + +## Details + +This section contains the proposed implementation details and is mainly relevant for Teleport developers and curious +users who want to know the motivations behind this specific design. + +### Product requirements + +The following product requirements were defined by our leadership team: + +1. Phased rollout for Cloud tenants. We should be able to control the agent version per-tenant. + +2. Bucketed rollout that customers have control over. + - Control the bucket update day + - Control the bucket update hour + - Ability to pause a rollout + +3. Customers should be able to run "apt-get upgrade" without updating Teleport. + + Installation from a package manager should be possible, but the version should still be controlled by Teleport. + +4. Self-managed updates should be a first class citizen. Teleport must advertise the desired agent and client version. + +5. Self-hosted customers should be supported, for example, customers whose own internal customer is running a Teleport agent. + +6. Upgrading leaf clusters is out-of-scope. + +7. Rolling back after a broken update should be supported. Roll forward gets you 99.9%, we need rollback for 99.99%. + +8. We should have high quality metrics that report the version they are running and if they are running automatic + updates. For both users and us. + +9. Best effort should be made so automatic updates should be applied in a way that sessions are not terminated. (Currently only supported for SSH) + +10. All backends should be supported. + +11. Teleport Discover installation (curl one-liner) should be supported. + +12. We need to support Docker image repository mirrors and Teleport artifact mirrors. + +13. I should be able to install an auto-updating deployment of Teleport via whatever mechanism I want to, including OS packages such as apt and yum. + +14. If new instances join a bucket outside the upgrade window, and you are within your compatibility window, wait until your next group update start. + If you are not within your compatibility window, attempt to upgrade right away. + +15. If an agent comes back online after some period of time, and it is still compatible with + control plane, it should wait until the next upgrade window to be upgraded. + +16. Regular agent updates for Cloud tenants should complete in less than a week. + (Select tenants may support longer schedules, at the Cloud team's discretion.) + +17. A Cloud customer should be able to pause, resume, and rollback and existing rollout schedule. + A Cloud customer should not be able to create new rollout schedules. + + Teleport can create as many rollout schedules as it wants. + +18. A user logged-in to the agent host should be able to disable agent auto-updates and pin a version for that particular host. + +### User Stories + +#### As a Teleport Cloud operator I want to be able to update customers agents to a newer Teleport version + +
+Before + +```shell +tctl autoupdate agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v1 +# New version: v2 +# Status: enabled +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev complete YYYY-MM-DD HHh 120 115 2 +# staging complete YYYY-MM-D2 HHh 20 20 0 +# prod not started 234 0 0 +``` +
+ +I run +```bash +tctl autoupdate agent-plan new-target v3 +# created new rollout from v2 to v3 +``` + +
+After + +```shell +tctl autoupdate agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v2 +# New version: v3 +# Status: enabled +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev not started 120 0 0 +# staging not started 20 0 0 +# prod not started 234 0 0 +``` + +
+ +Now, new agents will install v2 by default, and v3 after the maintenance. + +> [!NOTE] +> If the previous maintenance was not finished, I will install v2 on new prod agents while the rest of prod is still running v1. +> This is expected as we don't want to keep track of an infinite number of versions. +> +> If this is an issue I can create a v1 -> v3 rollout instead. +> +> ```bash +> tctl autoupdate agent-plan new-target v3 --previous-version v1 +> # created new update plan from v1 to v3 +> ``` + +#### As a Teleport Cloud operator I want to minimize damage caused by broken versions to ensure we maintain 99.99% availability + +##### Failure mode 1(a): the new version crashes + +I create a new deployment with a broken version. The version is deployed to a few instances picked randomly. +Those instances are called the canaries. As the new version has an issue, one or many of those canary instances can't run the +new version and their updater has to revert to the previous one. The agents connect back online and +advertise they have failed to update. The maintenance is stuck until every instance that got selected to test the new version +is back online, and running the new version. + +
+Autoupdate agent rollout + +```yaml +kind: autoupdate_agent_rollout +spec: + version_config: + start_version: v1 + target_version: v2 + schedule: regular + strategy: halt-on-failure + mode: enabled +status: + groups: + - name: dev + start_time: 2020-12-09T16:09:53+00:00 + initial_count: 100 + present_count: 100 + failed_count: 0 + progress: 0 + state: canaries + canaries: + - updater_uuid: abc + host_uuid: def + hostname: foo.example.com + success: false + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: canaryTesting + - name: staging + start_time: 0000-00-00 + initial_count: 0 + present_count: 0 + failed_count: 0 + progress: 0 + state: unstarted + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: newAgentPlan +``` +
+ +I and the customer get an alert if the test instances are not running the expected version after an hour. +Teleport cloud operators and the customer can look up the hostname and host UUID of the test instances +to identify which one(s) failed to update and go troubleshoot. + +Customers receive cluster alerts, while Cloud receives alerts driven by Teleport metrics. + +The rollout resumes. + +If the issue is related to a specific instance and not the new Teleport version (e.g. VM out of disk space), +the user can instruct teleport to pick 5 new canary instances. + +##### Failure mode 1(b): the new version crashes, but not on the canaries + +This scenario is the same as the previous one but the Teleport agent bug only manifests on select agents. +For example: [the agent fails to read cloud-provider specific metadata and crashes](https://github.com/gravitational/teleport/issues/42312). +This can also be caused by a specific Teleport service crashing. For example, the discovery service is crashing but +all other services are OK. As most instances are running ssh_service, the discovery_service instances are less likely +to get picked. + +The version is deployed to a few instances picked randomly but none of them runs on the affected cloud provider. +The canary instances can update properly and the update is sent to every instance of the group. + +All agents are updated, and all agents hosted on the cloud provider affected by the bug crash. +The updaters of the affected agents will attempt to self-heal by reverting to the previous version. + +Once the previous Teleport version is running, the agents from the affected cloud platform will advertise the update +failed, and they had to rollback. + +If too many agents failed, this will block the group from transitioning from `active` to `done`, protecting the future +groups from the faulty updates. + +##### Failure mode 2(a): the new version crashes, and the old version cannot start + +I create a new deployment with a broken version. The version is deployed to a few instances picked randomly. +Those instances are called the canaries. As the new version has an issue, one or many of those canary instances can't +run the new version. Their updater also fails to revert to the previous version. + +The group update is stuck until the canary comes back online and runs the latest version. + +The customer and Teleport cloud receive an alert. Both customer and Teleport cloud can retrieve the +host id and hostname of the faulty canary instances. With this information they can go troubleshoot the failed agents. + +##### Failure mode 2(b): the new version crashes, and the old version cannot start, but not on the canaries + +This scenario is the same as the previous one but the Teleport agent bug only manifests on select agents. +For example: a clock drift blocks agents from re-connecting to Teleport. + +The canaries might not select one of the affected agent and allow the update to proceed. +All agents are updated, and all agents hosted on the cloud provider affected by the bug crash. +The updater fails to self-heal as the old version does not start anymore. + +If too many agents fail, this will block the group from transitioning from `active` to `done`, protecting the future +groups from the faulty updates. + +In this case, it's hard to identify which agent dropped. + +##### Failure mode 3: shadow failure + +Teleport cloud deploys a new version. Agents from the first group get updated. +The agents are seemingly running properly, but some functions are impaired. +For example, host user creation is failing. + +Some user tries to access a resource served by the agent, it fails and the user +notices the disruption. + +The customer can observe the agent update status and see that a recent update +might have caused this: + +```shell +tctl autoupdate agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v2 +# New version: v3 +# Status: enabled +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev complete YYYY-MM-DD HHh 120 115 2 +# staging in progress (53%) YYYY-MM-D2 HHh 20 10 0 +# prod not started 234 0 0 +``` + +Then, the customer or Teleport Cloud team can suspend the rollout: + +```shell +tctl autoupdate agent suspend +# Automatic updates suspended +# No existing agent will get updated. New agents might install the new version +# depending on their group. +``` + +At this point, no new agent is updated to reduce the service disruption. +The customer can investigate, and get help from Teleport's support via a support ticket. +If the update is really the cause of the issue, the customer or Teleport cloud can perform a rollback: + +```shell +tctl autoupdate agent rollback +# Rolledback groups: [dev, staging] +# Warning: the automatic agent updates are suspended. +# Agents will not rollback until you run: +# $> tctl autoupdate agent resume +``` + +> [!NOTE] +> By default, all groups not in the "unstarted" state are rolledback. +> It is also possible to rollback only specific groups. + +
+After: + +```shell +tctl autoupdate agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v2 +# New version: v3 +# Status: suspended +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev rolledback YYYY-MM-DD HHh 120 115 2 +# staging rolledback YYYY-MM-D2 HHh 20 10 0 +# prod not started 234 0 0 +``` +
+ +Finally, when the user is happy with the new plan, they can resume the updates. +This will trigger the rollback. + +```shell +tctl autoupdate agent resume +``` + +#### As a Teleport user and a Teleport on-call responder, I want to be able to pin a specific Teleport version of an agent to understand if a specific behaviour is caused by a specific Teleport version + +I connect to the server and lookup its status: +```shell +teleport-update status +# Running version v16.2.5 +# Automatic updates enabled. +# Proxy: example.teleport.sh +# Group: staging +``` + +I try to set a specific version: +```shell +teleport-update use-version v16.2.3 +# Error: the instance is enrolled into automatic updates. +# You must specify --disable-automatic-updates to opt this agent out of automatic updates and manually control the version. +``` + +I acknowledge that I am leaving automatic updates: +```shell +teleport-update use-version v16.2.3 --disable-automatic-updates +# Disabling automatic updates. You can re-enable them by running `teleport-update enable` +# Downloading version 16.2.3 +# Restarting teleport +# Cleaning up old binaries +``` + +When the issue is fixed, I can enroll back into automatic updates: + +```shell +teleport-update enable +# Enabling automatic updates +# Proxy: example.teleport.sh +# Group: staging +``` + +#### As a Teleport user I want to fast-track a group update + +I have a new rollout, completely unstarted, and my current maintenance schedule updates over several days. +However, the new version contains something that I need as soon as possible (e.g., a fix for a bug that affects me). + +
+Before: + +```shell +tctl autoupdate agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v2 +# New version: v3 +# Status: enabled +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev not started 120 0 0 +# staging not started 20 0 0 +# prod not started 234 0 0 +``` +
+ +I can trigger the dev group immediately using the command: + +```shell +tctl autoupdate agent start-update dev [--force] +# Dev group update triggered. +``` + +The `--force` flag allows the user to skip progressive deployment mechanism such as canaries or backpressure. + +Alternatively +```shell +tctl autoupdate agent mark-done dev +``` + +
+After: + +```shell +tctl autoupdate agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v2 +# New version: v3 +# Status: enabled +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev not started 120 0 0 +# staging not started 20 0 0 +# prod not started 234 0 0 +``` +
+ +#### As a Teleport user, I want to install a new agent automatically updated + +The manual way: + +```bash +wget https://cdn.teleport.dev/teleport-update-- +chmod +x teleport-update +./teleport-update enable --proxy example.teleport.sh --group production +# Detecting the Teleport version and edition used by cluster "example.teleport.sh" +# Installing the following teleport version: +# Version: 16.2.1 +# Edition: Enterprise +# OS: Linux +# Architecture: x86 +# Teleport installed +# Enabling automatic updates, the agent is part of the "production" update group. +# You can now configure the teleport agent with `teleport configure` or by writing your own `teleport.yaml`. +# When the configuration is done, enable and start teleport by running: +# `systemctl start teleport && systemctl enable teleport` +``` + +The one-liner: + +``` +curl https://cdn.teleport.dev/auto-install | bash -s example.teleport.sh +# Downloading the teleport updater +# Detecting the Teleport version and edition used by cluster "example.teleport.sh" +# Installing the following teleport version: +# Version: 16.2.1 +# Edition: Enterprise +# OS: Linux +# Architecture: x86 +# Teleport installed +# Enabling automatic updates, the agent is part of the "default" update group. +# You can now configure the teleport agent with `teleport configure` or by writing your own `teleport.yaml`. +# When the configuration is finished, enable and start teleport by running: +# `systemctl start teleport && systemctl enable teleport` +``` + +I can also install teleport using the package manager, then enroll the agent into AUs. See the section below: + +#### As a Teleport user I want to enroll my existing agent into AUs + +I have an agent, installed from a package manager or by manually unpacking the tarball. +This agent might or might not be enrolled in the previous automatic update mechanism (apt/yum-based). +I have the teleport updater installed and available in my path. +I run: + +```shell +teleport-update enable --group production +# Detecting the Teleport version and edition used by cluster "example.teleport.sh" +# Installing the following teleport version: +# Version: 16.2.1 +# Edition: Enterprise +# OS: Linux +# Architecture: x86 +# Teleport installed, reloading the service. +# Enabling automatic updates, the agent is part of the "production" update group. +``` + +> [!NOTE] +> The updater saw the teleport unit running and the existing teleport configuration. +> It used the configuration to pick the right proxy address. As teleport is already running, the teleport service is +> reloaded to use the new binary. + +If the agent was previously enrolled into AUs with the old teleport updater package, the `enable` command will also +remove the old package. + +### Teleport Resources + +#### Autoupdate Config + +This resource is owned by the Teleport cluster user. +This is how Teleport customers can specify their automatic update preferences. + +```yaml +kind: autoupdate_config +spec: + # existing field, deprecated + tools_autoupdate: true + tools: + mode: enabled/disabled + agents: + # agent_auto_update allows turning agent updates on or off at the + # cluster level. Only turn agent automatic updates off if self-managed + # agent updates are in place. Setting this to pause will temporarily halt the rollout. + mode: enabled/disabled/suspended + # strategy to use for the rollout + # Supported values are: + # - time-based + # - halt-on-failure + # - halt-on-failure-with-backpressure + # defaults to halt-on-failure, might default to halt-on-failure-with-backpressure after phase 6. + strategy: halt-on-failure + # agent_schedules specifies version rollout schedules for agents. + # The schedule used is determined by the schedule associated + # with the version in the autoupdate_version resource. + # For now, only the "regular" schedule is configurable. + schedules: + regular: + # name of the group. Must only contain valid backend / resource name characters. + - name: staging + # days specifies the days of the week when the group may be updated. + # mandatory value for most Cloud customers: ["Mon", "Tue", "Wed", "Thu"] + # default: ["*"] (all days) + days: [ “Sun”, “Mon”, ... | "*" ] + # start_hour specifies the hour when the group may start upgrading. + # default: 0 + start_hour: 0-23 + # wait_days specifies how many days to wait after the previous group finished before starting. + # This must be 0 when using the `time-based` strategy. + # default: 0 + wait_days: 0-1 + # canary_count specifies the desired number of canaries to update before any other agents + # are updated. + # default: 5 + canary_count: 0-10 + # max_in_flight specifies the maximum number of agents that may be updated at the same time. + # Only valid for the backpressure strategy. + # default: 20% + max_in_flight: 10-100% + # alert_after specifies the duration after which a cluster alert will be set if the group update has + # not completed. + # default: 4 + alert_after_hours: 1-8 + # ... +``` + +Default resource: +```yaml +kind: autoupdate_config +spec: + tools: + mode: enabled + agents: + mode: enabled + strategy: halt-on-failure + alert_after: 4h + schedules: + regular: + - name: default + days: ["Mon", "Tue", "Wed", "Thu"] + start_hour: 0 + canary_count: 5 + max_in_flight: 20% +``` + +#### Autoupdate version + +The `autoupdate_version` spec is owned by the Teleport cluster administrator. +In Teleport Cloud, this is the Cloud operations team. For self-hosted setups this is the user with access to the local +admin socket (tctl on local machine). + +> [!NOTE] +> This is currently an anti-pattern as we are trying to remove the use of the local administrator in Teleport. +> However, Teleport does not provide any role/permission that we can use for Teleport Cloud operations and cannot be +> granted to users. To part with local admin rights, we need a way to have Cloud or admin-only operations. +> This would also improve Cloud team operations by interacting with Teleport API rather than executing local tctl. +> +> Solving this problem is out of the scope of this RFD. + +```yaml +kind: autoupdate_version +spec: + tools: + target_version: vX + agents: + # start_version is the desired version for agents before their window. + start_version: v1 + # target_version is the desired version for agents after their window. + target_version: v2 + # schedule to use for the rollout + schedule: regular + # paused specifies whether the rollout is paused + # default: enabled + mode: enabled|disabled|suspended +``` + +#### Autoupdate agent rollout + +The `autoupdate_agent_rollout` resource is owned by Teleport. This resource can be read by users but not directly applied. +To create and reconcile this resource, the Auth service looks up bot `autoupdate_config` and `autoupdate_version` to know the desired mode, versions, and schedule. +Once the agent rollout is created, the auth uses its status to track the progress of the rollout through the different groups. + +```yaml +kind: autoupdate_agent_rollout +spec: + # content copied from the `autoupdate_version.spec.agents` + version_config: + start_version: v1 + target_version: v2 + schedule: regular + strategy: halt-on-failure + mode: enabled +status: + groups: + # name of group + - name: staging + # start_time is the time the upgrade will start + start_time: 2020-12-09T16:09:53+00:00 + # initial_count is the number of connected agents at the start of the window + initial_count: 432 + # missing_count is the number of agents disconnected since the start of the rollout + present_count: 53 + # failed_count is the number of agents rolled-back since the start of the rollout + failed_count: 23 + # canaries is a list of agents used for canary deployments + canaries: # part of phase 5 + # updater_uuid is the updater UUID + - updater_uuid: abc123-... + # host_uuid is the agent host UUID + host_uuid: def534-... + # hostname of the agent + hostname: foo.example.com + # success status + success: false + # progress is the current progress through the rollout + progress: 0.532 + # state is the current state of the rollout (unstarted, active, done, rollback) + state: active + # last_update_time is the time of the previous update for the group + last_update_time: 2020-12-09T16:09:53+00:00 + # last_update_reason is the trigger for the last update + last_update_reason: rollback +``` + +#### Protobuf + +```protobuf +syntax = "proto3"; + +package teleport.autoupdate.v1; + +import "teleport/header/v1/metadata.proto"; +import "google/protobuf/empty.proto"; +import "google/protobuf/timestamp.proto"; + +option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdate"; + +// CONFIG + +// AutoUpdateConfig is a config singleton used to configure cluster +// autoupdate settings. +message AutoUpdateConfig { + string kind = 1; + string sub_kind = 2; + string version = 3; + teleport.header.v1.Metadata metadata = 4; + + AutoUpdateConfigSpec spec = 5; +} + +// AutoUpdateConfigSpec encodes the parameters of the autoupdate config object. +message AutoUpdateConfigSpec { + reserved 1; + AutoUpdateConfigSpecTools tools = 2; + AutoUpdateConfigSpecAgents agents = 3; +} + +// AutoUpdateConfigSpecTools encodes the parameters of automatic tools update. +message AutoUpdateConfigSpecTools { + // Mode encodes the feature flag to enable/disable tools autoupdates. + Mode mode = 1; +} + +// AutoUpdateConfigSpecTools encodes the parameters of automatic tools update. +message AutoUpdateConfigSpecAgents { + // mode specifies whether agent autoupdates are enabled, disabled, or paused. + Mode agent_auto_update_mode = 1; + // strategy to use for updating the agents. + Strategy strategy = 2; + // maintenance_window_minutes is the maintenance window duration in minutes. This can only be set if `strategy` is "time-based". + int64 maintenance_window_minutes = 3; + // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete. + // This can only be set if `strategy` is "halt-on-failure". + int64 alert_after_hours = 5; + // agent_schedules specifies schedules for updates of grouped agents. + AgentAutoUpdateSchedules agent_schedules = 6; +} + +// Strategy type for the rollout +enum Strategy { + // UNSPECIFIED update strategy + STRATEGY_UNSPECIFIED = 0; + // PREVIOUS_MUST_SUCCEED update strategy with no backpressure + STRATEGY_HALT_ON_FAILURE = 1; + // TIME_BASED update strategy. + STRATEGY_TIME_BASED = 2; +} + +// AgentAutoUpdateSchedules specifies update scheduled for grouped agents. +message AgentAutoUpdateSchedules { + // regular schedules for non-critical versions. + repeated AgentAutoUpdateGroup regular = 1; +} + +// AgentAutoUpdateGroup specifies the update schedule for a group of agents. +message AgentAutoUpdateGroup { + // name of the group + string name = 1; + // days to run update + repeated Day days = 2; + // start_hour to initiate update + int32 start_hour = 3; + // wait_days after last group succeeds before this group can run. This can only be used when the strategy is "halt-on-failure". + int64 wait_days = 4; + // canary_count of agents to use in the canary deployment. + int64 canary_count = 5; + // max_in_flight specifies agents that can be updated at the same time, by percent. + string max_in_flight = 6; +} + +// Day of the week +enum Day { + DAY_UNSPECIFIED = 0; + DAY_ALL = 1; + DAY_SUNDAY = 2; + DAY_MONDAY = 3; + DAY_TUESDAY = 4; + DAY_WEDNESDAY = 5; + DAY_THURSDAY = 6; + DAY_FRIDAY = 7; + DAY_SATURDAY = 8; +} + +// Mode of operation +enum Mode { + // UNSPECIFIED update mode + MODE_UNSPECIFIED = 0; + // DISABLE updates + MODE_DISABLE = 1; + // ENABLE updates + MODE_ENABLE = 2; + // PAUSE updates + MODE_PAUSE = 3; +} + +// Schedule type for the rollout +enum Schedule { + // UNSPECIFIED update schedule + SCHEDULE_UNSPECIFIED = 0; + // REGULAR update schedule + SCHEDULE_REGULAR = 1; + // IMMEDIATE update schedule for updating all agents immediately + SCHEDULE_IMMEDIATE = 2; +} + +// VERSION + +// AutoUpdateVersion is a resource singleton with version required for +// tools autoupdate. +message AutoUpdateVersion { + string kind = 1; + string sub_kind = 2; + string version = 3; + teleport.header.v1.Metadata metadata = 4; + + AutoUpdateVersionSpec spec = 5; +} + +// AutoUpdateVersionSpec encodes the parameters of the autoupdate versions. +message AutoUpdateVersionSpec { + // ToolsVersion is the semantic version required for tools autoupdates. + reserved 1; + AutoUpdateVersionSpecTools tools = 2; + AutoUpdateVersionSpecAgents agents = 3; +} + +// AutoUpdateVersionSpecTools is the spec for the autoupdate version. +message AutoUpdateVersionSpecTools { + // target_version is the target tools version. + string target_version = 1; +} + +// AutoUpdateVersionSpecAgents is the spec for the autoupdate version. +message AutoUpdateVersionSpecAgents { + // start_version is the version to update from. + string start_version = 1; + // target_version is the version to update to. + string target_version = 2; + // schedule to use for the rollout + Schedule schedule = 3; + // autoupdate_mode to use for the rollout + Mode autoupdate_mode = 4; +} + +// AGENT ROLLOUT + +message AutoUpdateAgentRollout { + string kind = 1; + string sub_kind = 2; + string version = 3; + teleport.header.v1.Metadata metadata = 4; + AutoUpdateAgentRolloutSpec spec = 5; + AutoUpdateAgentRolloutStatus status = 6; +} + +message AutoUpdateAgentRolloutSpec { + // start_version is the version to update from. + string start_version = 1; + // target_version is the version to update to. + string target_version = 2; + // schedule to use for the rollout + Schedule schedule = 3; + // autoupdate_mode to use for the rollout + Mode autoupdate_mode = 4; + // strategy to use for updating the agents. + Strategy strategy = 5; +} + +message AutoUpdateAgentRolloutStatus { + repeated AutoUpdateAgentRolloutStatusGroup groups = 1; +} + +message AutoUpdateAgentRolloutStatusGroup { + // name of the group + string name = 1; + // start_time of the rollout + google.protobuf.Timestamp start_time = 2; + // initial_count is the number of connected agents at the start of the window. + int64 initial_count = 3; + // present_count is the current number of connected agents. + int64 present_count = 4; + // failed_count specifies the number of failed agents. + int64 failed_count = 5; + // canaries is a list of canary agents. + repeated Canary canaries = 6; + // progress is the current progress through the rollout. + float progress = 7; + // state is the current state of the rollout. + State state = 8; + // last_update_time is the time of the previous update for this group. + google.protobuf.Timestamp last_update_time = 9; + // last_update_reason is the trigger for the last update + string last_update_reason = 10; +} + +// Canary agent +message Canary { + // update_uuid of the canary agent + string update_uuid = 1; + // host_uuid of the canary agent + string host_uuid = 2; + // hostname of the canary agent + string hostname = 3; + // success state of the canary agent + bool success = 4; +} + +// State of the rollout +enum State { + // UNSPECIFIED state + STATE_UNSPECIFIED = 0; + // UNSTARTED state + STATE_UNSTARTED = 1; + // CANARY state + STATE_CANARY = 2; + // ACTIVE state + STATE_ACTIVE = 3; + // DONE state + STATE_DONE = 4; + // ROLLEDBACK state + STATE_ROLLEDBACK = 5; +} + +// AutoUpdateService provides an API to manage autoupdates. +service AutoUpdateService { + // GetAutoUpdateConfig gets the current autoupdate config singleton. + rpc GetAutoUpdateConfig(GetAutoUpdateConfigRequest) returns (AutoUpdateConfig); + + // CreateAutoUpdateConfig creates a new AutoUpdateConfig. + rpc CreateAutoUpdateConfig(CreateAutoUpdateConfigRequest) returns (AutoUpdateConfig); + + // CreateAutoUpdateConfig updates AutoUpdateConfig singleton. + rpc UpdateAutoUpdateConfig(UpdateAutoUpdateConfigRequest) returns (AutoUpdateConfig); + + // UpsertAutoUpdateConfig creates a new AutoUpdateConfig or replaces an existing AutoUpdateConfig. + rpc UpsertAutoUpdateConfig(UpsertAutoUpdateConfigRequest) returns (AutoUpdateConfig); + + // DeleteAutoUpdateConfig hard deletes the specified AutoUpdateConfig. + rpc DeleteAutoUpdateConfig(DeleteAutoUpdateConfigRequest) returns (google.protobuf.Empty); + + // GetAutoUpdateVersion gets the current autoupdate version singleton. + rpc GetAutoUpdateVersion(GetAutoUpdateVersionRequest) returns (AutoUpdateVersion); + + // CreateAutoUpdateVersion creates a new AutoUpdateVersion. + rpc CreateAutoUpdateVersion(CreateAutoUpdateVersionRequest) returns (AutoUpdateVersion); + + // UpdateAutoUpdateVersion updates AutoUpdateVersion singleton. + rpc UpdateAutoUpdateVersion(UpdateAutoUpdateVersionRequest) returns (AutoUpdateVersion); + + // UpsertAutoUpdateVersion creates a new AutoUpdateVersion or replaces an existing AutoUpdateVersion. + rpc UpsertAutoUpdateVersion(UpsertAutoUpdateVersionRequest) returns (AutoUpdateVersion); + + // DeleteAutoUpdateVersion hard deletes the specified AutoUpdateVersionRequest. + rpc DeleteAutoUpdateVersion(DeleteAutoUpdateVersionRequest) returns (google.protobuf.Empty); + + // GetAutoUpdateAgentRollout gets the current autoupdate version singleton. + rpc GetAutoUpdateAgentRollout(GetAutoUpdateAgentRolloutRequest) returns (AutoUpdateAgentRollout); + + // CreateAutoUpdateAgentRollout creates a new AutoUpdateAgentRollout. + rpc CreateAutoUpdateAgentRollout(CreateAutoUpdateAgentRolloutRequest) returns (AutoUpdateAgentRollout); + + // UpdateAutoUpdateAgentRollout updates AutoUpdateAgentRollout singleton. + rpc UpdateAutoUpdateAgentRollout(UpdateAutoUpdateAgentRolloutRequest) returns (AutoUpdateAgentRollout); + + // UpsertAutoUpdateAgentRollout creates a new AutoUpdateAgentRollout or replaces an existing AutoUpdateAgentRollout. + rpc UpsertAutoUpdateAgentRollout(UpsertAutoUpdateAgentRolloutRequest) returns (AutoUpdateAgentRollout); + + // DeleteAutoUpdateAgentRollout hard deletes the specified AutoUpdateAgentRolloutRequest. + rpc DeleteAutoUpdateAgentRollout(DeleteAutoUpdateAgentRolloutRequest) returns (google.protobuf.Empty); + + // TriggerAgentGroup changes the state of an agent group from `unstarted` to `active` or `canary`. + rpc TriggerAgentGroup(TriggerAgentGroupRequest) returns (AutoUpdateAgentRollout); + // ForceAgentGroup changes the state of an agent group from `unstarted`, `canary`, or `active` to the `done` state. + rpc ForceAgentGroup(ForceAgentGroupRequest) returns (AutoUpdateAgentRollout); + // ResetAgentGroup resets the state of an agent group. + // For `canary`, this means new canaries are picked + // For `active`, this means the initial instance count is computed again. + rpc ResetAgentGroup(ResetAgentGroupRequest) returns (AutoUpdateAgentRollout); + // RollbackAgentGroup changes the state of an agent group to `rolledback`. + rpc RollbackAgentGroup(RollbackAgentGroupRequest) returns (AutoUpdateAgentRollout); +} + +// Request for GetAutoUpdateConfig. +message GetAutoUpdateConfigRequest {} + +// Request for CreateAutoUpdateConfig. +message CreateAutoUpdateConfigRequest { + AutoUpdateConfig config = 1; +} + +// Request for UpdateAutoUpdateConfig. +message UpdateAutoUpdateConfigRequest { + AutoUpdateConfig config = 1; +} + +// Request for UpsertAutoUpdateConfig. +message UpsertAutoUpdateConfigRequest { + AutoUpdateConfig config = 1; +} + +// Request for DeleteAutoUpdateConfig. +message DeleteAutoUpdateConfigRequest {} + +// Request for GetAutoUpdateVersion. +message GetAutoUpdateVersionRequest {} + +// Request for CreateAutoUpdateVersion. +message CreateAutoUpdateVersionRequest { + AutoUpdateVersion version = 1; +} + +// Request for UpdateAutoUpdateConfig. +message UpdateAutoUpdateVersionRequest { + AutoUpdateVersion version = 1; +} + +// Request for UpsertAutoUpdateVersion. +message UpsertAutoUpdateVersionRequest { + AutoUpdateVersion version = 1; +} + +// Request for DeleteAutoUpdateVersion. +message DeleteAutoUpdateVersionRequest {} + +// Request for GetAutoUpdateAgentRollout. +message GetAutoUpdateAgentRolloutRequest {} + +// Request for CreateAutoUpdateAgentRollout. +message CreateAutoUpdateAgentRolloutRequest { + AutoUpdateAgentRollout plan = 1; +} + +// Request for UpdateAutoUpdateConfig. +message UpdateAutoUpdateAgentRolloutRequest { + AutoUpdateAgentRollout plan = 1; +} + +// Request for UpsertAutoUpdateAgentRollout. +message UpsertAutoUpdateAgentRolloutRequest { + AutoUpdateAgentRollout plan = 1; +} + +// Request for DeleteAutoUpdateAgentRollout. +message DeleteAutoUpdateAgentRolloutRequest {} + +message TriggerAgentGroupRequest { + // group is the agent update group name whose maintenance should be triggered. + string group = 1; + // desired_state describes the desired start state. + // Supported values are STATE_UNSPECIFIED, STATE_CANARY, and STATE_ACTIVE. + // When left empty, defaults to canary if they are supported. + State desired_state = 2; +} + +message ForceAgentGroupRequest { + // group is the agent update group name whose state should be forced to `done`. + string group = 1; +} + +message ResetAgentGroupRequest { + // group is the agent update group name whose state should be reset. + string group = 1; +} + +message RollbackAgentGroupRequest { + // group is the agent update group name whose state should change to `rolledback`. + string group = 1; +} +``` + +### Backend logic to progress the rollout + +#### Rollout strategies + +We support two rollout strategies, for two distinct use-cases: + +- `halt-on-failure` for damage reduction of a faulty update +- `time-based` for time-constrained maintenances + +In `halt-on-failure`, the update proceeds from the first group to the last group, ensuring that each group +successfully updates before allowing the next group to proceed. By default, only 5 agent groups are allowed. This +mitigates very long rollout plans. This is the strategy that offers the best availability. A group finishes its update +once most of its agents are running the correct version. Agents that missed the group update will try to catch +back as soon as possible. + +In `time-based` maintenances, agents update as soon as their maintenance window starts. There is no dependency +between groups. This strategy allows Teleport users to setup reliable follow-the-sun updates and enforce the +maintenance window more strictly. A group finishes its update at the end of the maintenance window, regardless +of the new version adoption rate. Agents that missed the maintenance window will not attempt to +update until the next maintenance window. + +After phase 6, a third strategy, `backpressure` will be added. This strategy will behave the same way `halt-on-failure` +does, except the agents will be progressively rolled-out within a group. + +#### Agent update mode + +The agent auto update mode is specified by both Cloud (via `autoupdate_version`) +and by the customer (via `autoupdate_config`). The agent update mode controls whether +the cluster in enrolled into automatic agent updates. + +The agent update mode can take 3 values: + +1. disabled: teleport should not manage agent updates +2. paused: the updates are temporarily suspended, we honour the existing rollout state +3. enabled: teleport can update agents + +The cluster agent rollout mode is computed by taking the lowest value. +For example: + +- Cloud says `enabled` and the customer says `enabled` -> the updates are `enabled` +- Cloud says `enabled` and the customer says `suspended` -> the updates are `suspended` +- Cloud says `disabled` and the customer says `suspended` -> the updates are `disabled` +- Cloud says `disabled` and the customer says `enabled` -> the updates are `disabled` + +The Teleport cluster only progresses the rollout if the mode is `enabled`. + +#### Group States + +Let `v1` be the previous version and `v2` the target version. + +A group can be in 5 states: +- `unstarted`: the group update has not been started yet. +- `canary`: a few canaries are getting updated. New agents should run `v1`. Existing agents should not attempt to update + and keep their existing version. +- `active`: the group is actively getting updated. New agents should run `v2`, existing agents are instructed to update + to `v2`. +- `done`: the group has been updated. New agents should run `v2`. +- `rolledback`: the group has been rolledback. New agents should run `v1`, existing agents should update to `v1`. + +The finite state machine for the `halt-on-failure` is the following: + +```mermaid +flowchart TD + unstarted((unstarted)) + canary((canary)) + active((active)) + done((done)) + rolledback((rolledback)) + + unstarted -->|TriggerGroupRPC
Start conditions are met| canary + canary -->|Canaries came back alive| active + canary -->|ForceGroupRPC| done + canary -->|RollbackGroupRPC| rolledback + active -->|ForceGroupRPC
Success criteria met| done + done -->|RollbackGroupRPC| rolledback + active -->|RollbackGroupRPC| rolledback + + canary -->|ResetGroupRPC| canary + active -->|ResetGroupRPC| active +``` + +The finite state machine for the `time-based` is the following: +```mermaid +flowchart TD + unstarted((unstarted)) + canary((canary)) + active((active)) + done((done)) + rolledback((rolledback)) + + unstarted -->|TriggerGroupRPC
Start conditions are met| canary + canary -->|Canaries came back alive and window is still active| active + canary -->|ForceGroupRPC
Canaries came back alive and window is over| done + canary -->|RollbackGroupRPC| rolledback + active -->|ForceGroupRPC
End of window| done + done -->|Beginning of window| active + done -->|RollbackGroupRPC| rolledback + active -->|RollbackGroupRPC| rolledback + + canary -->|ResetGroupRPC| canary +``` + + +> [!NOTE] +> Once we have a proper feedback mechanism (phase 5) we might introduce a new `unfinished` state, similar to done, but +> which indicates that not all agents got updated when using the `time-based` strategy. This does not change the update +> logic but might be clearer for the end user. + +#### Starting a group + +A group can be started if the following criteria are met +- for the `halt-on-failure` strategy: + - all of its previous group are in the `done` state + - it has been at least `wait_days` since the previous group update started + - the current week day is in the `days` list + - the current hour equals the `hour` field +- for the `time-based` strategy: + - the current week day is in the `days` list + - the current hour equals the `hour` field + +When all those criteria are met, the auth will transition the group into a new state. +If `canary_count` is not null, the group transitions to the `canary` state. +Else it transitions to the `active` state. + +In phase 4, at the start of a group rollout, the Teleport auth servers record the initial number connected agents. +The number of updated and non-updated agents is tracked by the auth servers. This will be used later to evaluate the +update success criteria. + +#### Canary testing (phase 5) + +A group in `canary` state will be randomly assigned `canary_count` canary agents. +Auth servers will select those canaries by reading them from the auth instance inventory and writing them to the `canaries` list in `agent_rollout_plan` status. +The proxies will instruct those canaries to update immediately. +During each reconciliation loop, the auth will lookup the instance heartbeat of each canary in the backend and update `agent_rollout_plan` status if needed. + +Once all canaries have a heartbeat containing the new version (the heartbeat must not be older than 20 minutes), +they successfully came back online and the group can transition to the `active` state. + +If canaries never update, report rollback, or disappear, the group will stay stuck in `canary` state. +An alert will eventually fire, warning the user about the stuck update. + +> [!NOTE] +> In the first version, canary selection will happen randomly. As most instances are running the ssh_service and not +> the other ones, we are less likely to catch an issue in a less common service. +> An optimisation would be to try to pick canaries maximizing the service coverage. +> This would make the test more robust and provide better availability guarantees. + +#### Updating a group + +A group in `active` mode is currently being updated. The conditions to leave `active` mode and transition to the +`done` mode will vary based on the phase and rollout strategy. + +- for the `halt-on-failure` strategy: + - Phase 3: we don't have any information about agents. The group transitions to `done` 60 minutes after its start. + - Phase 4: we know about the connected agent count and the connected agent versions. The group transitions to `done` if: + - at least `(100 - max_in_flight)%` of the agents are still connected + - at least `(100 - max_in_flight)%` of the agents are running the new version + - Phase 6: we incrementally update the progress, this adds a new criteria: the group progress is at 100% +- for the `time-based` strategy: + - the group transitions to the `done` state `maintenance_window_minutes` minutes after the `active` transition. + The rollout's `start_time` must be used to do this transition, not the schedule's `start_hour`. + This will allow the user to trigger manual out-of-maintenance updates if needed. + +The phase 6 backpressure calculations are covered in the Backpressure Calculations section below.. + +### Manually interacting with the rollout + +For user: +```shell +tctl autoupdate agent suspend/resume +tctl autoupdate agent enable/disable + +tctl autoupdate agent status +tctl autoupdate agent status + +tctl autoupdate agent start [--no-canary] +tctl autoupdate agent force +tctl autoupdate agent reset + +tctl autoupdate agent rollback [|--all] +``` + +For admin +```shell +tctl autoupdate agent-plan target [--previous-version ] +tctl autoupdate agent-plan enable/disable +tctl autoupdate agent-plan suspend/resume +``` + +### Editing the plan + +The updater will receive `agent_auto_update: true` from the time is it designated for update until the `target_version` in `autoupdate_version` (below) changes. +Changing the `target_version` resets the schedule immediately, clearing all progress. + +[TODO: What is the use-case for this? can we do like with target_version and reset all instead of trying to merge the state] +Changing the `start_version` in `autoupdate_version` changes the advertised `start_version` for all unfinished groups. + +Changing `agent_schedules` will preserve the `state` of groups that have the same name before and after the change. +However, any changes to `agent_schedules` that occur while a group is active will be rejected. + +Releasing new agent versions multiple times a week has the potential to starve dependent groups from updates. + +Note that the `default` group applies to agents that do not specify a group name. +If a `default` group is not present, the last group is treated as the default. + +### Updater APIs + +#### Update requests + +Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. +The version served from that endpoint will be configured using new `autoupdate_version` resource. + +Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_auto_update` field) is +dependent on: +- The `host=[uuid]` parameter sent to `/v1/webapi/find` +- The `group=[name]` parameter sent to `/v1/webapi/find` +- The group state from the `autoupdate_agent_rollout` status (this also contains the version from `autoupdate_version`) + +To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via +unauthenticated requests to `/v1/webapi/find`. Teleport proxies modulate the `/v1/webapi/find` response given the host +UUID and group name. + +When the updater queries the proxy via `/v1/webapi/find?host=[uuid]&group=[name]`, the proxies query the +`autoupdate_agent_rollout` to determine the value of `agent_auto_update: true`. +The boolean is returned as `true` in the case that the provided `host` contains a UUID that is under the progress +percentage for the `group`: +`as_numeral(host_uuid) / as_numeral(max_uuid) < progress` + +The returned JSON looks like: + +`/v1/webapi/find?host=[uuid]&group=[name]` +```json +{ + "server_edition": "enterprise", + "auto_update": { + "agent_version": "15.1.1", + "agent_auto_update": true, + "agent_update_jitter_seconds": 10 + }, + // ... +} +``` + +Notes: + +- Agents will only update if `agent_auto_update` is `true`, but new installations will use `agent_version` regardless of + the value in `agent_auto_update`. +- The edition served is the cluster edition (enterprise, enterprise-fips, or oss) and cannot be configured. +- The group name is read from `/var/lib/teleport/versions/update.yaml` by the updater. +- The UUID is read from `/tmp/teleport_update_uuid`, which `teleport-update` regenerates when missing. +- the jitter is served by the teleport cluster and depends on the rollout strategy (60s by default, 10s when using + the backpressure strategy). + +Let `v1` be the previous version and `v2` the target version, the response matrix is the following: + +##### Rollout status: disabled + +| Group state | Version | Should update | +|-------------|---------|---------------| +| * | v2 | false | + +##### Rollout status: paused + +| Group state | Version | Should update | +|-------------|---------|---------------| +| unstarted | v1 | false | +| canary | v1 | false | +| active | v2 | false | +| done | v2 | false | +| rolledback | v1 | false | + +##### Rollout status: enabled + +| Group state | Version | Should update | +|-------------|---------|--------------------------------------------------| +| unstarted | v1 | false | +| canary | v1 | false, except for canaries | +| active | v2 | true if UUID <= progress | +| done | v2 | true if `halt-on-failure`, false if `time-based` | +| rolledback | v1 | true | + +#### Updater status reporting + +The updater reports status through the agent. The agent has two ways of reporting the update information: +- via instance heartbeats +- via the hello message, when registering against an auth server + +Instance heartbeat happen infrequently, based on the cluster size they can take up to 17 minutes to happen. +However, they are exposed to the user via existing `tctl inventory` method and will allow users to query which instance +is running which version and belongs to which group. + +Hello messages are sent on connection and are used to build the serve's local inventory. +This information is available almost instantaneously after the connection and can be cheaply queried by the auth ( +everything is in memory). The inventory is then used to count the local instances and drive the rollout. + +Both instance heartbeats and Hello merssages will be extended to incorporate and send data that is written to +`/var/lib/teleport/versions/update.yaml` and `/tmp/teleport_update_uuid` by the `teleport-update` binary. + +The following data related to the update is sent by the agent: +- `agent_update_start_time`: timestamp of individual agent's upgrade time +- `agent_update_start_version`: current agent version +- `agent_update_rollback`: whether the agent was rolled-back automatically +- `agent_update_uuid`: Auto-update UUID +- `agent_update_group`: Auto-update group name + +Auth servers use their local instance inventory to calculate rollout statistics and write them to `/autoupdate/[group]/[auth ID]` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`). + +Every minute, auth servers persist the version counts: +- `agent_data[group].stats[version]` + - `count`: number of currently connected agents at `version` in `group` + - `failed_count`: number of currently connected agents at `version` in `group` that experienced a rollback or inability to upgrade + - `lowest_uuid`: lowest UUID of all currently connected agents at `version` in `group` + - `count`: number of connected agents at `version` in `group` at start of window +- `agent_data[group]` + - `canaries`: list of updater UUIDs to use for canary deployments + +Expiration time of the persisted key is 1 hour. + +To progress the rollout, auth servers will range-read keys from `/autoupdate/[group]/*`, sum the counts, and write back to the `autoupdate_agent_rollout` status on a one-minute interval. +- To calculate the initial number of agents connected at the start of the window, each auth server will write the summed count of agents to `autoupdate_agent_rollout` status, if not already written. +- To calculate the canaries, each auth server will write a random selection of all canaries to `autoupdate_agent_rollout` status, if not already written. +- To determine the progress through the rollout, auth servers will write the calculated progress to the `autoupdate_agent_rollout` status using the formulas, declining to write if the current written progress is further ahead. + +If `/autoupdate/[group]/[auth ID]` is older than 1 minute, we do not consider its contents. +This prevents double-counting agents when auth servers are killed. + +#### Backpressure Calculations + +Given: +``` +initial_count[group] = sum(agent_data[group].stats[*]).count +``` + +Each auth server will calculate the progress as +`( max_in_flight * initial_count[group] + agent_data[group].stats[target_version].count ) / initial_count[group]` and +write the progress to `autoupdate_agent_rollout` status. This formula determines the progress percentage by adding a +`max_in_flight` percentage-window above the number of currently updated agents in the group. + +However, if `as_numeral(agent_data[group].stats[not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the +calculated progress, that progress value will be used instead. This protects against a statistical deadlock, where no +UUIDs fall within the next `max_in_flight` window of UUID space, by always permitting the next non-updated agent to +update. + +To ensure that the rollout is halted if more than `max_in_flight` un-updated agents drop off, an addition restriction +must be imposed for the rollout to proceed: +`agent_data[group].stats[*].count > initial_count[group] - max_in_flight * initial_count[group]` + +To prevent double-counting of agents when considering all counts across all auth servers, only agents connected for one +minute will be considered in these formulas. + +### Linux Agents + +We will ship a new auto-updater binary for Linux servers written in Go that does not interface with the system package manager. +It will be distributed within the existing `teleport` packages, and additionally, in a dedicated `teleport-update-vX.Y.Z.tgz` tarball. +It will manage the installation of the correct Teleport agent version manually. + +It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified update plan. +It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. + +Source code for the updater will live in the main Teleport repository, with the updater binary built from `tools/teleport-update`. + +#### Installation + +Package-initiated install: +```shell +$ apt-get install teleport +$ teleport-update enable --proxy example.teleport.sh + +# if not enabled already, configure teleport and: +$ systemctl enable teleport +``` + +Packageless install: +```shell +$ curl https://cdn.teleport.dev/teleport-update.tgz | tar xzf +$ ./teleport-update enable --proxy example.teleport.sh + +# if not enabled already, configure teleport and: +$ systemctl enable teleport +``` + +For grouped updates, a group identifier may be configured: +```shell +$ teleport-update enable --proxy example.teleport.sh --group staging +``` + +For air-gapped Teleport installs, the agent may be configured with a custom tarball path template: +```shell +$ teleport-update enable --proxy example.teleport.sh --template 'https://example.com/teleport-{{ .Edition }}-{{ .Version }}-{{ .Arch }}.tgz' +``` +(Checksum will use template path + `.sha256`) + +For Teleport installs with custom data directories, the data directory must be specified on each binary invocation: +```shell +$ teleport-update enable --proxy example.teleport.sh --data-dir /var/lib/teleport +``` + +For managing multiple Teleport installs, the install suffix must be specified on each binary invocation: +```shell +$ teleport-update enable --proxy example.teleport.sh --install-suffix clusterA +``` +This will create suffixed directories for binaries (`/usr/local/teleport/clusterA/bin`) and systemd units (`teleport-clusterA`). + + +#### Filesystem + +For a default install, without --install-suffix: +``` +$ tree /var/lib/teleport +/var/lib/teleport +└── versions + ├── 15.0.0 + │ ├── bin + │ │ ├── tsh + │ │ ├── tbot + │ │ ├── ... # other binaries + │ │ ├── teleport-update + │ │ └── teleport + │ ├── etc + │ │ └── systemd + │ │ └── teleport.service + │ └── backup + │ ├── sqlite.db + │ └── backup.yaml + ├── 15.1.1 + │ ├── bin + │ │ ├── tsh + │ │ ├── tbot + │ │ ├── ... # other binaries + │ │ ├── teleport-update + │ │ └── teleport + │ └── etc + │ └── systemd + │ └── teleport.service + └── update.yaml + +$ ls -l /usr/local/bin/tsh +/usr/local/bin/tsh -> /var/lib/teleport/versions/15.0.0/bin/tsh +$ ls -l /usr/local/bin/tbot +/usr/local/bin/tbot -> /var/lib/teleport/versions/15.0.0/bin/tbot +$ ls -l /usr/local/bin/teleport +/usr/local/bin/teleport -> /var/lib/teleport/versions/15.0.0/bin/teleport +$ ls -l /usr/local/bin/teleport-update +/usr/local/bin/teleport-update -> /var/lib/teleport/versions/15.0.0/bin/teleport-update +$ ls -l /usr/local/lib/systemd/system/teleport.service +/usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service +``` + +With --install-suffix clusterA: +``` +$ tree /var/lib/teleport/install/clusterA +/var/lib/teleport/install/clusterA +└── versions + ├── 15.0.0 + │ ├── bin + │ │ ├── tsh + │ │ ├── tbot + │ │ ├── ... # other binaries + │ │ ├── teleport-update + │ │ └── teleport + │ ├── etc + │ │ └── systemd + │ │ └── teleport.service + │ └── backup + │ ├── sqlite.db + │ └── backup.yaml + ├── 15.1.1 + │ ├── bin + │ │ ├── tsh + │ │ ├── tbot + │ │ ├── ... # other binaries + │ │ ├── teleport-update + │ │ └── teleport + │ └── etc + │ └── systemd + │ └── teleport.service + └── update.yaml + +/var/lib/teleport +└── versions + ├── system # if installed via OS package + ├── bin + │ ├── tsh + │ ├── tbot + │ ├── ... # other binaries + │ ├── teleport-update + │ └── teleport + └── etc + └── systemd + └── teleport.service + +$ ls -l /usr/local/bin/tsh +/usr/local/teleport/clusterA/bin/tsh -> /var/lib/teleport/install/clusterA/versions/15.0.0/bin/tsh +$ ls -l /usr/local/bin/tbot +/usr/local/teleport/clusterA/bin/tbot -> /var/lib/teleport/install/clusterA/versions/15.0.0/bin/tbot +$ ls -l /usr/local/bin/teleport +/usr/local/teleport/clusterA/bin/teleport -> /var/lib/teleport/install/clusterA/versions/15.0.0/bin/teleport +$ ls -l /usr/local/bin/teleport-update +/usr/local/teleport/clusterA/bin/teleport-update -> /var/lib/teleport/install/clusterA/versions/15.0.0/bin/teleport-update +$ ls -l /usr/local/lib/systemd/system/teleport-clusterA.service +/usr/local/lib/systemd/system/teleport-clutserA.service -> /var/lib/teleport/install/clusterA/versions/15.0.0/etc/systemd/teleport.service +``` + +##### update.yaml + +This file stores configuration for `teleport-update`. + +All updates are applied atomically using renameio. + +``` +version: v1 +kind: update_config +spec: + # proxy specifies the Teleport proxy address to retrieve the agent version and update configuration from. + proxy: mytenant.teleport.sh + # group specifies the update group + group: staging + # url_template specifies a custom URL template for downloading Teleport. + # url_template: "" + # enabled specifies whether auto-updates are enabled, i.e., whether teleport-update update is allowed to update the agent. + enabled: true +status: + # start_time specifies the start time of the most recent update. + start_time: 2020-12-09T16:09:53+00:00 + # active_version specifies the active (symlinked) deployment of the teleport agent. + active_version: 15.1.1 + # version_history specifies the previous deployed versions, in order by recency. + version_history: ["15.1.3", "15.0.4"] + # rollback specifies whether the most recent version was deployed by an automated rollback. + rollback: true + # error specifies the last error encounted + error: "" +``` + +##### backup.yaml + +This file stores metadata about an individual backup of the Teleport agent's sqlite DB. + +``` +version: v1 +kind: db_backup +spec: + # proxy address from the backup + proxy: mytenant.teleport.sh + # version from the backup + version: 15.1.0 + # time the backup was created + creation_time: 2020-12-09T16:09:53+00:00 +``` + +#### Runtime + +The `teleport-update` binary will run as a periodically executing systemd service which runs every 10 minutes. +The systemd service will run: +```shell +$ teleport-update update +``` + +After it is installed, the `update` subcommand will no-op when executed until configured with the `teleport-update` command: +```shell +$ teleport-update enable --proxy mytenant.teleport.sh --group staging +``` + +If the proxy address is not provided with `--proxy`, the current proxy address from `teleport.yaml` is used, if present. + +The `enable` subcommand will change the behavior of `teleport-update update` to update teleport and restart the existing agent, if running. +It will also run update teleport immediately, to ensure that subsequent executions succeed. + +Both `update` and `enable` will maintain a shared lock file preventing any re-entrant executions. + +The `enable` subcommand will: +1. If an updater-incompatible version of the Teleport package is installed, fail immediately. +2. Query the `/v1/webapi/find` endpoint. +3. If the current updater-managed version of Teleport is the latest, jump to (15). +4. Ensure there is enough free disk space to update Teleport via `unix.Statfs()` and `content-length` header from `HEAD` request. +5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. +6. Download and verify the checksum (tarball URL suffixed with `.sha256`). +7. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. +8. Verify that the downloaded binaries are valid executables on the host. +9. Replace any existing binaries or symlinks with symlinks to the current version. +10. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. +11. Restart the agent if the systemd service is already enabled. +12. Set `active_version` in `update.yaml` if successful or not enabled. +13. Replace the symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. +14. Remove all stored versions of the agent except the current version and last working version. +15. Configure `update.yaml` with the current proxy address and group, and set `enabled` to true. + +The `disable` subcommand will: +1. Configure `update.yaml` to set `enabled` to false. + +When `update` subcommand is otherwise executed, it will: +1. Check `update.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. +2. Query the `/v1/webapi/find` endpoint. +3. Check that `agent_auto_updates` is true, quit otherwise. +4. If the current version of Teleport is the latest, quit. +5. Wait `random(0, agent_update_jitter_seconds)` seconds. +6. Ensure there is enough free disk space to update Teleport via `unix.Statfs()` and `content-length` header from `HEAD` request. +7. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. +8. Download and verify the checksum (tarball URL suffixed with `.sha256`). +9. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. +10. Verify that the downloaded binaries are valid executables on the host. +11. Update symlinks to point at the new version. +12. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. +13. Restart the agent if the systemd service is already enabled. +14. Set `active_version` in `update.yaml` if successful or not enabled. +15. Replace the old symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. +16. Remove all stored versions of the agent except the current version and last working version. + +To guarantee auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-update` at that version if present and different. +The `/usr/local/bin/teleport-update` symlink will take precedence to avoid reexec in most scenarios. + +To ensure that SELinux permissions do not prevent the `teleport-update` binary from installing/removing Teleport versions, the updater package will configure SELinux contexts to allow changes to all required paths. + +To ensure that backups are consistent, the updater will use the [SQLite backup API](https://www.sqlite.org/backup.html) to perform the backup. + +The `teleport` apt and yum packages will contain a system installation of Teleport in `/usr/local/teleport-system/`. +Post package installation, the `link` subcommand is executed automatically to link the system installation when no auto-updater-managed version of Teleport is linked: +``` +/usr/local/bin/teleport -> /usr/local/teleport-system/bin/teleport +/usr/local/bin/teleport-update -> /usr/local/teleport-system/bin/teleport-update +... +``` + +#### Failure Conditions + +If the new version of Teleport fails to start, the installation of Teleport is reverted as described above. + +If `teleport-update` itself fails with an error, and an older version of `teleport-update` is available, the update will retry with the older version. + +If the agent losses its connection to the proxy, `teleport-update` updates the agent to the group's current desired version immediately. + +Known failure conditions caused by intentional configuration (e.g., updates disabled) will not trigger retry logic. + +#### Status + +To retrieve known information about agent updates, the `status` subcommand will return the following: +```json +{ + "agent_version_installed": "15.1.1", + "agent_version_desired": "15.1.2", + "agent_version_previous": "15.1.0", + "agent_update_time_last": "2020-12-10T16:00:00+00:00", + "agent_update_time_jitter": 600, + "agent_updates_enabled": true +} +``` + +### Downgrades + +Downgrades may be necessary in cases where we have rolled out a bug or security vulnerability with critical impact. +To initiate a downgrade, `agent_version` is set to an older version than it was previously set to. + +Downgrades are challenging, because `sqlite.db` used by newer version of Teleport may not be valid for older versions of Teleport. + +When Teleport is downgraded to a previous version that has a backup of `sqlite.db` present in `/var/lib/teleport/versions/OLD-VERSION/backup/`: +1. `/var/lib/teleport/versions/OLD-VERSION/backup/backup.yaml` is validated to determine if the backup is usable (proxy and version must match, age must be less than cert lifetime, etc.) +2. If the backup is valid, Teleport is fully stopped, the backup is restored along with symlinks, and the downgraded version of Teleport is started. +3. If the backup is invalid, we refuse to downgrade. + +Downgrades are applied with `teleport-update update`, just like upgrades. +The above steps modulate the standard workflow in the section above. +If the downgraded version is already present, the uncompressed version is used to ensure fast recovery of the exact state before the failed upgrade. +To ensure that the target version is was not corrupted by incomplete extraction, the downgrade checks for the existence of `/var/lib/teleport/versions/TARGET-VERSION/sha256` before downgrading. +To ensure that the DB backup was not corrupted by incomplete copying, the downgrade checks for the existence of `/var/lib/teleport/versions/TARGET-VERSION/backup/backup.yaml` before restoring. + +Teleport must be fully-stopped to safely replace `sqlite.db`. +When restarting the agent during an upgrade, `SIGHUP` is used. +When restarting the agent during a downgrade, `systemd stop/start` are used before/after the downgrade. + +Teleport CA certificate rotations will break rollbacks. +In the future, this could be addressed with additional validation of the agent's client certificate issuer fingerprints. +For now, rolling forward will allow recovery from a broken rollback. + +Given that rollbacks may fail, we must maintain the following invariants: +1. Broken rollbacks can always be reverted by reversing the rollback exactly. +2. Broken versions can always be reverted by rolling back and then skipping the broken version. + +When rolling forward, the backup of the newer version's `sqlite.db` is only restored if that exact version is the roll-forward version. +Otherwise, the older, rollback version of `sqlite.db` is preserved (i.e., the newer version's backup is not used). +This ensures that a version update which broke the database can be recovered with a rollback and a new patch. +It also ensures that a broken rollback is always recoverable by reversing the rollback. + +Example: Given v1, v2, v3 versions of Teleport, where v2 is broken: +1. v1 -> v2 -> v1 -> v3 => DB from v1 is migrated directly to v3, avoiding v2 breakage. +2. v1 -> v2 -> v1 -> v2 -> v3 => DB from v2 is recovered, in case v1 database no longer has a valid certificate. + +### Manual Workflow + +For use cases that fall outside of the functionality provided by `teleport-update`, we provide an alternative manual workflow using the `/v1/webapi/find` endpoint. +This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-update` because they use their own automation for updates (e.g., JamF or Ansible). + +Cluster administrators that want to self-manage agent updates may manually query the `/v1/webapi/find` endpoint using the host UUID, and implement auto-updates with their own automation. + +Cluster administrators that choose this path may use the `teleport` package without auto-updates enabled locally. + +### Installers + +The following install scripts will be updated to install the latest updater and run `teleport-update enable` with the proxy address: +- [/api/types/installers/agentless-installer.sh.tmpl](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/agentless-installer.sh.tmpl) +- [/api/types/installers/installer.sh.tmpl](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/installer.sh.tmpl) +- [/lib/web/scripts/oneoff/oneoff.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/oneoff/oneoff.sh) +- [/lib/web/scripts/node-join/install.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/node-join/install.sh) +- [/assets/aws/files/install-hardened.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/assets/aws/files/install-hardened.sh) + +Eventually, additional logic from the scripts could be added to `teleport-update`, such that `teleport-update` can configure teleport. + +Moving additional logic into the updater is out-of-scope for this proposal. + +To create pre-baked VM or container images that reduce the complexity of the cluster joining operation, two workflows are permitted: +- Install the `teleport` package and defer `teleport-update enable`, Teleport configuration, and `systemctl enable teleport` to cloud-init scripts. + This allows both the proxy address and token to be injected at VM initialization. The VM image may be used with any Teleport cluster. + Installers scripts will continue to function, as the package install operation will no-op. +- Install the `teleport` package and run `teleport-update enable` before the image is baked, but defer final Teleport configuration and `systemctl enable teleport` to cloud-init scripts. + This allows the proxy address to be pre-set in the image. `teleport.yaml` can be partially configured during image creation. At minimum, the token must be injected via cloud-init scripts. + Installers scripts would be skipped in favor of the `teleport configure` command. + +It is possible for a VM or container image to be created with a baked-in join token. +We should recommend against this workflow for security reasons, since a long-lived token improperly stored in an image could be leaked. + +Alternatively, users may prefer to skip pre-baked agent configuration, and run one of the script-based installers to join VMs to the cluster after the VM is started. + +Documentation should be created covering the above workflows. + +### Documentation + +The following documentation will need to be updated to cover the new updater workflow: +- https://goteleport.com/docs/choose-an-edition/teleport-cloud/downloads +- https://goteleport.com/docs/installation +- https://goteleport.com/docs/upgrading/self-hosted-linux +- https://goteleport.com/docs/upgrading/self-hosted-automatic-agent-updates + +Additionally, the Cloud dashboard tenants downloads tab will need to be updated to reference the new instructions. + +### Details - Kubernetes Agents + +The Kubernetes agent updater will be updated for compatibility with the new scheduling system. + +This means that it will stop reading update windows using the authenticated connection to the proxy, and instead update when indicated by the `/v1/webapi/find` endpoint. + +Rollbacks for the Kubernetes updater, as well as packaging changes to improve UX and compatibility, will be covered in a future RFD. + +## Migration + +The existing update system will remain in-place until the old auto-updater is fully deprecated. + +Both update systems can co-exist on the same machine. +The old auto-updater will update the system package, which will not affect the `teleport-update`-managed installation. + +Eventually, the `cluster_maintenance_config` resource and `teleport-ent-upgrader` package will be deprecated. + +## Security + +The initial version of automatic updates will rely on TLS to establish +connection authenticity to the Teleport download server. The authenticity of +assets served from the download server is out of scope for this RFD. Cluster +administrators concerned with the authenticity of assets served from the +download server can use self-managed updates with system package managers which +are signed. + +The Update Framework (TUF) will be used to implement secure updates in the future. + +Anyone who possesses an updater UUID can determine when that host is scheduled to update by repeatedly querying the public `/v1/webapi/find` endpoint. +It is not possible to discover the current version of that host, only the designated update window. + +## Logging + +All installation steps will be logged locally, such that they are viewable with `journalctl`. +Care will be taken to ensure that updater logs are sharable with Teleport Support for debugging and auditing purposes. + +When TUF is added, that events related to supply chain security may be sent to the Teleport cluster via the Teleport Agent. + +## Alternatives + +### `teleport update` Subcommand + +`teleport-update` is intended to be a minimal binary, with few dependencies, that is used to bootstrap initial Teleport agent installations. +It may be baked into AMIs or containers. + +If the entire `teleport` binary were used instead, security scanners would match vulnerabilities all Teleport dependencies, so customers would have to handle rebuilding artifacts (e.g., AMIs) more often. +Deploying these updates is often more disruptive than a soft restart of the agent triggered by the auto-updater. + +`teleport-update` will also handle `tbot` updates in the future, and it would be undesirable to distribute `teleport` with `tbot` just to enable automated updates. + +Finally, `teleport-update`'s API contract with the cluster must remain stable to ensure that outdated agent installations can always be recovered. +The first version of `teleport-update` will need to work with Teleport v14 and all future versions of Teleport. +This contract may be easier to manage with a separate artifact. + +### Mutually-Authenticated RPC for Update Boolean + +Agents will not always have a mutually-authenticated connection to auth to receive update instructions. +For example, the agent may be in a failed state due to a botched upgrade, may be temporarily stopped, or may be newly installed. +In the future, `tbot`-only installations may have expired certificates. + +Making the update boolean instruction available via the `/webapi/find` TLS endpoint reduces complexity as well as the risk of unrecoverable outages. + +## Execution Plan + +1. Implement Teleport APIs for new scheduling system (without backpressure strategy, canaries, or completion tracking) +2. Implement new Linux server auto-updater in Go, including systemd-based rollbacks. +3. Implement changes to Kubernetes auto-updater. +4. Test extensively on all supported Linux distributions. +5. Prep documentation changes. +6. Release via `teleport` package and script for package-less installation. +7. Release documentation changes. +8. Communicate to users that they should update to the new system. +9. Begin deprecation of old auto-updater resources, packages, and endpoints. +10. Add healthcheck endpoint to Teleport agents and incorporate into rollback logic. +11. Add progress and completion checking. +12. Add canary functionality. +13. Add backpressure functionality if necessary. +14. Add DB backups if necessary.