Improve robustness and auditing of discovery service #37620

Closed
19 tasks done
r0mant opened this issue Jan 31, 2024 · 1 comment
Labels: bug, discover, ui, ux

r0mant commented Jan 31, 2024

Description

When performing SSH auto-discovery on cloud providers, the discovery service automatically installs Teleport on the discovered instances. This works in the successful case, but when something fails (e.g. the Teleport package fails to install via SSM on an EC2 instance), the experience is subpar:

  • There is no clear indication that some nodes failed to join, which ones, or why.
  • There is no clear way for a user to find out what failed and where to go to troubleshoot.
  • There is no easy way for a user to retry the operation.
  • Teleport often keeps retrying the install and failing, with no ability to self-repair.

Proposed solution(-s)

In order of required effort, these are the steps I think we need to take to improve the reliability and observability of auto-discovery. The focus is on SSH for now, since that seems to be the most fragile at the moment.

Improve audit logging

As a cheap first step, let's make sure that all successes and failures of the discovery service are captured in the audit log. I think we already have audit logs for this, but let's re-inspect them and make sure they're useful and, in case of failure, contain all the information a user needs:

  • Which node the installation failed on.
  • Reason for the failure (e.g. install script error).
  • Link to the relevant part of Cloud Console for troubleshooting (e.g. SSM execution logs for EC2).
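As a rough sketch of the three items above, a failure event could carry one field per item and refuse to be emitted when any is missing. The type name `InstallFailureEvent`, its field names, and the placeholder URL are hypothetical and are not Teleport's actual audit event schema.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// InstallFailureEvent is a hypothetical payload, not Teleport's real
// audit event. It has one field per item in the list above.
type InstallFailureEvent struct {
	InstanceID string `json:"instance_id"` // which node the installation failed on
	Reason     string `json:"reason"`      // reason for the failure
	ConsoleURL string `json:"console_url"` // where to go to troubleshoot
}

// Validate reports the first required piece of information that is missing.
func (e InstallFailureEvent) Validate() error {
	switch {
	case e.InstanceID == "":
		return errors.New("missing instance ID")
	case e.Reason == "":
		return errors.New("missing failure reason")
	case e.ConsoleURL == "":
		return errors.New("missing troubleshooting link")
	}
	return nil
}

func main() {
	ev := InstallFailureEvent{
		InstanceID: "i-0123456789abcdef0",
		Reason:     "install script exited with a non-zero status",
		// Placeholder only; a real link would point at the provider's console.
		ConsoleURL: "https://console.example.com/ssm/executions/placeholder",
	}
	if err := ev.Validate(); err != nil {
		fmt.Println("refusing to emit incomplete event:", err)
		return
	}
	out, err := json.Marshal(ev)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

Validating before emitting means an incomplete failure event is caught in the discovery service rather than discovered later by a user staring at an unhelpful audit entry.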

Make sure auto-discovery install/upgrade scripts can handle small problems

We should identify the most common failure scenarios and make sure that the auto-discovery system and the install script can recover from them gracefully upon retry or once the issue has been resolved. Some examples that come to mind (not exhaustive):

  • Lack of IAM permissions (should also be clearly evident from audit log, to the point above).
  • Installation failed part-way due to external issues (repos, networking, etc.).
  • Installation failed due to bad config.

The system should detect that one of its own previous install attempts failed (as opposed to, for example, an installation done by something else) and take appropriate action, such as uninstalling and starting from scratch.
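One way to sketch this decision, assuming the installer leaves a marker on the host when it starts and updates it when it finishes. The marker mechanism, the `HostState` type, and the `Decide` function are all hypothetical and only illustrate the idea of telling a failed discovery install apart from an external one.

```go
package main

import "fmt"

// HostState is a hypothetical snapshot of what the installer finds on a host.
type HostState struct {
	PackagePresent bool // a Teleport package is already on the host
	MarkerPresent  bool // a marker left by a previous discovery-driven attempt
	MarkerComplete bool // the marker records that the attempt finished
}

// Action is what the installer should do next.
type Action string

const (
	ActionInstall   Action = "install"
	ActionReinstall Action = "uninstall-then-install"
	ActionSkip      Action = "skip"
)

// Decide picks the next step from the observed host state.
func Decide(s HostState) Action {
	switch {
	case s.MarkerPresent && !s.MarkerComplete:
		// Our own earlier attempt did not finish: clean up and start over.
		return ActionReinstall
	case s.PackagePresent:
		// Either our attempt completed or something else installed
		// Teleport; in both cases leave the host alone.
		return ActionSkip
	default:
		return ActionInstall
	}
}

func main() {
	states := []HostState{
		{},
		{PackagePresent: true, MarkerPresent: true},
		{PackagePresent: true},
	}
	for _, s := range states {
		fmt.Printf("%+v -> %s\n", s, Decide(s))
	}
}
```

Keeping the decision as a pure function of observed state makes each retry idempotent: running it again after the underlying issue is fixed leads to the same clean reinstall path.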

Make use of notification system

Auto-enrollment failures should be captured in our new notification system.

Build auto-discovery dashboard

(see #41909)
I think this is what will be most useful for users, but it will also require the most engineering effort and design support. The idea is to have a dedicated dashboard that shows auto-discovery status, where users can see and do things like:

  • See whether auto-discovery is enabled, turn it on/off and update discovery config.
  • See relevant auto-discovery status, for example nodes that failed to enroll and errors explaining why.
  • Manually trigger "retry" (or "fix") on the nodes that failed.
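A minimal sketch of the data such a dashboard might be built on, assuming the discovery service can produce one enrollment result per instance. `EnrollmentResult`, `Summary`, and `RetryCandidates` are hypothetical names for illustration, not existing Teleport types.

```go
package main

import (
	"fmt"
	"sort"
)

// EnrollmentResult is a hypothetical per-instance outcome.
// It is assumed that there is at most one result per instance.
type EnrollmentResult struct {
	InstanceID string
	Err        string // empty means the node enrolled successfully
}

// Summary is what a status view could render.
type Summary struct {
	Enrolled int
	Failed   map[string]string // instance ID -> error explaining why
}

// Summarize splits results into a success count and a failure map.
func Summarize(results []EnrollmentResult) Summary {
	s := Summary{Failed: make(map[string]string)}
	for _, r := range results {
		if r.Err == "" {
			s.Enrolled++
			continue
		}
		s.Failed[r.InstanceID] = r.Err
	}
	return s
}

// RetryCandidates returns the failed instance IDs in sorted order,
// so a manual "retry" action has a stable list to work from.
func (s Summary) RetryCandidates() []string {
	ids := make([]string, 0, len(s.Failed))
	for id := range s.Failed {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	s := Summarize([]EnrollmentResult{
		{InstanceID: "i-b", Err: "missing IAM permissions"},
		{InstanceID: "i-a", Err: "package repository unreachable"},
		{InstanceID: "i-c"},
	})
	fmt.Println("enrolled:", s.Enrolled)
	for _, id := range s.RetryCandidates() {
		fmt.Printf("failed: %s (%s)\n", id, s.Failed[id])
	}
}
```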

Related issues:
#31180

Tasks

  1. backport/branch/v15 discovery documentation no-changelog size/md
  2. backport/branch/v15 no-changelog size/sm
  3. backport/branch/v15 no-changelog size/sm
  4. backport/branch/v13 backport/branch/v14 backport/branch/v15 size/sm
  5. backport/branch/v15 size/md
  6. backport/branch/v15 size/sm
  7. backport/branch/v15 size/sm
  8. backport/branch/v15 documentation no-changelog size/md
  9. backport/branch/v15 backport/branch/v16 size/md
  10. backport/branch/v16 size/sm
  11. backport/branch/v16 no-changelog size/md

Replace installation script with Go code

  1. backport/branch/v15 backport/branch/v16 no-changelog size/sm
  2. backport/branch/v15 backport/branch/v16 no-changelog size/sm
  3. backport/branch/v15 backport/branch/v16 no-changelog size/sm
  4. backport/branch/v15 backport/branch/v16 no-changelog size/sm
  5. backport/branch/v15 backport/branch/v16 no-changelog size/sm
  6. backport/branch/v15 backport/branch/v16 no-changelog size/sm
  7. backport/branch/v16 no-changelog size/xl
  8. backport/branch/v16 no-changelog size/sm

zmb3 commented Dec 31, 2024

Everything in this tracking issue is marked complete and it hasn't been updated for 5 months, so I'm assuming it's safe to close.

zmb3 closed this as completed Dec 31, 2024