Improve messages when EC2 Auto Discover with SSM fails #41465

Merged: 3 commits into master from marco/ec2-ssm-describeinstancesinfo, May 16, 2024

@marcoandredinis marcoandredinis commented May 13, 2024

EC2 Auto Discover calls ssm:SendCommand to install Teleport on a set of EC2 instances.
This requires the SSM Agent to be running and reporting back to the AWS SSM Service.

This PR adds a new API call that queries the current status of the SSM Agent on the target EC2 instance.

If the agent did not register, is not currently online, or the EC2 instance is running an unsupported operating system, an error is reported.
The specific error is returned so the user can see it in the Audit Log.


Example:

Missing IAM permissions to connect to SSM
Before: log message
After: audit log with details (ssm agent did not register itself in AWS SSM Service)

{
  "account_id": "278576220453",
  "cluster_name": "lenix",
  "code": "TDS00W",
  "command_id": "no-command",
  "ei": 0,
  "event": "ssm.run",
  "exit_code": -1,
  "instance_id": "i-0359a5863acbe5e3e",
  "region": "eu-west-2",
  "status": "EC2 Instance is not registered in SSM. Make sure that the instance has AmazonSSMManagedInstanceCore policy assigned.",
  "time": "2024-05-13T14:57:05.407Z",
  "uid": "89951c40-d9c9-42d2-9032-a927fb52ac45"
}

SSM ran but is now unhealthy/connection lost
Before: audit log with status:failed
After: audit log with details (connection lost)

{
  "account_id": "278576220453",
  "cluster_name": "lenix",
  "code": "TDS00W",
  "command_id": "no-command",
  "ei": 0,
  "event": "ssm.run",
  "exit_code": -1,
  "instance_id": "i-0658ae6f2c7f7c142",
  "region": "eu-west-2",
  "status": "SSM Agent in EC2 Instance is not connecting to SSM Service. Restart or reinstall the SSM service. See https://docs.aws.amazon.com/systems-manager/latest/userguide/ami-preinstalled-agent.html#verify-ssm-agent-status for more details.",
  "time": "2024-05-13T14:57:05.611Z",
  "uid": "ff89e4b7-19f7-4431-a15c-96c067c94ee2"
}

instance is running Windows
Before: audit log with status:success
After: audit log with details (unsupported OS)

{
  "account_id": "278576220453",
  "cluster_name": "lenix",
  "code": "TDS00W",
  "command_id": "no-command",
  "ei": 0,
  "event": "ssm.run",
  "exit_code": -1,
  "instance_id": "i-0c0d55776004b8f6e",
  "region": "eu-west-2",
  "status": "EC2 instance is running an unsupported Operating System. Only Linux is supported.",
  "time": "2024-05-13T14:57:05.852Z",
  "uid": "343265de-9c56-49d9-8732-2384f66cf4b4"
}

If any other error happens, it will still be reported in the generic handler for the SendCommand API call.

Given this is a new API call, if the IAM role does not allow it, a log warning is emitted and the behavior is the same as before.
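
A minimal sketch of how the three cases above could be told apart before calling SendCommand. This is not the code in this PR: the `agentInfo` type, the `preflightStatus` function, and the `PingStatus`/`PlatformType` values are assumptions made for illustration. Only the status messages are copied from the audit events above.

```go
package main

// Status messages copied from the audit events shown in this description.
const (
	msgNotRegistered = "EC2 Instance is not registered in SSM. Make sure that the instance has AmazonSSMManagedInstanceCore policy assigned."
	msgNotOnline     = "SSM Agent in EC2 Instance is not connecting to SSM Service. Restart or reinstall the SSM service. See https://docs.aws.amazon.com/systems-manager/latest/userguide/ami-preinstalled-agent.html#verify-ssm-agent-status for more details."
	msgUnsupportedOS = "EC2 instance is running an unsupported Operating System. Only Linux is supported."
)

// agentInfo is a hypothetical, trimmed-down view of what SSM reports for a
// registered instance. The field values used here ("Online", "Linux") are
// assumptions about that report.
type agentInfo struct {
	PingStatus   string
	PlatformType string
}

// preflightStatus returns an empty string when the instance looks ready for
// SendCommand, or the status message to put in the audit event otherwise.
// known holds only the instances that SSM reported back.
func preflightStatus(instanceID string, known map[string]agentInfo) string {
	info, registered := known[instanceID]
	switch {
	case !registered:
		// SSM has never heard of this instance, e.g. missing IAM permissions.
		return msgNotRegistered
	case info.PingStatus != "Online":
		// The agent registered at some point but is no longer reporting.
		return msgNotOnline
	case info.PlatformType != "Linux":
		return msgUnsupportedOS
	}
	return ""
}
```

Instances with a non-empty status would get an `ssm.run` audit event with `exit_code: -1` and no command sent, and the rest would go through SendCommand as before.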

Related #37620

@marcoandredinis marcoandredinis added the no-changelog and backport/branch/v15 labels May 13, 2024

@marcoandredinis marcoandredinis changed the title from "Improve unavailable messages when EC2 Auto Discover with SSM fails" to "Improve messages when EC2 Auto Discover with SSM fails" May 13, 2024

Comment on lines 219 to 220
// AWS returns an error if MaxResults is less than 5.
MaxResults: aws.Int64(max(defaultMaxResults, int64(len(allInstanceIDs)))),
Contributor

The max is 50 though. I think we should just use the default and iterate through NextToken

Contributor Author

I'll use awsEC2APIChunkSize instead and add a comment explaining why we shouldn't change its value.
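
A minimal sketch of the batching idea discussed in this thread, assuming instance IDs are split into chunks no larger than the API's page limit so that one request covers one chunk. The constant `assumedChunkSize` and the helper `chunkInstanceIDs` are hypothetical; the actual value of `awsEC2APIChunkSize` is not shown in this conversation, and 50 is taken only from the review comment above.

```go
package main

// assumedChunkSize mirrors the "max is 50" limit mentioned in the review.
// It is an assumption, not the real value of awsEC2APIChunkSize.
const assumedChunkSize = 50

// chunkInstanceIDs splits ids into consecutive slices of at most size
// elements. The returned slices share memory with ids.
func chunkInstanceIDs(ids []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var chunks [][]string
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		chunks = append(chunks, ids[start:end])
	}
	return chunks
}
```

With chunks capped this way, the result limit passed to the API can be a fixed constant rather than a value derived from the number of instances.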

@public-teleport-github-review-bot public-teleport-github-review-bot bot removed the request for review from xinding33 May 16, 2024 15:01
@marcoandredinis marcoandredinis force-pushed the marco/ec2-ssm-describeinstancesinfo branch from 37e2efb to 0517d72 Compare May 16, 2024 16:03
EC2 Auto Discover calls ssm:SendCommand to install teleport in a set of
EC2 Instances.
This requires the SSM Agent to be running and reporting back to the
AWS SSM Service.

This PR adds a new API call which is used to query the current status of
the SSM agent in the target EC2 instance.

If the agent did not register, is not currently online or the EC2
instance is running an unsupported operating system, an error is
reported.
The specific error is returned and the user can see this in the Audit
Log.

As an example, let's say we have 3 instances:
- i-A: missing IAM permissions to connect to SSM
- i-B: SSM ran but is now unhealthy
- i-C: instance is running Windows

Previously we had the following observable output after running the
Discovery Service:
i-A (missing iam permissions)
Log message with stack trace indicating that "instance is not valid for
account" with link for further troubleshooting.
No audit log was emitted

i-B (SSM is unhealthy)
No app log, but audit log with status:failed and exit_code:-1

i-C (windows instance)
No app log, but audit log with status:success and exit_code:0

After this PR, the following is reported:
i-A (missing iam permissions)
No app log
Audit log with a clear status message (see code/tests)

i-B (SSM is unhealthy)
No app log
Audit log with a clear status message (see code/tests)

i-C (windows instance)
No app log
Audit log with a clear status message (see code/tests)

If any other error happens, it will still be reported in the generic
handler for the SendCommand API call.

Given this is a new API call, if the Role does not allow it, a log
warning is emitted and the behavior is the same as before.
@marcoandredinis marcoandredinis force-pushed the marco/ec2-ssm-describeinstancesinfo branch from 0517d72 to 6b59009 Compare May 16, 2024 16:04

@marcoandredinis marcoandredinis enabled auto-merge May 16, 2024 16:17
@marcoandredinis marcoandredinis added this pull request to the merge queue May 16, 2024
Merged via the queue into master with commit cb6b769 May 16, 2024
40 checks passed
@marcoandredinis marcoandredinis deleted the marco/ec2-ssm-describeinstancesinfo branch May 16, 2024 16:45
@public-teleport-github-review-bot

@marcoandredinis See the table below for backport results.

Branch Result
branch/v15 Failed

marcoandredinis added a commit that referenced this pull request May 17, 2024
* Improve unavailable messages when EC2 Auto Discover with SSM fails

* best effort on emitting events

* improve maxresults param
marcoandredinis added a commit that referenced this pull request May 22, 2024
* Improve unavailable messages when EC2 Auto Discover with SSM fails

* best effort on emitting events

* improve maxresults param
github-merge-queue bot pushed a commit that referenced this pull request May 22, 2024
…dit log (#41664)

* Improve messages when EC2 Auto Discover with SSM fails (#41465)

* Improve unavailable messages when EC2 Auto Discover with SSM fails

* best effort on emitting events

* improve maxresults param

* Add SSM Commands stdout/err to audit log (#41478)

This PR adds two new fields to the SSMRun audit events:
- stdout
- stderr

This will help diagnose the failures of teleport installations in EC2
instances using SSM (EC2 Auto Discover).

* SSMRun Audit Event: add invocation url (#41663)

This PR adds a new field in the SSMRun audit event: invocation url.

EC2 Auto Discover uses SSM to install teleport in the target instance.
An invocation is the execution of a Command in an Instance.
This URL points to that invocation, so users can more easily debug what
went wrong and how to fix it in case of a failure.

* EC2 Auto Discover with SSM: add script stdout and stderr to audit log (#41479)

This PR fills in the stdout and stderr fields of the SSMRun audit
event.
The script to install teleport in ec2 instances has two steps: download
and run shell script.

This will help diagnose what failed during the auto discover of ec2
instances.

* Fix EC2 Auto Discover SSM failure when sending an extra param (#41532)

For agentless installations we would send an extra param to the
ssm:SendCommand API.
Customers can create and use custom SSM Documents, however, when using
the default one, that parameter does not exist.
The ssm:SendCommand API returns an error if an extra param is sent.

This PR makes a best-effort attempt to accommodate that: if a known error
is returned and the known extra param was sent, remove it and try again.
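
A minimal sketch of that retry, under stated assumptions: `sendFunc` abstracts the SendCommand call, `errUnknownParameter` stands in for the known error, and the parameter name is passed in by the caller. None of these names come from the commit itself.

```go
package main

import "errors"

// errUnknownParameter is a hypothetical stand-in for the known error that
// the API returns when the document does not declare a supplied parameter.
var errUnknownParameter = errors.New("document does not accept the supplied parameter")

// sendFunc abstracts the call that sends the command with its parameters.
type sendFunc func(params map[string]string) error

// sendWithParamFallback sends the command once. If that fails with the known
// error and extraParam was part of the request, it retries once without it.
// Any other error, or the known error without extraParam, is returned as is.
func sendWithParamFallback(send sendFunc, params map[string]string, extraParam string) error {
	err := send(params)
	if err == nil {
		return nil
	}
	_, sent := params[extraParam]
	if !errors.Is(err, errUnknownParameter) || !sent {
		return err
	}
	// Copy the parameters so the caller's map is left untouched.
	trimmed := make(map[string]string, len(params))
	for name, value := range params {
		if name != extraParam {
			trimmed[name] = value
		}
	}
	return send(trimmed)
}
```

The retry happens at most once, so a document that rejects the request for another reason still fails quickly.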

* EC2 Auto Discover with SSM: add invocation url to audit log (#41689)

This PR adds the invocation URL into the audit log when running the
teleport installer script during EC2 Auto Discover.

* EC2 Auto Discover SSM: add support for debugging custom SSM Docs (#41706)

This PR uses a new AWS API that lists the steps of the current
invocation.
After listing them, it asks for the output of each one.

Previously, we were using a static list of steps: those defined in the
default SSM Document.

However, for custom documents with a different list of steps, that would
fail.

If the client does not have access to this new API, we fall back to
the list of steps that exist in the default SSM Document.

If we ask for the status of one of those steps and receive a known
error indicating that the step does not exist, instead of failing we
emit the overall invocation result (which doesn't include
stdout/stderr, but is better than nothing).
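
A minimal sketch of the two fallbacks described above, with every name hypothetical: the sentinel errors stand in for the AWS error codes, the step names in `defaultSteps` are placeholders for the two steps of the default SSM Document, and the listing and output calls are injected as functions.

```go
package main

import "errors"

// Hypothetical sentinel errors standing in for the relevant AWS error codes.
var (
	errAccessDenied = errors.New("access denied to list invocation steps")
	errStepNotFound = errors.New("step does not exist in this invocation")
)

// defaultSteps uses placeholder names for the steps of the default document.
var defaultSteps = []string{"downloadContent", "runShellScript"}

// invocationSteps asks the API for the steps of the invocation and falls back
// to the default document's steps when the caller lacks access to that API.
func invocationSteps(list func() ([]string, error)) ([]string, error) {
	steps, err := list()
	if errors.Is(err, errAccessDenied) {
		return defaultSteps, nil
	}
	return steps, err
}

// stepOutput fetches the output of one step. When the step is reported as
// missing, it returns the overall invocation result instead of an error.
func stepOutput(get func(step string) (string, error), step, overall string) (string, error) {
	out, err := get(step)
	if errors.Is(err, errStepNotFound) {
		return overall, nil
	}
	return out, err
}
```

Both helpers pass unrelated errors through unchanged, so only the two known conditions trigger a fallback.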
Labels
backport/branch/v15, discovery, documentation, no-changelog, size/md

4 participants