[UX] warning before launching jobs/serve when using a reauth required credentials #4479

Open
wants to merge 17 commits into
base: master


Contributor

@weih1121 weih1121 commented Dec 18, 2024

issue link: #4433
Solution:

Context:

  • The jobs controller or serve controller needs permission to launch and tear down nodes across multiple clouds.

Detail:

  • Find the clouds whose remote identity is LOCAL_CREDENTIALS and whose credentials can expire, and warn the user before provisioning.
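
The selection step described above can be sketched roughly as follows. This is a minimal illustration and not the PR's implementation: `clouds_to_warn`, the `config` dict layout, and the `expirable` set are hypothetical stand-ins for SkyPilot's config handling and cloud objects.

```python
from typing import Any, Dict, List, Set

LOCAL_CREDENTIALS = 'LOCAL_CREDENTIALS'


def clouds_to_warn(config: Dict[str, Any], expirable: Set[str]) -> List[str]:
    """Return clouds that use local credentials which may expire.

    `config` mimics the top-level layout of ~/.sky/config.yaml, and
    `expirable` holds the clouds whose active credentials can expire.
    Both are illustrative inputs, not SkyPilot APIs.
    """
    result = []
    for cloud, settings in config.items():
        if not isinstance(settings, dict):
            continue
        uses_local = settings.get('remote_identity') == LOCAL_CREDENTIALS
        if uses_local and cloud in expirable:
            result.append(cloud.upper())
    return sorted(result)


config = {
    'gcp': {'remote_identity': LOCAL_CREDENTIALS},
    'aws': {'remote_identity': LOCAL_CREDENTIALS},
    'jobs': {'controller': {'resources': {'cloud': 'gcp'}}},
}
print(clouds_to_warn(config, expirable={'gcp'}))  # prints ['GCP']
```

The `jobs` entry is skipped because it has no `remote_identity` key, so only clouds that both opt into local credentials and are flagged as expirable end up in the warning.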

Test case 1:
Config in ~/.sky/config.yaml:

gcp:
  remote_identity: LOCAL_CREDENTIALS

aws:
  remote_identity: LOCAL_CREDENTIALS

jobs:
  controller:
    resources:
      cloud: gcp

The local AWS credential is configured through SSO.
Launch the controller on GCP and the task on AWS.
Command to run:

python cli.py jobs launch ~/hello-sky/hello_sky.yaml -y

Logs:

~/skypilot/sky on dev/hong/controller wip *1 > python cli.py jobs launch ~/hello-sky/hello_sky.yaml -y                      took 49s py sky at 17:27:04
Task from YAML spec: /Users/hong/hello-sky/hello_sky.yaml
Resources for managed job 'sky-17a6-hong' will be computed on the managed jobs controller, since --yes is set.
⚙︎ Translating workdir to SkyPilot Storage...
  Workdir: '.' -> storage: 'skypilot-workdir-hong-a0c33aaf'.
  Created S3 bucket 'skypilot-workdir-hong-a0c33aaf' in us-east-1
  Excluded files to sync to cluster based on .gitignore.
✓ Uploaded local files/folders.
Launching managed job 'sky-17a6-hong' from jobs controller...
Choosing resources for managed jobs controller...
Considered resources (1 node):
----------------------------------------------------------------------------------------------
 CLOUD   INSTANCE        vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
----------------------------------------------------------------------------------------------
 GCP     n2-standard-8   8       32        -              us-central1-a   0.39          ✔
----------------------------------------------------------------------------------------------
⠹ Launching  View logs at: ~/sky_logs/sky-2025-01-02-17-28-17-672434/provision.log
Warning: Expiring credentials detected for [GCP]. Clusters may be leaked if the credentials expire while jobs are running. It is recommended to use credentials that never expire or a service account.
⠼ Launching  View logs at: ~/sky_logs/sky-2025-01-02-17-28-17-672434/provision.logINFO:oci.circuit_breaker:Default Auth client Circuit breaker strategy enabled
⚙︎ Launching managed jobs controller on GCP us-central1 (us-central1-a).

Warnings detail:

Warning: Expiring credentials detected for [GCP]. Clusters may be leaked if the credentials expire while jobs are running. It is recommended to use credentials that never expire or a service account.

Worker provisioning failed because AWS is not accessible from the controller on GCP.

Test case 2:
Provision controller on AWS and worker on GCP

Launching managed job 'sky-40f6-hong' from jobs controller...
Choosing resources for managed jobs controller...
Considered resources (1 node):
------------------------------------------------------------------------------------------
 CLOUD   INSTANCE      vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
------------------------------------------------------------------------------------------
 AWS     m6i.2xlarge   8       32        -              us-east-1     0.38          ✔
------------------------------------------------------------------------------------------
⠴ Launching  View logs at: ~/sky_logs/sky-2025-01-02-18-04-56-343901/provision.log
Warning: Expiring credentials detected for [GCP]. Clusters may be leaked if the credentials expire while jobs are running. It is recommended to use credentials that never expire or a service account.
⠹ Launching  View logs at: ~/sky_logs/sky-2025-01-02-18-04-56-343901/provision.logINFO:oci.circuit_breaker:Default Auth client Circuit breaker strategy enabled
⚙︎ Launching managed jobs controller on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1f).

The warning for expiring local credentials is still shown.

The job completed in this case:

⚙︎ Running setup on managed jobs controller.
  Check & install cloud dependencies on controller: done.
✓ Setup completed.  View logs at: ~/sky_logs/sky-2025-01-02-18-04-56-343901/setup-*.log
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(setup pid=3601) Running setup.
(sky-40f6-hong, pid=3601) Hello, SkyPilot!
(sky-40f6-hong, pid=3601) # conda environments:
(sky-40f6-hong, pid=3601) #
(sky-40f6-hong, pid=3601) base                  *  /home/gcpuser/miniconda3
(sky-40f6-hong, pid=3601) skypilot-runtime         /home/gcpuser/miniconda3/envs/skypilot-runtime
(sky-40f6-hong, pid=3601)
(sky-40f6-hong, pid=3601) End task.
✓ Job finished (status: SUCCEEDED).
✓ Managed job finished: 1 (status: SUCCEEDED).

Test case 3:
Same as test case 2, but using a service account for GCP.

Launching managed job 'sky-ae51-hong' from jobs controller...
Choosing resources for managed jobs controller...
Considered resources (1 node):
------------------------------------------------------------------------------------------
 CLOUD   INSTANCE      vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
------------------------------------------------------------------------------------------
 AWS     m6i.2xlarge   8       32        -              us-east-1     0.38          ✔
------------------------------------------------------------------------------------------
⠏ Launching  View logs at: ~/sky_logs/sky-2025-01-02-18-21-57-280832/provision.logINFO:oci.circuit_breaker:Default Auth client Circuit breaker strategy enabled
⚙︎ Launching managed jobs controller on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1f).
└── Instance is up.
Open file descriptor limit (256) is low. File sync to remote clusters may be slow. Consider increasing the limit using `ulimit -n <number>` or modifying system limits.
✓ Cluster launched: sky-jobs-controller-d8a421b3.  View logs at: ~/sky_logs/sky-2025-01-02-18-21-57-280832/provision.log
⚙︎ Mounting files.
  Syncing (to 1 node): /var/folders/d1/c810pqs51p58jyq4g4d8czbh0000gn/T/managed-dag-sky-ae51-hong-4exdx6sw -> ~/.sky/managed_jobs/sky-ae51-hong-757c.yaml
  Syncing (to 1 node): /var/folders/d1/c810pqs51p58jyq4g4d8czbh0000gn/T/tmprzouvn9a -> ~/.sky/managed_jobs/sky-ae51-hong-757c.config_yaml
✓ Files synced.  View logs at: ~/sky_logs/sky-2025-01-02-18-21-57-280832/file_mounts.log
⚙︎ Running setup on managed jobs controller.
[2/3] Check & install cloud dependencies on controller: GCP SDK

The instance is provisioned without a warning message.
The job on GCP finished:

✓ Files synced.  View logs at: ~/sky_logs/sky-2025-01-02-18-21-57-280832/file_mounts.log
⚙︎ Running setup on managed jobs controller.
  Check & install cloud dependencies on controller: done.
✓ Setup completed.  View logs at: ~/sky_logs/sky-2025-01-02-18-21-57-280832/setup-*.log
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(setup pid=3194) Running setup.
(sky-ae51-hong, pid=3194) Hello, SkyPilot!
(sky-ae51-hong, pid=3194) # conda environments:
(sky-ae51-hong, pid=3194) #
(sky-ae51-hong, pid=3194) base                  *  /home/gcpuser/miniconda3
(sky-ae51-hong, pid=3194) skypilot-runtime         /home/gcpuser/miniconda3/envs/skypilot-runtime
(sky-ae51-hong, pid=3194)
(sky-ae51-hong, pid=3194) End task.
✓ Managed job finished: 1 (status: SUCCEEDED).

@Michaelvll @romilbhardwaj can you help review?

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@weih1121 weih1121 marked this pull request as draft December 18, 2024 05:40
@weih1121 weih1121 changed the title from "[UX] warning before launching jobs/serve when using a auth required credentials" to "[UX] warning before launching jobs/serve when using a reauth required credentials" Dec 18, 2024
@weih1121 weih1121 marked this pull request as ready for review December 18, 2024 06:40
@@ -536,6 +536,10 @@ def get_credential_file_mounts(self) -> Dict[str, str]:
        """
        raise NotImplementedError

    def can_credential_expire(self) -> bool:
Collaborator

nit:

Suggested change
-    def can_credential_expire(self) -> bool:
+    def can_credentials_expire(self) -> bool:

Contributor Author

This checks the active credential (there is only one), so I think the original name makes sense.
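
For readers following the naming discussion, the hook uses a common pattern: the base class declares the method and each cloud overrides it. Below is a minimal sketch of that pattern. The class bodies, the default return value, and `ExampleCloud` are assumptions made for illustration; the diff above shows only the method signature.

```python
class Cloud:
    """Stand-in for a cloud base class (not SkyPilot's actual class)."""

    def can_credential_expire(self) -> bool:
        # Assumed default: credentials are treated as non-expiring unless
        # a subclass says otherwise.
        return False


class ExampleCloud(Cloud):
    """Hypothetical cloud whose active credential may be short-lived."""

    def __init__(self, uses_service_account: bool) -> None:
        self.uses_service_account = uses_service_account

    def can_credential_expire(self) -> bool:
        # Only one credential is active at a time, so a single bool
        # answers the question for this cloud.
        return not self.uses_service_account
```

With this shape, a caller can ask every enabled cloud the same question without knowing how each one authenticates.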

Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks @weih1121!

SSO/Container-role refresh token:
https://docs.aws.amazon.com/solutions/latest/dea-api/auth-refreshtoken.html
"""
# TODO(hong): Add a check for the expiration of the temporary
Collaborator

incomplete?

Suggested change
- # TODO(hong): Add a check for the expiration of the temporary
+ # TODO(hong): Add a CLI based check for the expiration of the temporary credentials

Contributor Author

@weih1121 weih1121 Jan 3, 2025

When provisioning an AWS instance for a jobs controller or serve controller, we attach an IAM role to the instance, which refreshes the credentials automatically. We only update the local credentials for clusters when the SHARED_CREDENTIALS credential type is configured; other types, such as ENV or SSO, are ignored.

There are a few reasons why I marked this as a TODO:

  • For the ENV, SHARED_CREDENTIALS, and IAM ROLE credential types, there is no CLI tool that can help check expiration.
  • For SSO, although I can read the expiry from the cache files in ~/.aws/sso/cache/xxx.json, the cluster does not use local SSO credentials at all.
  • The only credentials that are updated are SHARED_CREDENTIALS, which are generated from the AWS access portal. These credentials can expire.
  • Another factor is performance, although it is not the most important one.
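
One way to read the reasoning above is as a per-type decision. The sketch below encodes that reading only: the `IdentityType` enum, its member values, and the choice to flag just `SHARED_CREDENTIALS` are assumptions drawn from this comment, not the code in the PR.

```python
import enum


class IdentityType(enum.Enum):
    """Hypothetical labels for the credential types named above."""
    ENV = 'env'
    SHARED_CREDENTIALS = 'shared-credentials'
    SSO = 'sso'
    IAM_ROLE = 'iam-role'

    def can_credential_expire(self) -> bool:
        # Per the comment above, only SHARED_CREDENTIALS are updated on
        # clusters and may expire there; the other types are ignored.
        return self is IdentityType.SHARED_CREDENTIALS


for identity in IdentityType:
    print(identity.name, identity.can_credential_expire())
```

Keeping the decision on the enum means a later CLI-based expiry check, as the TODO suggests, could replace the body of one method without touching callers.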

Collaborator

Oh I meant the TODO sentence was incomplete :)

Comment on lines 2011 to 2015
warnings = (f'\nWarning: Expiring credentials detected for '
            f'{expirable_clouds}. Clusters may be leaked if '
            f'the credentials expire while jobs are running. '
            f'It is recommended to use credentials that never'
            f' expire or a service account.')
Collaborator

Since we do not verify if the credentials are expiring, we should update the message:

Suggested change
- warnings = (f'\nWarning: Expiring credentials detected for '
-             f'{expirable_clouds}. Clusters may be leaked if '
-             f'the credentials expire while jobs are running. '
-             f'It is recommended to use credentials that never'
-             f' expire or a service account.')
+ warnings = (f'\nWarning: Credentials used for {expirable_clouds} may expire. '
+             f' Resources may be leaked if '
+             f'the credentials expire while jobs are running. '
+             f'It is recommended to use credentials that never'
+             f' expire or a service account.')

Contributor Author

updated
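
As a rough illustration of the reworded message, the helper below builds it from a list of cloud names. `expiry_warning` is a hypothetical function name, the wording is adapted from the suggestion above, and returning an empty string when no cloud is affected is an assumption rather than behavior confirmed by the PR.

```python
from typing import List


def expiry_warning(expirable_clouds: List[str]) -> str:
    """Build the warning text, or return '' if no cloud is affected."""
    if not expirable_clouds:
        return ''
    clouds = ', '.join(expirable_clouds)
    return (f'\nWarning: Credentials used for [{clouds}] may expire. '
            'Resources may be leaked if the credentials expire while '
            'jobs are running. It is recommended to use credentials '
            'that never expire or a service account.')


print(expiry_warning(['GCP']))
```

Saying "may expire" rather than "expiring credentials detected" matches the review point that expiry is not actually verified.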

Comment on lines 774 to 810

credentials = sky_check.get_cloud_credential_file_mounts(excluded_clouds)
Collaborator

Changes not needed?

Contributor Author

That change was accidental; removed.

f'the credentials expire while jobs are running. '
f'It is recommended to use credentials that never'
f' expire or a service account.')
click.secho(warnings, fg='yellow')
Collaborator

Instead of importing and using click, can we use logger.warning with colors?

Contributor Author

Updated. I assume this choice is for performance reasons, right?
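
For reference, a colored warning through a logger can be sketched with the standard library alone. This uses plain `logging` and raw ANSI escape codes, not SkyPilot's own logging or color utilities; the logger name and `warn_in_yellow` are hypothetical.

```python
import logging

# ANSI escape sequences for yellow text and for resetting the style.
YELLOW = '\x1b[33m'
RESET = '\x1b[0m'

logger = logging.getLogger('example')


def warn_in_yellow(message: str) -> None:
    """Emit `message` at WARNING level, wrapped in ANSI yellow."""
    logger.warning('%s%s%s', YELLOW, message, RESET)


warn_in_yellow('Credentials used for [GCP] may expire.')
```

Routing the message through the logger keeps it subject to the same level and handler configuration as other output, which `click.secho` would bypass.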

@weih1121 weih1121 requested a review from romilbhardwaj January 3, 2025 04:00