Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sky launch takes ~5s to print out optimizer table, which is slow #3159

Open
concretevitamin opened this issue Feb 14, 2024 · 9 comments · Fixed by #3280
Open

sky launch takes ~5s to print out optimizer table, which is slow #3159

concretevitamin opened this issue Feb 14, 2024 · 9 comments · Fixed by #3280
Labels
help wanted Extra attention is needed

Comments

@concretevitamin
Copy link
Member

sky launch takes ~5s to print out optimizer table, which is slow

  • this doesn't include catalog files refresh, which add a few seconds more

Test:

Top few lines

         1215743 function calls (1189185 primitive calls) in 4.759 seconds

   Ordered by: cumulative time
   List reduced from 5569 to 331 due to restriction <'sky'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.758    4.758 /skypilot/sky/cli.py:805(_launch_with_confirm)
        1    0.000    0.000    4.351    4.351 /skypilot/sky/optimizer.py:103(optimize)
        1    0.000    0.000    4.351    4.351 /skypilot/sky/optimizer.py:934(_optimize_dag)
        1    0.001    0.001    4.275    4.275 /skypilot/sky/optimizer.py:231(_estimate_nodes_cost_or_time)
      646    0.002    0.000    2.648    0.004 /skypilot/sky/clouds/service_catalog/__init__.py:22(_map_clouds_catalog)
        1    0.000    0.000    1.929    1.929 /skypilot/sky/optimizer.py:1144(_fill_in_launchable_resources)
      169    0.001    0.000    1.676    0.010 /skypilot/sky/resources.py:944(get_reservations_available_resources)
      121    0.000    0.000    1.675    0.014 /skypilot/sky/clouds/gcp.py:936(get_reservations_available_resources)
      121    0.000    0.000    1.675    0.014 /skypilot/sky/clouds/utils/gcp_utils.py:121(list_reservations_for_instance_type_in_zone)
        1    0.000    0.000    1.674    1.674 /skypilot/sky/clouds/utils/gcp_utils.py:129(_list_reservations_for_instance_type)
        1    0.000    0.000    1.674    1.674 /skypilot/sky/utils/subprocess_utils.py:144(run_with_retries)
        1    0.001    0.001    1.674    1.674 /skypilot/sky/skylet/log_lib.py:129(run_with_log)
        1    0.000    0.000    1.657    1.657 /skypilot/sky/skylet/log_lib.py:105(process_subprocess_stream)
        1    0.000    0.000    1.640    1.640 /skypilot/sky/skylet/log_lib.py:53(_handle_io_stream)
        6    0.000    0.000    1.388    0.231 /skypilot/sky/clouds/cloud.py:321(get_feasible_launchable_resources)
        6    0.000    0.000    1.384    0.231 /skypilot/sky/clouds/service_catalog/__init__.py:180(get_default_instance_type)
        1    0.000    0.000    0.788    0.788 /skypilot/sky/clouds/aws.py:419(_get_feasible_launchable_resources)
        1    0.000    0.000    0.787    0.787 /skypilot/sky/clouds/aws.py:338(get_default_instance_type)
        1    0.000    0.000    0.752    0.752 /skypilot/sky/clouds/service_catalog/aws_catalog.py:215(get_default_instance_type)
      102    0.000    0.000    0.731    0.007 /skypilot/sky/clouds/service_catalog/aws_catalog.py:154(_get_df)
        1    0.009    0.009    0.731    0.731 /skypilot/sky/clouds/service_catalog/aws_catalog.py:93(_fetch_and_apply_az_mapping)
      292    0.020    0.000    0.681    0.002 /skypilot/sky/clouds/service_catalog/common.py:146(_get_instance_type)
        1    0.000    0.000    0.676    0.676 /skypilot/sky/clouds/aws.py:595(get_current_user_identity)
      168    0.000    0.000    0.626    0.004 /skypilot/sky/resources.py:890(get_cost)
      168    0.000    0.000    0.626    0.004 /skypilot/sky/clouds/service_catalog/__init__.py:152(get_hourly_cost)
      168    0.001    0.000    0.621    0.004 /skypilot/sky/clouds/service_catalog/common.py:236(get_hourly_cost_impl)
        6    0.000    0.000    0.541    0.090 /skypilot/sky/optimizer.py:1086(_make_launchables_for_valid_region_zones)
        1    0.000    0.000    0.524    0.524 /skypilot/sky/clouds/oci.py:300(_get_feasible_launchable_resources)
        1    0.000    0.000    0.523    0.523 /skypilot/sky/clouds/oci.py:167(get_default_instance_type)
      175    0.001    0.000    0.520    0.003 /skypilot/sky/resources.py:1090(copy)
      181    0.001    0.000    0.518    0.003 /skypilot/sky/resources.py:48(__init__)
        1    0.000    0.000    0.514    0.514 /skypilot/sky/clouds/service_catalog/oci_catalog.py:101(get_default_instance_type)
      121    0.000    0.000    0.507    0.004 /skypilot/sky/clouds/gcp.py:274(instance_type_to_hourly_cost)
      121    0.000    0.000    0.504    0.004 /skypilot/sky/clouds/service_catalog/gcp_catalog.py:214(get_hourly_cost)
       16    0.000    0.000    0.487    0.030 /skypilot/sky/clouds/service_catalog/oci_catalog.py:35(_get_df)
      181    0.000    0.000    0.446    0.002 /skypilot/sky/resources.py:553(_validate_and_set_region_zone)
      168    0.000    0.000    0.445    0.003 /skypilot/sky/clouds/service_catalog/__init__.py:127(validate_region_zone)
      168    0.001    0.000    0.440    0.003 /skypilot/sky/clouds/service_catalog/common.py:166(validate_region_zone_impl)
      146    0.000    0.000    0.436    0.003 /skypilot/sky/clouds/cloud.py:463(validate_region_zone)
      168    0.019    0.000    0.392    0.002 /skypilot/sky/clouds/service_catalog/common.py:352(_filter_region_zone)
...

Looks like at least a few things to optimize:

  • list_reservations_for_instance_type_in_zone
  • _fetch_and_apply_az_mapping
  • ...
@concretevitamin
Copy link
Member Author

Still relevant in a many-cloud setting. With latest #3280 merged and 6 clouds enabled, sky launch takes ~3.8s on my laptop.

@github-actions github-actions bot added the Stale label Jul 6, 2024
@skypilot-org skypilot-org deleted a comment from github-actions bot Jul 6, 2024
@Michaelvll Michaelvll removed the Stale label Jul 6, 2024
@github-actions github-actions bot added the Stale label Nov 4, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 15, 2024
@Michaelvll Michaelvll reopened this Dec 12, 2024
@github-actions github-actions bot removed the Stale label Dec 13, 2024
@skypilot-org skypilot-org deleted a comment from github-actions bot Dec 16, 2024
@skypilot-org skypilot-org deleted a comment from github-actions bot Dec 16, 2024
@aylei
Copy link
Contributor

aylei commented Dec 17, 2024

@concretevitamin If there is no active development on this issue, I'd like to take it up.

@aylei
Copy link
Contributor

aylei commented Dec 17, 2024

I just reproduced this issue with 7 providers enabled on my test branch aylei@ffdec30 (based off f0ebf13) and get some thoughts on the solution, correct me if I wrong:

> sky check
...
🎉 Enabled clouds 🎉
  ✔ AWS
  ✔ Azure
  ✔ Cudo
  ✔ GCP
  ✔ Kubernetes
  ✔ Paperspace
  ✔ RunPod

> SKYPILOT_TIMELINE_FILE_PATH=timeline.json sky launch --gpus H100:8
Considered resources (1 node):
-------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE         vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE        COST ($)   CHOSEN
-------------------------------------------------------------------------------------------------------
 RunPod       8x_H100_SECURE   128     640       H100:8         CA                 35.92         ✔
 Paperspace   H100x8           128     640       H100:8         East Coast (NY2)   47.60
 GCP          a3-highgpu-8g    208     1872      H100:8         us-central1-a      87.83
 AWS          p5.48xlarge      192     2048      H100:8         us-east-1          98.32
-------------------------------------------------------------------------------------------------------
Aborted!

> head -15 profile_stats.txt
         2225298 function calls (2193864 primitive calls) in 3.466 seconds

   Ordered by: cumulative time
   List reduced from 8052 to 7609 due to restriction <'sky'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    3.466    3.466 /Users/aylei/repo/skypilot-org/skypilot/sky/cli.py:557(_launch_with_confirm)
     38/2    0.000    0.000    3.315    1.658 /Users/aylei/repo/skypilot-org/skypilot/sky/utils/common_utils.py:376(_record)
        1    0.000    0.000    3.313    3.313 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:108(optimize)
        1    0.000    0.000    3.311    3.311 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:993(_optimize_dag)
        1    0.000    0.000    3.302    3.302 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:240(_estimate_nodes_cost_or_time)
        1    0.000    0.000    3.085    3.085 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:1257(_fill_in_launchable_resources)
        7    0.000    0.000    3.008    0.430 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/cloud.py:371(get_feasible_launchable_resources)
      136    0.000    0.000    2.720    0.020 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/service_catalog/__init__.py:21(_map_clouds_catalog)
        6    0.000    0.000    2.426    0.404 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/service_catalog/__init__.py:247(get_instance_type_for_accelerator)

According to the stats (profile_stats.txt) _fill_in_launchable_resources takes 3.085s out of 3.466s. Focusing this particular function in timeline events (timeline.json), the main bottleneck I've identified is that the get_feasible_launchable_resources method of each cloud is processed in serial:

> cat timeline.json| jq -r '.traceEvents.[] | "\(.name), \(.ph): \(.ts)"' | grep -v get_df
sky.backends.backend_utils.refresh_cluster_status_handle, B:  1734446409752685.000
sky.backends.backend_utils.refresh_cluster_status_handle, E:  1734446409754911.000
sky.optimizer.Optimizer.optimize, B:  1734446409755096.000
sky.optimizer._fill_in_launchable_resources, B:  1734446409757335.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446409757601.000
sky.clouds.kubernetes.Kubernetes._get_feasible_launchable_resources, B:  1734446410329253.250
sky.clouds.kubernetes.Kubernetes._get_feasible_launchable_resources, E:  1734446410337647.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446410337653.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446410337664.000
sky.clouds.gcp.GCP._get_feasible_launchable_resources, B:  1734446410337706.000
sky.clouds.gcp.GCP._get_feasible_launchable_resources, E:  1734446410634607.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446410634616.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446410683903.000
sky.clouds.azure.Azure._get_feasible_launchable_resources, B:  1734446410683945.000
sky.clouds.azure.Azure._get_feasible_launchable_resources, E:  1734446410696483.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446410696488.750
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446410696500.000
sky.clouds.cudo.Cudo._get_feasible_launchable_resources, B:  1734446410696539.000
sky.clouds.cudo.Cudo._get_feasible_launchable_resources, E:  1734446411555284.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446411555307.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446411555342.250
sky.clouds.aws.AWS._get_feasible_launchable_resources, B:  1734446411555458.000
sky.clouds.service_catalog.aws_catalog._fetch_and_apply_az_mapping, B:  1734446411561200.750
sky.clouds.service_catalog.aws_catalog._fetch_and_apply_az_mapping, E:  1734446412795312.250
sky.clouds.aws.AWS._get_feasible_launchable_resources, E:  1734446412810339.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446412810345.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446412831575.250
sky.clouds.runpod.RunPod._get_feasible_launchable_resources, B:  1734446412831607.000
sky.clouds.runpod.RunPod._get_feasible_launchable_resources, E:  1734446412834569.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446412834573.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446412839013.000
sky.clouds.paperspace.Paperspace._get_feasible_launchable_resources, B:  1734446412839048.000
sky.clouds.paperspace.Paperspace._get_feasible_launchable_resources, E:  1734446412841263.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446412841266.750
sky.optimizer._fill_in_launchable_resources, E:  1734446412842773.000
sky.optimizer.Optimizer.optimize, E:  1734446413068050.000

Looking at the individual providers, the slowest one is AWS. get_user_identities constantly takes >1 second in my 5 runs. Meanwhile, the _unsupported_features_for_resources of Kubernetes involves RPC calls to list Nodes, which would have unpredictable latency (in my test, I used a local Kind cluster on my laptop, so the network latency is stable and minimal).

According to above investigation, I'd like to break down this issue into several sub issues:

  1. make get_feasible_launchable_resources concurrent, which I think is the most crucial part as it affects the scalability in multi-cloud scenario;

  2. dive into the get_feasible_launchable_resources impl of each provider and investigate whether there are available optimizations, especially for AWS and K8S;

  3. (for long-term observability) replace the home-made timeline.Event with trace standards of OpenTelemetry, so that the performance can be analyzed and tracked more easily.
    UPDATE(2024-12-20): My previous understanding is wrong, just read the standards timeline.py follows and found it can be visualized by various tools. It is nice and we really need to document it in our dev guide~

@concretevitamin @Michaelvll would you please kindly review this approach?

@Michaelvll
Copy link
Collaborator

That sounds great @aylei! Thanks for sharing the detailed analysis. Let's work on the proposals you mentioned. : )

@aylei
Copy link
Contributor

aylei commented Dec 18, 2024

That sounds great @aylei! Thanks for sharing the detailed analysis. Let's work on the proposals you mentioned. : )

Thanks @Michaelvll! I will push forward.

@aylei
Copy link
Contributor

aylei commented Dec 20, 2024

With #4483 close to merge, I'd like to share my thoughts on the aws and k8s part.

As far as I understand, in order to build an accurate optimizer table, AWS identity and K8S node info are necessary. Since they are now fetched via RPC in parallel, let's focus on the slower AWS call (which is the bottle in parallelization, cost ~1s). An natural idea is to cache the result between sky launch calls in sqlite or local filesystem. However, I think there is no reasonable cache key option. A possible solution is to hash the output of aws configure list as the cache key:

$ aws configure list
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile                <not set>             None    None
access_key     ****************ISWR shared-credentials-file
secret_key     ****************5EEV shared-credentials-file
    region                us-west-2      config-file    ~/.aws/config

$ aws configure list help
NAME
       list -

DESCRIPTION
       Lists the profile, access key, secret key, and region configuration
       information used for the specified profile. For each configuration
       item, it shows the value, where the configuration value was retrieved,
       and the configuration variable name.

       For example, if you provide the AWS region in an environment variable,
       this command shows you the name of the region you've configured, that
       this value came from an environment variable, and the name of the
       environment variable.

       For temporary credential methods such as roles and IAM Identity Center,
       this command displays the temporarily cached access key and secret
       access key is displayed.

This works since changes of aws credentials would finally reflected in the output and the last 8 letters of the key displayed indicate rare collision rate. However, this is not an authoritative solution and I don't think we can expect the semantics of this command to be stable. Personally I think the current approach is fine and the above optimization does not payoff.

I think I am stuck here, what do you recommend? @Michaelvll Any thoughts would be highly appreciated!

@aylei
Copy link
Contributor

aylei commented Dec 20, 2024

As for k8s, node info can be cached with a TTL, however, since node info is highly dynamic (a hypothetical but reasonable case is when sky launch does not find any node fit, the user may modify labels of existing nodes or add a new node to the cluster and try again). Therefore, a command-line-flag to refresh the cache would be a must-have, by which the added mental burden would beat the performance gain in my point of view.

@Michaelvll
Copy link
Collaborator

Hi @aylei, thanks for the analysis. Actually, it seems like using the aws configure list is not that bad idea, as get-caller-identity is actually on the critical path of many of SkyPilot operations, such as sky status, sky exec, if the caching can save us 1s on many of the operations, it might be good to try. : )

@aylei
Copy link
Contributor

aylei commented Dec 21, 2024

Hi @aylei, thanks for the analysis. Actually, it seems like using the aws configure list is not that bad idea, as get-caller-identity is actually on the critical path of many of SkyPilot operations, such as sky status, sky exec, if the caching can save us 1s on many of the operations, it might be good to try. : )

Sounds good! I will investigate it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants