`sky launch` takes ~5s to print out optimizer table, which is slow #3159

concretevitamin · 2024-02-14T21:39:35Z

sky launch takes ~5s to print out optimizer table, which is slow

This doesn't include the catalog file refresh, which adds a few more seconds.

Test:

Based off master fe14520
Branch/commit adding a simple cProfile block around sky launch: bfee932
How to get results: type sky launch, press ctrl-c as soon as the confirmation prompt shows
Results: https://gist.github.com/concretevitamin/8072424ccb90543bac0df159e6222b78
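
For readers who want to reproduce this kind of measurement, a profiling block could look roughly like the sketch below. This is not the code from the commit above: `fake_launch` and `profile_call` are hypothetical names, and only the standard-library `cProfile` and `pstats` modules are used. The string passed to `print_stats` plays the same role as the `<'sky'>` restriction shown in the results.

```python
import cProfile
import io
import pstats


def fake_launch():
    """Hypothetical stand-in for the code path being profiled."""
    return sum(i * i for i in range(100_000))


def profile_call(func, restriction=None):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func()
    finally:
        # Always stop profiling, even if func raises (e.g. on ctrl-c).
        profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf).sort_stats('cumulative')
    if restriction is None:
        stats.print_stats()
    else:
        # A string restriction is a regex matched against the printed
        # "filename:lineno(function)" column.
        stats.print_stats(restriction)
    return result, buf.getvalue()


if __name__ == '__main__':
    _, report = profile_call(fake_launch, restriction='fake_launch')
    print(report)
```

Sorting by cumulative time is what makes the call chain readable top-down, as in the listing that follows.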

Top few lines

         1215743 function calls (1189185 primitive calls) in 4.759 seconds

   Ordered by: cumulative time
   List reduced from 5569 to 331 due to restriction <'sky'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.758    4.758 /skypilot/sky/cli.py:805(_launch_with_confirm)
        1    0.000    0.000    4.351    4.351 /skypilot/sky/optimizer.py:103(optimize)
        1    0.000    0.000    4.351    4.351 /skypilot/sky/optimizer.py:934(_optimize_dag)
        1    0.001    0.001    4.275    4.275 /skypilot/sky/optimizer.py:231(_estimate_nodes_cost_or_time)
      646    0.002    0.000    2.648    0.004 /skypilot/sky/clouds/service_catalog/__init__.py:22(_map_clouds_catalog)
        1    0.000    0.000    1.929    1.929 /skypilot/sky/optimizer.py:1144(_fill_in_launchable_resources)
      169    0.001    0.000    1.676    0.010 /skypilot/sky/resources.py:944(get_reservations_available_resources)
      121    0.000    0.000    1.675    0.014 /skypilot/sky/clouds/gcp.py:936(get_reservations_available_resources)
      121    0.000    0.000    1.675    0.014 /skypilot/sky/clouds/utils/gcp_utils.py:121(list_reservations_for_instance_type_in_zone)
        1    0.000    0.000    1.674    1.674 /skypilot/sky/clouds/utils/gcp_utils.py:129(_list_reservations_for_instance_type)
        1    0.000    0.000    1.674    1.674 /skypilot/sky/utils/subprocess_utils.py:144(run_with_retries)
        1    0.001    0.001    1.674    1.674 /skypilot/sky/skylet/log_lib.py:129(run_with_log)
        1    0.000    0.000    1.657    1.657 /skypilot/sky/skylet/log_lib.py:105(process_subprocess_stream)
        1    0.000    0.000    1.640    1.640 /skypilot/sky/skylet/log_lib.py:53(_handle_io_stream)
        6    0.000    0.000    1.388    0.231 /skypilot/sky/clouds/cloud.py:321(get_feasible_launchable_resources)
        6    0.000    0.000    1.384    0.231 /skypilot/sky/clouds/service_catalog/__init__.py:180(get_default_instance_type)
        1    0.000    0.000    0.788    0.788 /skypilot/sky/clouds/aws.py:419(_get_feasible_launchable_resources)
        1    0.000    0.000    0.787    0.787 /skypilot/sky/clouds/aws.py:338(get_default_instance_type)
        1    0.000    0.000    0.752    0.752 /skypilot/sky/clouds/service_catalog/aws_catalog.py:215(get_default_instance_type)
      102    0.000    0.000    0.731    0.007 /skypilot/sky/clouds/service_catalog/aws_catalog.py:154(_get_df)
        1    0.009    0.009    0.731    0.731 /skypilot/sky/clouds/service_catalog/aws_catalog.py:93(_fetch_and_apply_az_mapping)
      292    0.020    0.000    0.681    0.002 /skypilot/sky/clouds/service_catalog/common.py:146(_get_instance_type)
        1    0.000    0.000    0.676    0.676 /skypilot/sky/clouds/aws.py:595(get_current_user_identity)
      168    0.000    0.000    0.626    0.004 /skypilot/sky/resources.py:890(get_cost)
      168    0.000    0.000    0.626    0.004 /skypilot/sky/clouds/service_catalog/__init__.py:152(get_hourly_cost)
      168    0.001    0.000    0.621    0.004 /skypilot/sky/clouds/service_catalog/common.py:236(get_hourly_cost_impl)
        6    0.000    0.000    0.541    0.090 /skypilot/sky/optimizer.py:1086(_make_launchables_for_valid_region_zones)
        1    0.000    0.000    0.524    0.524 /skypilot/sky/clouds/oci.py:300(_get_feasible_launchable_resources)
        1    0.000    0.000    0.523    0.523 /skypilot/sky/clouds/oci.py:167(get_default_instance_type)
      175    0.001    0.000    0.520    0.003 /skypilot/sky/resources.py:1090(copy)
      181    0.001    0.000    0.518    0.003 /skypilot/sky/resources.py:48(__init__)
        1    0.000    0.000    0.514    0.514 /skypilot/sky/clouds/service_catalog/oci_catalog.py:101(get_default_instance_type)
      121    0.000    0.000    0.507    0.004 /skypilot/sky/clouds/gcp.py:274(instance_type_to_hourly_cost)
      121    0.000    0.000    0.504    0.004 /skypilot/sky/clouds/service_catalog/gcp_catalog.py:214(get_hourly_cost)
       16    0.000    0.000    0.487    0.030 /skypilot/sky/clouds/service_catalog/oci_catalog.py:35(_get_df)
      181    0.000    0.000    0.446    0.002 /skypilot/sky/resources.py:553(_validate_and_set_region_zone)
      168    0.000    0.000    0.445    0.003 /skypilot/sky/clouds/service_catalog/__init__.py:127(validate_region_zone)
      168    0.001    0.000    0.440    0.003 /skypilot/sky/clouds/service_catalog/common.py:166(validate_region_zone_impl)
      146    0.000    0.000    0.436    0.003 /skypilot/sky/clouds/cloud.py:463(validate_region_zone)
      168    0.019    0.000    0.392    0.002 /skypilot/sky/clouds/service_catalog/common.py:352(_filter_region_zone)
...

Looks like at least a few things to optimize:

list_reservations_for_instance_type_in_zone
_fetch_and_apply_az_mapping
...


concretevitamin · 2024-03-07T23:42:29Z

Still relevant in a many-cloud setting. With latest #3280 merged and 6 clouds enabled, sky launch takes ~3.8s on my laptop.

aylei · 2024-12-17T10:33:06Z

@concretevitamin If there is no active development on this issue, I'd like to take it up.

aylei · 2024-12-17T15:23:46Z

I just reproduced this issue with 7 providers enabled on my test branch aylei@ffdec30 (based off f0ebf13) and have some thoughts on the solution. Correct me if I'm wrong:

> sky check
...
🎉 Enabled clouds 🎉
  ✔ AWS
  ✔ Azure
  ✔ Cudo
  ✔ GCP
  ✔ Kubernetes
  ✔ Paperspace
  ✔ RunPod

> SKYPILOT_TIMELINE_FILE_PATH=timeline.json sky launch --gpus H100:8
Considered resources (1 node):
-------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE         vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE        COST ($)   CHOSEN
-------------------------------------------------------------------------------------------------------
 RunPod       8x_H100_SECURE   128     640       H100:8         CA                 35.92         ✔
 Paperspace   H100x8           128     640       H100:8         East Coast (NY2)   47.60
 GCP          a3-highgpu-8g    208     1872      H100:8         us-central1-a      87.83
 AWS          p5.48xlarge      192     2048      H100:8         us-east-1          98.32
-------------------------------------------------------------------------------------------------------
Aborted!

> head -15 profile_stats.txt
         2225298 function calls (2193864 primitive calls) in 3.466 seconds

   Ordered by: cumulative time
   List reduced from 8052 to 7609 due to restriction <'sky'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    3.466    3.466 /Users/aylei/repo/skypilot-org/skypilot/sky/cli.py:557(_launch_with_confirm)
     38/2    0.000    0.000    3.315    1.658 /Users/aylei/repo/skypilot-org/skypilot/sky/utils/common_utils.py:376(_record)
        1    0.000    0.000    3.313    3.313 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:108(optimize)
        1    0.000    0.000    3.311    3.311 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:993(_optimize_dag)
        1    0.000    0.000    3.302    3.302 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:240(_estimate_nodes_cost_or_time)
        1    0.000    0.000    3.085    3.085 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:1257(_fill_in_launchable_resources)
        7    0.000    0.000    3.008    0.430 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/cloud.py:371(get_feasible_launchable_resources)
      136    0.000    0.000    2.720    0.020 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/service_catalog/__init__.py:21(_map_clouds_catalog)
        6    0.000    0.000    2.426    0.404 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/service_catalog/__init__.py:247(get_instance_type_for_accelerator)

According to the stats (profile_stats.txt), _fill_in_launchable_resources takes 3.085s out of 3.466s. Focusing on this function in the timeline events (timeline.json), the main bottleneck I've identified is that the get_feasible_launchable_resources method of each cloud is processed serially:

> cat timeline.json| jq -r '.traceEvents.[] | "\(.name), \(.ph): \(.ts)"' | grep -v get_df
sky.backends.backend_utils.refresh_cluster_status_handle, B:  1734446409752685.000
sky.backends.backend_utils.refresh_cluster_status_handle, E:  1734446409754911.000
sky.optimizer.Optimizer.optimize, B:  1734446409755096.000
sky.optimizer._fill_in_launchable_resources, B:  1734446409757335.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446409757601.000
sky.clouds.kubernetes.Kubernetes._get_feasible_launchable_resources, B:  1734446410329253.250
sky.clouds.kubernetes.Kubernetes._get_feasible_launchable_resources, E:  1734446410337647.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446410337653.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446410337664.000
sky.clouds.gcp.GCP._get_feasible_launchable_resources, B:  1734446410337706.000
sky.clouds.gcp.GCP._get_feasible_launchable_resources, E:  1734446410634607.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446410634616.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446410683903.000
sky.clouds.azure.Azure._get_feasible_launchable_resources, B:  1734446410683945.000
sky.clouds.azure.Azure._get_feasible_launchable_resources, E:  1734446410696483.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446410696488.750
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446410696500.000
sky.clouds.cudo.Cudo._get_feasible_launchable_resources, B:  1734446410696539.000
sky.clouds.cudo.Cudo._get_feasible_launchable_resources, E:  1734446411555284.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446411555307.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446411555342.250
sky.clouds.aws.AWS._get_feasible_launchable_resources, B:  1734446411555458.000
sky.clouds.service_catalog.aws_catalog._fetch_and_apply_az_mapping, B:  1734446411561200.750
sky.clouds.service_catalog.aws_catalog._fetch_and_apply_az_mapping, E:  1734446412795312.250
sky.clouds.aws.AWS._get_feasible_launchable_resources, E:  1734446412810339.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446412810345.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446412831575.250
sky.clouds.runpod.RunPod._get_feasible_launchable_resources, B:  1734446412831607.000
sky.clouds.runpod.RunPod._get_feasible_launchable_resources, E:  1734446412834569.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446412834573.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446412839013.000
sky.clouds.paperspace.Paperspace._get_feasible_launchable_resources, B:  1734446412839048.000
sky.clouds.paperspace.Paperspace._get_feasible_launchable_resources, E:  1734446412841263.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446412841266.750
sky.optimizer._fill_in_launchable_resources, E:  1734446412842773.000
sky.optimizer.Optimizer.optimize, E:  1734446413068050.000
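
The begin/end pairs above are easier to compare once they are turned into durations. The following is a minimal sketch, assuming `ts` is in microseconds (as in the Chrome Trace Event format) and that events with the same name nest properly; `span_durations` is a hypothetical helper, not part of SkyPilot. The sample events are copied from the output above.

```python
from collections import defaultdict


def span_durations(events):
    """Pair 'B' (begin) and 'E' (end) events into (name, seconds) spans.

    Each event is a dict with 'name', 'ph' and 'ts' keys; 'ts' is assumed
    to be in microseconds.
    """
    open_spans = defaultdict(list)  # name -> stack of begin timestamps
    spans = []
    for event in events:
        if event['ph'] == 'B':
            open_spans[event['name']].append(event['ts'])
        elif event['ph'] == 'E' and open_spans[event['name']]:
            begin = open_spans[event['name']].pop()
            spans.append((event['name'], (event['ts'] - begin) / 1e6))
    return spans


CUDO = 'sky.clouds.cudo.Cudo._get_feasible_launchable_resources'
AWS = 'sky.clouds.aws.AWS._get_feasible_launchable_resources'
AZ_MAPPING = 'sky.clouds.service_catalog.aws_catalog._fetch_and_apply_az_mapping'

# (name, phase, timestamp) triples copied from the timeline output.
RAW = [
    (CUDO, 'B', 1734446410696539.000),
    (CUDO, 'E', 1734446411555284.000),
    (AWS, 'B', 1734446411555458.000),
    (AZ_MAPPING, 'B', 1734446411561200.750),
    (AZ_MAPPING, 'E', 1734446412795312.250),
    (AWS, 'E', 1734446412810339.000),
]
EVENTS = [{'name': n, 'ph': p, 'ts': t} for n, p, t in RAW]

if __name__ == '__main__':
    for name, seconds in sorted(span_durations(EVENTS), key=lambda s: -s[1]):
        print(f'{seconds:8.3f}s  {name}')
```

On these events the AWS span is about 1.25s, of which _fetch_and_apply_az_mapping accounts for about 1.23s, and the Cudo span is about 0.86s. To run it on a full file, the event list can be loaded with `json.load(f)['traceEvents']`.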

Looking at the individual providers, the slowest one is AWS: get_user_identities consistently takes >1 second across my 5 runs. Meanwhile, the _unsupported_features_for_resources of Kubernetes involves RPC calls to list Nodes, which would have unpredictable latency (in my test I used a local Kind cluster on my laptop, so the network latency is stable and minimal).

Based on the above investigation, I'd like to break this issue down into several sub-issues:

make get_feasible_launchable_resources concurrent, which I think is the most crucial part as it affects the scalability in multi-cloud scenario;
dive into the get_feasible_launchable_resources implementation of each provider and investigate whether optimizations are available, especially for AWS and K8s;
~~(for long-term observability) replace the home-made timeline.Event with trace standards of OpenTelemetry, so that the performance can be analyzed and tracked more easily.~~
UPDATE(2024-12-20): My previous understanding was wrong. I just read the standard that timeline.py follows and found that its output can be visualized by various tools. It is nice, and we really need to document it in our dev guide~
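
To illustrate the first sub-issue, here is a minimal sketch of running the per-cloud lookups concurrently with a thread pool. It is only an illustration under stated assumptions, not the eventual implementation: `feasible_resources_for` and `query_all` are hypothetical names, and the per-cloud work is simulated with `time.sleep`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def feasible_resources_for(cloud, delay):
    """Hypothetical stand-in for one cloud's feasibility lookup."""
    time.sleep(delay)  # Simulates catalog reads or an RPC.
    return f'{cloud}: ok'


def query_all(clouds_with_delay):
    """Run every per-cloud lookup in its own thread, keeping input order."""
    if not clouds_with_delay:
        return []
    with ThreadPoolExecutor(max_workers=len(clouds_with_delay)) as pool:
        futures = [
            pool.submit(feasible_resources_for, cloud, delay)
            for cloud, delay in clouds_with_delay
        ]
        # result() re-raises any exception from the worker thread.
        return [future.result() for future in futures]


if __name__ == '__main__':
    clouds = [('AWS', 0.3), ('GCP', 0.1), ('Cudo', 0.2)]
    start = time.monotonic()
    print(query_all(clouds))
    print(f'elapsed: {time.monotonic() - start:.2f}s')
```

Threads help most when each lookup is dominated by waiting on a subprocess or the network; work that is CPU-bound in Python would gain less because of the GIL.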

@concretevitamin @Michaelvll would you please kindly review this approach?

Michaelvll · 2024-12-17T23:12:54Z

That sounds great @aylei! Thanks for sharing the detailed analysis. Let's work on the proposals you mentioned. : )

aylei · 2024-12-18T13:54:57Z

> That sounds great @aylei! Thanks for sharing the detailed analysis. Let's work on the proposals you mentioned. : )

Thanks @Michaelvll! I will push forward.

aylei · 2024-12-20T06:29:09Z

With #4483 close to merge, I'd like to share my thoughts on the aws and k8s part.

As far as I understand, AWS identity and K8s node info are both necessary to build an accurate optimizer table. Since they are now fetched via RPC in parallel, let's focus on the slower AWS call (which is the bottleneck after parallelization, costing ~1s). A natural idea is to cache the result between sky launch calls in SQLite or on the local filesystem. However, I think there is no obviously reasonable cache key. A possible solution is to hash the output of aws configure list and use it as the cache key:

$ aws configure list
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile                <not set>             None    None
access_key     ****************ISWR shared-credentials-file
secret_key     ****************5EEV shared-credentials-file
    region                us-west-2      config-file    ~/.aws/config

$ aws configure list help
NAME
       list -

DESCRIPTION
       Lists the profile, access key, secret key, and region configuration
       information used for the specified profile. For each configuration
       item, it shows the value, where the configuration value was retrieved,
       and the configuration variable name.

       For example, if you provide the AWS region in an environment variable,
       this command shows you the name of the region you've configured, that
       this value came from an environment variable, and the name of the
       environment variable.

       For temporary credential methods such as roles and IAM Identity Center,
       this command displays the temporarily cached access key and secret
       access key is displayed.

This works because changes to the AWS credentials would eventually be reflected in the output, and the last 8 characters of the keys displayed (4 each for the access key and the secret key) make collisions rare. However, this is not an authoritative solution, and I don't think we can expect the semantics of this command to be stable. Personally I think the current approach is fine and the above optimization does not pay off.
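
The cache-key idea can be sketched as follows. This is an illustration only: `cached_identity`, the JSON file layout, and the fake fetcher are assumptions made for the example, and the only real command involved is `aws configure list`, as shown above.

```python
import hashlib
import json
import pathlib
import subprocess
import tempfile


def aws_configure_list():
    """Return the raw output of `aws configure list` (requires the AWS CLI)."""
    return subprocess.run(['aws', 'configure', 'list'], check=True,
                          capture_output=True, text=True).stdout


def cached_identity(fetch_identity, cache_dir, config_output):
    """Return the identity cached for this config output, fetching on a miss."""
    key = hashlib.sha256(config_output.encode('utf-8')).hexdigest()
    cache_file = pathlib.Path(cache_dir) / f'{key}.json'
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    identity = fetch_identity()  # The slow call, e.g. a remote round trip.
    cache_file.write_text(json.dumps(identity))
    return identity


if __name__ == '__main__':
    calls = []

    def fake_fetch():
        # Hypothetical replacement for the real identity lookup.
        calls.append(1)
        return ['hypothetical-user-id']

    with tempfile.TemporaryDirectory() as tmp:
        cached_identity(fake_fetch, tmp, 'profile-a')
        cached_identity(fake_fetch, tmp, 'profile-a')  # Served from cache.
        cached_identity(fake_fetch, tmp, 'profile-b')  # New key, fetched again.
    print(f'fetches: {len(calls)}')
```

In real use the third argument would be `aws_configure_list()`. Note that running that command has its own cost, so the saving is the difference between it and the identity call.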

I think I am stuck here, what do you recommend? @Michaelvll Any thoughts would be highly appreciated!

aylei · 2024-12-20T13:35:20Z

As for K8s, node info can be cached with a TTL. However, node info is highly dynamic: a hypothetical but reasonable case is that sky launch does not find any fitting node, so the user modifies the labels of existing nodes or adds a new node to the cluster and tries again. Therefore, a command-line flag to refresh the cache would be a must-have, and in my view the added mental burden would outweigh the performance gain.
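
For concreteness, the TTL-plus-refresh trade-off described here can be sketched as below. `TTLCache` is a hypothetical helper, not existing SkyPilot code; the `refresh=True` argument corresponds to the command-line flag mentioned above.

```python
import time


class TTLCache:
    """Cache a single fetched value for ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # Injectable so tests can control time.
        self._value = None
        self._expires_at = None  # None means nothing is cached yet.

    def get(self, refresh=False):
        """Return the cached value, refetching if stale or if refresh is set."""
        now = self._clock()
        if refresh or self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


if __name__ == '__main__':
    # Hypothetical node listing; a real fetch would call the cluster API.
    cache = TTLCache(lambda: ['node-a', 'node-b'], ttl_seconds=30)
    print(cache.get())              # First call fetches.
    print(cache.get())              # Served from cache within the TTL.
    print(cache.get(refresh=True))  # Forced refetch, e.g. after relabeling.
```

Without the forced refresh, a user who just added a node would keep seeing the stale result until the TTL expires, which is the usability cost described above.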

Michaelvll · 2024-12-21T02:04:11Z

Hi @aylei, thanks for the analysis. Actually, using aws configure list does not seem like a bad idea: get-caller-identity is on the critical path of many SkyPilot operations, such as sky status and sky exec. If the caching can save us 1s on many of these operations, it might be worth trying. : )

aylei · 2024-12-21T13:19:14Z

> Hi @aylei, thanks for the analysis. Actually, using aws configure list does not seem like a bad idea: get-caller-identity is on the critical path of many SkyPilot operations, such as sky status and sky exec. If the caching can save us 1s on many of these operations, it might be worth trying. : )

Sounds good! I will investigate it.

concretevitamin added the help wanted Extra attention is needed label Feb 14, 2024

Michaelvll mentioned this issue Mar 5, 2024

[Core] Faster optimizer table by disabling reservation check #3280

Merged

5 tasks

Michaelvll closed this as completed in #3280 Mar 7, 2024

concretevitamin reopened this Mar 7, 2024

github-actions bot added the Stale label Jul 6, 2024

skypilot-org deleted a comment from github-actions bot Jul 6, 2024

Michaelvll removed the Stale label Jul 6, 2024

github-actions bot added the Stale label Nov 4, 2024

github-actions bot closed this as not planned Nov 15, 2024

Michaelvll reopened this Dec 12, 2024

github-actions bot removed the Stale label Dec 13, 2024

skypilot-org deleted a comment from github-actions bot Dec 16, 2024

aylei mentioned this issue Dec 18, 2024

[Core] make per-cloud catalog lookup parallel #4483

Open

5 tasks

aylei mentioned this issue Dec 20, 2024

[UX] minor optimizations for launch and introduce py-spy #4495

Open

5 tasks

aylei mentioned this issue Dec 25, 2024

[aws] cache user identity by 'aws configure list' #4507

Merged

5 tasks
