sky launch takes ~5s to print out optimizer table, which is slow #3159
Comments
Still relevant in a many-cloud setting. With the latest #3280 merged and 6 clouds enabled,
@concretevitamin If there is no active development on this issue, I'd like to take it up.
I just reproduced this issue with 7 providers enabled on my test branch aylei@ffdec30 (based off f0ebf13), and I have some thoughts on the solution; correct me if I'm wrong:
> sky check
...
🎉 Enabled clouds 🎉
✔ AWS
✔ Azure
✔ Cudo
✔ GCP
✔ Kubernetes
✔ Paperspace
✔ RunPod
> SKYPILOT_TIMELINE_FILE_PATH=timeline.json sky launch --gpus H100:8
Considered resources (1 node):
-------------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
-------------------------------------------------------------------------------------------------------
RunPod 8x_H100_SECURE 128 640 H100:8 CA 35.92 ✔
Paperspace H100x8 128 640 H100:8 East Coast (NY2) 47.60
GCP a3-highgpu-8g 208 1872 H100:8 us-central1-a 87.83
AWS p5.48xlarge 192 2048 H100:8 us-east-1 98.32
-------------------------------------------------------------------------------------------------------
Aborted!
> head -15 profile_stats.txt
2225298 function calls (2193864 primitive calls) in 3.466 seconds
Ordered by: cumulative time
List reduced from 8052 to 7609 due to restriction <'sky'>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 3.466 3.466 /Users/aylei/repo/skypilot-org/skypilot/sky/cli.py:557(_launch_with_confirm)
38/2 0.000 0.000 3.315 1.658 /Users/aylei/repo/skypilot-org/skypilot/sky/utils/common_utils.py:376(_record)
1 0.000 0.000 3.313 3.313 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:108(optimize)
1 0.000 0.000 3.311 3.311 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:993(_optimize_dag)
1 0.000 0.000 3.302 3.302 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:240(_estimate_nodes_cost_or_time)
1 0.000 0.000 3.085 3.085 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:1257(_fill_in_launchable_resources)
7 0.000 0.000 3.008 0.430 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/cloud.py:371(get_feasible_launchable_resources)
136 0.000 0.000 2.720 0.020 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/service_catalog/__init__.py:21(_map_clouds_catalog)
6 0.000 0.000 2.426 0.404 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/service_catalog/__init__.py:247(get_instance_type_for_accelerator)
According to the stats (profile_stats.txt), most of the time is spent in get_feasible_launchable_resources across the enabled clouds. The timeline gives a per-cloud breakdown:
> cat timeline.json | jq -r '.traceEvents.[] | "\(.name), \(.ph): \(.ts)"' | grep -v get_df
sky.backends.backend_utils.refresh_cluster_status_handle, B: 1734446409752685.000
sky.backends.backend_utils.refresh_cluster_status_handle, E: 1734446409754911.000
sky.optimizer.Optimizer.optimize, B: 1734446409755096.000
sky.optimizer._fill_in_launchable_resources, B: 1734446409757335.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B: 1734446409757601.000
sky.clouds.kubernetes.Kubernetes._get_feasible_launchable_resources, B: 1734446410329253.250
sky.clouds.kubernetes.Kubernetes._get_feasible_launchable_resources, E: 1734446410337647.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E: 1734446410337653.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B: 1734446410337664.000
sky.clouds.gcp.GCP._get_feasible_launchable_resources, B: 1734446410337706.000
sky.clouds.gcp.GCP._get_feasible_launchable_resources, E: 1734446410634607.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E: 1734446410634616.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B: 1734446410683903.000
sky.clouds.azure.Azure._get_feasible_launchable_resources, B: 1734446410683945.000
sky.clouds.azure.Azure._get_feasible_launchable_resources, E: 1734446410696483.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E: 1734446410696488.750
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B: 1734446410696500.000
sky.clouds.cudo.Cudo._get_feasible_launchable_resources, B: 1734446410696539.000
sky.clouds.cudo.Cudo._get_feasible_launchable_resources, E: 1734446411555284.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E: 1734446411555307.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B: 1734446411555342.250
sky.clouds.aws.AWS._get_feasible_launchable_resources, B: 1734446411555458.000
sky.clouds.service_catalog.aws_catalog._fetch_and_apply_az_mapping, B: 1734446411561200.750
sky.clouds.service_catalog.aws_catalog._fetch_and_apply_az_mapping, E: 1734446412795312.250
sky.clouds.aws.AWS._get_feasible_launchable_resources, E: 1734446412810339.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E: 1734446412810345.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B: 1734446412831575.250
sky.clouds.runpod.RunPod._get_feasible_launchable_resources, B: 1734446412831607.000
sky.clouds.runpod.RunPod._get_feasible_launchable_resources, E: 1734446412834569.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E: 1734446412834573.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B: 1734446412839013.000
sky.clouds.paperspace.Paperspace._get_feasible_launchable_resources, B: 1734446412839048.000
sky.clouds.paperspace.Paperspace._get_feasible_launchable_resources, E: 1734446412841263.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E: 1734446412841266.750
sky.optimizer._fill_in_launchable_resources, E: 1734446412842773.000
sky.optimizer.Optimizer.optimize, E: 1734446413068050.000
Looking at the individual providers, the slowest one is AWS.
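For reference, a minimal sketch of computing such a per-call breakdown from the trace. It assumes timeline.json follows the Chrome trace event format shown above (B/E phase pairs with microsecond timestamps); the function name and file path are illustrative, not part of SkyPilot:

```python
# Illustrative helper: pair B/E events from timeline.json (Chrome trace
# format, microsecond timestamps) and print the slowest spans by name.
import json
from collections import defaultdict

def slowest_spans(path='timeline.json', top=5):
    with open(path) as f:
        events = json.load(f)['traceEvents']
    starts = defaultdict(list)
    total_us = defaultdict(float)
    for ev in events:
        name, phase = ev.get('name'), ev.get('ph')
        if phase == 'B':
            starts[name].append(ev['ts'])
        elif phase == 'E' and starts[name]:
            total_us[name] += ev['ts'] - starts[name].pop()
    # Report total seconds spent per span name, slowest first.
    for name, us in sorted(total_us.items(), key=lambda kv: -kv[1])[:top]:
        print(f'{us / 1e6:8.3f}s  {name}')

if __name__ == '__main__':
    slowest_spans()
```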
According to the above investigation, I'd like to break this issue down into several sub-issues:
@concretevitamin @Michaelvll, would you please kindly review this approach?
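As a side note for anyone reproducing the numbers: stats in the format of profile_stats.txt above can be generated with Python's built-in cProfile and pstats. A generic sketch, not necessarily how the file above was produced (the entry point is a placeholder):

```python
# Generic cProfile/pstats recipe that yields output like profile_stats.txt.
# 'entrypoint' is a placeholder for the code path under investigation.
import cProfile
import pstats

def entrypoint():
    pass  # e.g. call the launch code path here

cProfile.run('entrypoint()', 'profile.out')
stats = pstats.Stats('profile.out')
# Sort by cumulative time and restrict to frames whose path matches 'sky',
# mirroring the "restriction <'sky'>" note in the stats above.
stats.sort_stats('cumulative').print_stats('sky', 15)
```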
That sounds great @aylei! Thanks for sharing the detailed analysis. Let's work on the proposals you mentioned. :)
Thanks @Michaelvll! I will push forward.
With #4483 close to merge, I'd like to share my thoughts on the AWS and K8s parts. As far as I understand, building an accurate optimizer table requires the AWS identity and the K8s node info. Since they are now fetched via RPC in parallel, let's focus on the slower AWS call, which is the bottleneck in the parallelization and costs ~1s. A natural idea is to cache the result between invocations and invalidate the cache based on the output of a local command that displays the current credentials.
This works because changes to the AWS credentials would eventually be reflected in that output, and the last 8 characters of the displayed key make collisions rare. However, this is not an authoritative solution, and I don't think we can expect the semantics of this command to be stable. Personally, I think the current approach is fine and the above optimization does not pay off. I think I am stuck here; what do you recommend? @Michaelvll Any thoughts would be highly appreciated!
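A minimal sketch of the caching idea, assuming a hypothetical get_identity() callable for the slow remote call and a hash of the local AWS credentials file as the invalidation key; the names, paths, and TTL are illustrative and not SkyPilot's actual API:

```python
# Illustrative only: cache a slow identity lookup on disk and invalidate
# it when the local AWS credentials file changes (or the TTL expires).
import hashlib
import json
import os
import time

CRED_FILE = os.path.expanduser('~/.aws/credentials')      # assumed location
CACHE_FILE = os.path.expanduser('~/.cache/identity.json')  # hypothetical path

def _cred_fingerprint() -> str:
    try:
        with open(CRED_FILE, 'rb') as f:
            return hashlib.sha256(f.read()).hexdigest()
    except FileNotFoundError:
        return 'no-credentials'

def cached_identity(get_identity, ttl=3600):
    fp = _cred_fingerprint()
    try:
        with open(CACHE_FILE) as f:
            cache = json.load(f)
        if cache['fp'] == fp and time.time() - cache['ts'] < ttl:
            return cache['identity']
    except (OSError, ValueError, KeyError):
        pass
    identity = get_identity()  # the ~1s remote call seen in the trace above
    os.makedirs(os.path.dirname(CACHE_FILE), exist_ok=True)
    with open(CACHE_FILE, 'w') as f:
        json.dump({'fp': fp, 'ts': time.time(), 'identity': identity}, f)
    return identity
```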
As for K8s, node info can be cached with a TTL. However, since node info is highly dynamic (a hypothetical but reasonable case is when
Hi @aylei, thanks for the analysis. Actually, it seems like using the
Sounds good! I will investigate it.
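For completeness, a generic in-process TTL cache along the lines discussed for the node info; fetch_node_info() is a hypothetical placeholder and the 30s TTL is arbitrary, not what was ultimately implemented:

```python
# Generic TTL cache sketch for the node-info idea discussed above.
import time
from typing import Any, Callable, Optional, Tuple

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self._ttl = ttl_seconds
        self._entry: Optional[Tuple[float, Any]] = None

    def get(self, fetch: Callable[[], Any]) -> Any:
        now = time.monotonic()
        if self._entry is not None and now - self._entry[0] < self._ttl:
            return self._entry[1]
        value = fetch()  # e.g. the slow Kubernetes node listing
        self._entry = (now, value)
        return value

node_info_cache = TTLCache(ttl_seconds=30)
# Usage (fetch_node_info is hypothetical): node_info_cache.get(fetch_node_info)
```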
sky launch takes ~5s to print out optimizer table, which is slow
Test: sky launch at bfee932; press ctrl-c as soon as the confirmation prompt shows.
Top few lines:
Looks like at least a few things to optimize: