#527 autoscaler test #563
Conversation
Why is there such a large difference in runtime?
Reviewed all commit messages.
Reviewable status: 0 of 2 files reviewed, all discussions resolved (waiting on @johnml1135)
Is it because it's not running on John's smaller GPUs now? What kind of GPUs does the autoscaler spin up, Matthew?
We talked about it in our standup, recording here that the reason is that there is a startup cost associated with launching a GCP instance for the first SMT job as well as for the first NMT job. And the GPU type that the autoscaler spins up for the NMT jobs is an A100 40GB.
Just saying it out loud:
(If I could add)
Point taken. There is only one GPU available for running these tests (my 3090). If, on the other hand, we use the AQuA server for the CPU jobs and the autoscaler for the GPU jobs, we can still save some time and not run into the bottlenecks. We would not save as much money, but honestly, I am more desirous of the times being short than the cost being less. In October, we had 27 commits to master (a pretty high month). Extrapolating that, we could spend 27 × 12 × $0.20 (assuming half the time spent using GPUs) ≈ $65 in total GPU cost for running these tests per year, hardly breaking the bank.
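The extrapolation above can be checked with a quick calculation. The per-commit GPU cost of $0.20 is the commenter's assumption (half of each E2E run spent on GPUs); the rest follows directly:

```python
# Rough annual GPU-cost extrapolation from the comment above.
# Assumptions taken from the discussion: 27 commits to master per month
# (October's count), and roughly $0.20 of GPU time per commit.
commits_per_month = 27
months_per_year = 12
gpu_cost_per_commit = 0.20  # USD, assumed

annual_cost = commits_per_month * months_per_year * gpu_cost_per_commit
print(f"${annual_cost:.2f} per year")  # about $65/year
```

Even doubling the commit rate keeps the annual GPU cost well under $150, which supports the "hardly breaking the bank" conclusion.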
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)
If we run the CPU jobs on the AQuA server, might we still run into bottlenecks if the AQuA server is full?
Also, the CPU jobs are much cheaper, since I'm running them on an e2-standard-32 machine type rather than an accelerator optimized machine type. It currently has a spot price of $0.4607/hr (and $1.3147/hr on demand).
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)
The differences are likely less than the cost of us talking about it. Let's just do it.
This PR moves the E2E tests to the autoscaler. The testing step took 22m 43s (24m 51s total), compared to 10m 4s (12m 0s total) if not using the autoscaler. Initial startup time is really only a factor for the first NMT job and the first SMT job, since the autoscaler can reuse instances.
Also, the reason the environment variables I added have CLEARML in all caps was to distinguish that they're not official environment variables from ClearML, which uses ClearML to prepend its own environment variables. But I can change the environment variable names to match, or to something else entirely, if that's preferable.