
#527 autoscaler test #563

Merged · 4 commits from #527_autoscaler_test into main on Dec 10, 2024

Conversation

@mshannon-sil (Collaborator) commented Dec 5, 2024

This PR moves the E2E tests to the autoscaler. The testing step took 22m 43s (24m 51s total), compared to 10m 4s (12m 0s total) if not using the autoscaler. Initial startup time is really only a factor for the first NMT job and the first SMT job, since the autoscaler can reuse instances.

Also, the environment variables I added have CLEARML in all caps to distinguish them from official ClearML environment variables, which ClearML prefixes with its own name. But I can change the environment variable names to match, or to something else entirely, if that's preferable.



@mshannon-sil mshannon-sil self-assigned this Dec 5, 2024
@mshannon-sil mshannon-sil added the ci label Dec 5, 2024
@mshannon-sil mshannon-sil linked an issue Dec 5, 2024 that may be closed by this pull request
@ddaspit (Contributor) left a comment

Why is there such a large difference in runtime?

Reviewed all commit messages.
Reviewable status: 0 of 2 files reviewed, all discussions resolved (waiting on @johnml1135)

@Enkidu93 (Collaborator) commented Dec 6, 2024

> Why is there such a large difference in runtime?

Is it because it's not running on John's smaller gpus now? What kind of gpus does the autoscaler spin up, Matthew?

@mshannon-sil (Collaborator, Author)

We talked about it in our standup; recording here that the reason is the startup cost associated with launching a GCP instance for the first SMT job as well as for the first NMT job. The GPU type that the autoscaler spins up for the NMT jobs is an A100 40GB.

@johnml1135 (Collaborator)

Just saying it out loud:

  • Cost 1: We are using the spot price of around $1.50/hour for the A100 (and less for the CPU-only)? Therefore we are paying < $0.40 every time we run these tests.
  • Cost 2: When we run the E2E tests, it will take 22 min rather than 10 min.
  • Benefit 1: We will always be testing the autoscaler to make sure everything is working properly.
  • Compromise?: Should we run one set of models (say the SMT models) on the autoscaler and the NMT on my GPU? That would save some time and most of the cost per run, as well as continue to verify that the autoscaler is working properly. The only meaningful difference would be the A100s, and they are the same ones we are using in the AQUA server. I think we can get near 100% of the value with < 50% of the cost.
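A quick sketch of the arithmetic behind these figures (the $1.50/hr spot price and 22-minute run time are quoted in this thread; the GPU-active fraction is an assumption, since the "< $0.40" figure implies the A100 is not billed for the whole run):

```python
# Rough per-run cost check for the figures quoted above.
SPOT_PRICE_PER_HOUR = 1.50   # A100 spot price quoted in this thread ($/hr)
RUN_MINUTES = 22             # E2E test time on the autoscaler

def run_cost(gpu_active_fraction: float) -> float:
    """Cost of one test run, given the fraction of the run the GPU is billed."""
    gpu_hours = RUN_MINUTES / 60 * gpu_active_fraction
    return gpu_hours * SPOT_PRICE_PER_HOUR

# Billing the full 22 minutes would be ~$0.55, so the quoted "< $0.40"
# implies the GPU is active for under ~16 of those minutes.
print(run_cost(1.0))   # upper bound: entire run on the GPU
print(run_cost(0.7))   # assumed ~15.4 GPU-minutes, under the $0.40 figure
```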

@Enkidu93 (Collaborator) commented Dec 9, 2024

> Just saying it out loud:
>
> • Cost 1: We are using the spot price of around $1.50/hour for the A100 (and less for the CPU-only)? Therefore we are paying < $0.40 every time we run these tests.
> • Cost 2: When we run the E2E tests, it will take 22 min rather than 10 min.
> • Benefit 1: We will always be testing the autoscaler to make sure everything is working properly.
> • Compromise?: Should we run one set of models (say the SMT models) on the autoscaler and the NMT on my GPU? That would save some time and most of the cost per run, as well as continue to verify that the autoscaler is working properly. The only meaningful difference would be the A100s, and they are the same ones we are using in the AQUA server. I think we can get near 100% of the value with < 50% of the cost.

(If I could add)

  • Potential Benefit 2: It may be 22 min rather than 10 min, but it ought to be consistent in a way the current set-up isn't. I.e., if there's a lot queued up (for some reason), we shouldn't have added wait times for jobs to finish. This might also help with some of the flakiness we've seen with, for example, the queue multiple E2E test.

@johnml1135 (Collaborator)

Point taken. There is only one GPU available for running these tests (my 3090). If, on the other hand, we use the AQUA server for the CPU jobs and the autoscaler for the GPU jobs, we can still save some time and not run into the bottlenecks. We would not save as much money, but honestly, I am more desirous of the times being short than the cost being less.

In October, we had 27 commits to master (a pretty high month). Extrapolating that, we could spend 27 × 12 × $0.20 (assuming half the time spent using GPUs) = $65 in total GPU cost for running these tests per year, hardly breaking the bank.
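The annual extrapolation above can be checked directly (the 27-commit count and the "< $0.40 per run" figure come from this thread; the 50% GPU-time fraction is the assumption stated there):

```python
COMMITS_PER_MONTH = 27    # October count quoted above
COST_PER_RUN = 0.40       # upper-bound per-run cost from this thread ($)
GPU_TIME_FRACTION = 0.5   # assumption: half the run time uses GPUs

# 27 commits/month * 12 months * $0.20 effective GPU cost per run
annual = COMMITS_PER_MONTH * 12 * COST_PER_RUN * GPU_TIME_FRACTION
print(f"${annual:.2f} per year")  # ~$65, matching the figure above
```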

@Enkidu93 (Collaborator) left a comment

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@mshannon-sil (Collaborator, Author) left a comment

If we run the CPU jobs on the AQuA server, might we still run into bottlenecks if the AQuA server is full?

Also, the CPU jobs are much cheaper, since I'm running them on an e2-standard-32 machine type rather than an accelerator-optimized machine type. It currently has a spot price of $0.4607/hr (and $1.3147/hr on demand).
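For context on those two prices (both quoted above; the percentage is just derived arithmetic):

```python
# Spot vs. on-demand pricing for the e2-standard-32 figures quoted above.
SPOT = 0.4607       # $/hr, spot price from this thread
ON_DEMAND = 1.3147  # $/hr, on-demand price from this thread

discount = 1 - SPOT / ON_DEMAND
print(f"spot is {discount:.0%} cheaper than on-demand")  # roughly 65%
```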

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@johnml1135 (Collaborator)

The differences are likely less than the cost of us talking about it. Let's just do it.

@johnml1135 johnml1135 merged commit a179bb1 into main Dec 10, 2024
4 checks passed
@johnml1135 johnml1135 deleted the #527_autoscaler_test branch December 10, 2024 14:28
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

Run E2E tests with autoscaler and autoscaler.cpu_only
4 participants