
#527 autoscaler test #563

Merged · 4 commits from #527_autoscaler_test into main on Dec 10, 2024

Conversation

@mshannon-sil (Collaborator) commented Dec 5, 2024

This PR moves the E2E tests to the autoscaler. The testing step took 22m 43s (24m 51s total), compared to 10m 4s (12m 0s total) if not using the autoscaler. Initial startup time is really only a factor for the first NMT job and the first SMT job, since the autoscaler can reuse instances.

Also, the environment variables I added have CLEARML in all caps to distinguish them from official ClearML environment variables, which ClearML prefixes with its own name. But I can change the environment variable names to match, or to something else entirely, if that's preferable.



@mshannon-sil mshannon-sil self-assigned this Dec 5, 2024
@mshannon-sil mshannon-sil added the ci label Dec 5, 2024
@mshannon-sil mshannon-sil linked an issue Dec 5, 2024 that may be closed by this pull request
@ddaspit (Contributor) left a comment

Why is there such a large difference in runtime?

Reviewed all commit messages.
Reviewable status: 0 of 2 files reviewed, all discussions resolved (waiting on @johnml1135)

@Enkidu93 (Collaborator) commented Dec 6, 2024

> Why is there such a large difference in runtime?

Is it because it's not running on John's smaller gpus now? What kind of gpus does the autoscaler spin up, Matthew?

@mshannon-sil (Collaborator, Author)

We talked about it in our standup; recording here that the reason is the startup cost associated with launching a GCP instance for the first SMT job as well as for the first NMT job. The GPU type that the autoscaler spins up for the NMT jobs is an A100 40GB.

@johnml1135 (Collaborator)

Just saying it out loud:

  • Cost 1: We are using the spot price of around $1.50/hour for the A100 (and less for the CPU-only)? Therefore we are paying < $0.40 every time we run these tests.
  • Cost 2: When we run the E2E tests, it will take 22 min rather than 10 min.
  • Benefit 1: We will always be testing the autoscaler to make sure everything is working properly.
  • Compromise?: Should we run one set of models (say the SMT models) on the autoscaler and the NMT on my GPU? That would save some time and most of the cost per run, as well as continue to verify that the autoscaler is working properly. The only meaningful difference would be the A100s, and they are the same ones we are using in the AQUA server. I think we can get near 100% of the value with < 50% of the cost.
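A quick sketch of the arithmetic behind these figures (the $1.50/hr spot price and 22-minute run time are quoted in this thread; the GPU-active fraction is an assumption, since the "< $0.40" figure implies the A100 is not billed for the whole run):

```python
# Rough per-run cost check for the figures quoted above.
SPOT_PRICE_PER_HOUR = 1.50   # A100 spot price quoted in this thread ($/hr)
RUN_MINUTES = 22             # E2E test time on the autoscaler

def run_cost(gpu_active_fraction: float) -> float:
    """Cost of one test run, given the fraction of the run the GPU is billed."""
    gpu_hours = RUN_MINUTES / 60 * gpu_active_fraction
    return gpu_hours * SPOT_PRICE_PER_HOUR

# Billing the full 22 minutes would be ~$0.55, so the quoted "< $0.40"
# implies the GPU is active for under ~16 of those minutes.
print(run_cost(1.0))   # upper bound: entire run on the GPU
print(run_cost(0.7))   # assumed ~15.4 GPU-minutes, under the $0.40 figure
```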

@Enkidu93 (Collaborator) commented Dec 9, 2024

> Just saying it out loud:
>
> • Cost 1: We are using the spot price of around $1.50/hour for the A100 (and less for the CPU-only)? Therefore we are paying < $0.40 every time we run these tests.
> • Cost 2: When we run the E2E tests, it will take 22 min rather than 10 min.
> • Benefit 1: We will always be testing the autoscaler to make sure everything is working properly.
> • Compromise?: Should we run one set of models (say the SMT models) on the autoscaler and the NMT on my GPU? That would save some time and most of the cost per run, as well as continue to verify that the autoscaler is working properly. The only meaningful difference would be the A100s, and they are the same ones we are using in the AQUA server. I think we can get near 100% of the value with < 50% of the cost.

(If I could add)

  • Potential Benefit 2: It may be 22 min rather than 10 min, but it ought to be consistent in a way the current set-up isn't. I.e., if there's a lot queued up (for some reason), we shouldn't have added wait times for jobs to finish. This might also help with some of the flakiness we've seen with, for example, the queue multiple E2E test.

@johnml1135 (Collaborator)

Point taken. There is only one GPU available for running these tests (my 3090). If, on the other hand, we use the AQUA server for the CPU jobs and the autoscaler for the GPU jobs, we can still save some time and not run into the bottlenecks. We would not save as much money, but honestly, I am more desirous of the times being short than the cost being less.

In October, we had 27 commits to master (a pretty high month). Extrapolating that, we could spend 27 × 12 × $0.20 (assuming half the time spent using GPUs) = $65 in total GPU cost for running these tests per year, hardly breaking the bank.
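The annual extrapolation above can be checked directly (the 27-commit count and the "< $0.40 per run" figure come from this thread; the 50% GPU-time fraction is the assumption stated there):

```python
COMMITS_PER_MONTH = 27    # October count quoted above
COST_PER_RUN = 0.40       # upper-bound per-run cost from this thread ($)
GPU_TIME_FRACTION = 0.5   # assumption: half the run time uses GPUs

# 27 commits/month * 12 months * $0.20 effective GPU cost per run
annual = COMMITS_PER_MONTH * 12 * COST_PER_RUN * GPU_TIME_FRACTION
print(f"${annual:.2f} per year")  # ~$65, matching the figure above
```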

@Enkidu93 (Collaborator) left a comment

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@mshannon-sil (Collaborator, Author) left a comment

If we run the CPU jobs on the AQuA server, might we still run into bottlenecks if the AQuA server is full?

Also, the CPU jobs are much cheaper, since I'm running them on an e2-standard-32 machine type rather than an accelerator-optimized machine type. It currently has a spot price of $0.4607/hr (and $1.3147/hr on demand).
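For context on those two prices (both quoted above; the percentage is just derived arithmetic):

```python
# Spot vs. on-demand pricing for the e2-standard-32 figures quoted above.
SPOT = 0.4607       # $/hr, spot price from this thread
ON_DEMAND = 1.3147  # $/hr, on-demand price from this thread

discount = 1 - SPOT / ON_DEMAND
print(f"spot is {discount:.0%} cheaper than on-demand")  # roughly 65%
```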

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@johnml1135 (Collaborator)

The differences are likely less than the cost of us talking about it. Let's just do it.

@johnml1135 johnml1135 merged commit a179bb1 into main Dec 10, 2024
4 checks passed
@johnml1135 johnml1135 deleted the #527_autoscaler_test branch December 10, 2024 14:28
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

Run E2E tests with autoscaler and autoscaler.cpu_only
4 participants