Faster tests #438

Merged: 6 commits merged into main from fastertests on Oct 9, 2020
Conversation

@dzuelke (Contributor) commented Oct 9, 2020

This speeds up the test suite from the previous ~14 minutes to under 6 minutes by splitting the tests into many smaller files, which we can run at higher concurrency (but not too much, or we run into API rate limits). Concurrency is currently tuned to 16 processes, so the slowest test file (ci_spec.rb) sets the upper limit of the total runtime and everything else just fits into the other 15 buckets.

No tests have actually changed, except for:

  1. New Relic, which is no longer tested against all PHP versions: what we're really testing is its config and the suppression of agent startup and telemetry during builds and dyno startup, none of which depends on the PHP version;
  2. the FPM slowlog/terminate tests for PHP 7.4, which by accident previously ran only against Apache, not Nginx.

The Heroku CLI install is now done before test startup to overwrite the older Travis-provided version, which avoids a ~30 second update; auto-update also gets disabled, so the CLI doesn't have to check for updates, since the install is brand new.

GUS-W-8192252

We can achieve better concurrency with smaller files, as they can be balanced over more processes.
For New Relic, we only need to test the behavior of detection and startup suppression, none of which depends on the PHP version, so we run that once, against the latest PHP release, rather than for each PHP series.
The CI test is the slowest (about five minutes). We reserve one process for that, then two for each PHP series (up to seven series, on heroku-16), plus one spare so parallel_tests can distribute the remaining files well; that adds up to the 16 processes (1 + 2 × 7 + 1).

Too many processes and we'll quickly run into API rate limits.
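For illustration only, a minimal sketch of what the test-run step in .travis.yml could look like with this setup, assuming the parallel_tests gem; the spec path and options are placeholders, not necessarily the exact lines from this PR:

script:
  # run the spec files across 16 concurrent worker processes instead of one serial run,
  # so the many small files can be balanced around the long-running ci_spec.rb
  - bundle exec parallel_rspec -n 16 test/spec/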
Travis has the CLI pre-installed, but it's not always quite up-to-date, leading to hundreds of warning messages from the CLI that an update is available.

Hatchet itself won't install the CLI if it's already present, so we do it by hand here, using the non-Ubuntu installer: it overwrites the Travis-provided version in /usr/local/bin (the Ubuntu installer's version would land in /usr/bin and wouldn't take precedence), and it's much faster.

This is also much faster than doing a "heroku update", which downloads half of the Internet in the form of NPM metadata.

We're also setting an env var to disable auto updates for the CLI, since we just did a fresh install of the latest release.
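For the CLI part, a minimal sketch of the corresponding .travis.yml pieces, assuming Heroku's standalone install script; the before_install phase and the installer URL are assumptions here, not confirmed details of this PR:

before_install:
  # the standalone installer writes to /usr/local/bin, overwriting the older
  # Travis-provided CLI, and is much quicker than running "heroku update"
  - curl https://cli-assets.heroku.com/install.sh | sh
env:
  global:
    # the CLI above is a fresh install of the latest release, so skip its self-update check
    - HEROKU_DISABLE_AUTOUPDATE=1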
@dzuelke requested a review from a team as a code owner on October 9, 2020 04:08
@edmorley (Member) left a comment

Awesome speed up! :-D

.travis.yml
after_script:
  # clean up the Heroku test apps that Hatchet created during the run
  - bundle exec hatchet destroy --all
  # print the recorded per-spec-file runtimes, for tuning how tests are grouped
  - cat parallel_runtime_rspec.log | grep -E '^test/spec/[a-z0-9_/\.-]+\.rb:[0-9]+\.[0-9]+$' | sort
env:
  global:
    - HEROKU_DISABLE_AUTOUPDATE=1
    # let Hatchet retry failed deploys a few times to paper over platform flakiness
    - HATCHET_RETRIES=3
Member
I wonder if we should all try lowering HATCHET_RETRIES (or disabling retries entirely) for a bit, in order to flush out remaining Hatchet/platform flakiness? Otherwise many of the failures will go unnoticed in green CI runs (unless we happen to manually read every log). Yes, it will mean having to retry tests from time to time, but given that the end-to-end test time is much lower now, that's less of a burden.

Contributor Author
Hmh, maybe. @schneems ?

Contributor
Since I started tracking flappy runs, I've already caught and squashed a number of issues even with the retry. You can see the tracking issue heroku/hatchet#120.

My ultimate goal with reducing flappiness is to help with productivity. It's frustrating to make a tiny doc commit or version bump, wait for tests...only to have them fail and need to be restarted. If you want to drop it temporarily on your repo for maybe a few weeks or a month, then feel free to go for it, but there are things like network glitches and other general distributed computing concerns that can never be truly sussed out and retries can help there.

I guess I would say: Go for it if you want. Know that I don't think it's a reasonable long term goal to disable all retries. I wouldn't want to mandate we reduce this value across the team (at least for now).

Member
Yeah, I wasn't saying we should mandate it. It was just that https://salesforce.quip.com/B8a9A6ZP6APx was scoring us based on how many intermittent tests we've reported, and I've not reported many, not because I haven't gotten around to it, but because I haven't seen many. I wondered if the retries might be the reason why :-)

Contributor
If you've got the energy for it, I would be interested to see how much of a difference it makes. It's a sound theory and would be interesting to share the results with the program after a period of trying it.

@dzuelke merged commit 5c50dea into main on Oct 9, 2020
@dzuelke deleted the fastertests branch on October 9, 2020 12:54
@edmorley (Member) commented Dec 28, 2020

@dzuelke I've just discovered that there are actually two similarly named gems for splitting test runs for improved concurrency, parallel_tests and parallel_split_test, with the latter supporting splitting at the test rather than file level. Did you know about the latter? (Just wondering if there are issues that prevented using it here, which would have saved needing to split up the files so much)

@dzuelke (Contributor Author) commented Jan 6, 2021

Yeah @edmorley.

The reason I am not using the per-test one is that I need to control the state shared by several tests together: I create and deploy an app, then run many test cases against it, and a lot of the tests are generated by loops etc.

Splitting at the file level makes it easier to control how tests are grouped, which matters because the tests already run into Heroku API rate limits (which are already bumped up for that Hatchet account) at the wrong parallelization level.
