Faster tests #438

Merged: 6 commits merged into main from fastertests on Oct 9, 2020
Conversation

@dzuelke (Contributor) commented Oct 9, 2020

This speeds up the test suite from the previous ~14 minutes to under 6 minutes by splitting the tests into many smaller files, which we can run at higher concurrency (but not too much, or we run into API rate limits). Concurrency is currently tuned to 16 processes, so the slowest test file (ci_spec.rb) sets the upper limit of the total runtime and everything else just fits into the other 15 buckets.

No tests have actually changed, except for:

  1. New Relic, which is no longer tested against all PHP versions: what we're really testing is its config and the suppression of agent startup and telemetry during builds and dyno startup, none of which depends on the PHP version;
  2. the FPM slowlog/terminate tests for PHP 7.4, which by accident previously ran only against Apache, not Nginx.

The Heroku CLI install is now done before test startup to overwrite the older Travis-provided version, which avoids a ~30 second update; auto-update also gets disabled, so the CLI doesn't have to check for updates, since the install is brand new.

GUS-W-8192252

We can achieve better concurrency with smaller files, as they can be balanced over more processes.
For New Relic, we only need to test the behavior of detection and startup suppression, none of which depends on the PHP version, so we run that once, against the latest PHP release, rather than for each PHP series.
The CI test is the slowest (about five minutes). We reserve one process for that, then two for each PHP series (up to seven series, on heroku-16), plus one spare so parallel_tests can distribute the remaining files well; that adds up to the 16 processes (1 + 2 × 7 + 1).

Too many processes and we'll quickly run into API rate limits.
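For illustration only, a minimal sketch of what the test-run step in .travis.yml could look like with this setup, assuming the parallel_tests gem; the spec path and options are placeholders, not necessarily the exact lines from this PR:

script:
  # run the spec files across 16 concurrent worker processes instead of one serial run,
  # so the many small files can be balanced around the long-running ci_spec.rb
  - bundle exec parallel_rspec -n 16 test/spec/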
Travis has the CLI pre-installed, but it's not always quite up-to-date, leading to hundreds of warning messages from the CLI that an update is available.

Hatchet itself won't install the CLI if it's already present, so we do it by hand here, using the non-Ubuntu installer: it overwrites the Travis-provided version in /usr/local/bin (the Ubuntu installer's version would land in /usr/bin and wouldn't take precedence), and it's much faster.

This is also much faster than doing a "heroku update", which downloads half of the Internet in the form of NPM metadata.

We're also setting an env var to disable auto updates for the CLI, since we just did a fresh install of the latest release.
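For the CLI part, a minimal sketch of the corresponding .travis.yml pieces, assuming Heroku's standalone install script; the before_install phase and the installer URL are assumptions here, not confirmed details of this PR:

before_install:
  # the standalone installer writes to /usr/local/bin, overwriting the older
  # Travis-provided CLI, and is much quicker than running "heroku update"
  - curl https://cli-assets.heroku.com/install.sh | sh
env:
  global:
    # the CLI above is a fresh install of the latest release, so skip its self-update check
    - HEROKU_DISABLE_AUTOUPDATE=1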
@dzuelke requested a review from a team as a code owner on October 9, 2020 04:08
@edmorley (Member) left a comment

Awesome speed up! :-D

.travis.yml
after_script:
  # clean up the Heroku test apps that Hatchet created during the run
  - bundle exec hatchet destroy --all
  # print the recorded per-spec-file runtimes, for tuning how tests are grouped
  - cat parallel_runtime_rspec.log | grep -E '^test/spec/[a-z0-9_/\.-]+\.rb:[0-9]+\.[0-9]+$' | sort
env:
  global:
    - HEROKU_DISABLE_AUTOUPDATE=1
    # let Hatchet retry failed deploys a few times to paper over platform flakiness
    - HATCHET_RETRIES=3
Member
I wonder if we should all try lowering HATCHET_RETRIES (or disabling retries entirely) for a bit, in order to flush out remaining Hatchet/platform flakiness? Otherwise many of the failures will go unnoticed in green CI runs (unless we happen to manually read every log). Yes, it will mean having to retry tests from time to time, but given that the end-to-end test time is much lower now, that's less of a burden.

Contributor Author
Hmh, maybe. @schneems ?

Contributor
Since I started tracking flappy runs, I've already caught and squashed a number of issues even with the retry. You can see the tracking issue heroku/hatchet#120.

My ultimate goal with reducing flappiness is to help with productivity. It's frustrating to make a tiny doc commit or version bump, wait for tests...only to have them fail and need to be restarted. If you want to drop it temporarily on your repo for maybe a few weeks or a month, then feel free to go for it, but there are things like network glitches and other general distributed computing concerns that can never be truly sussed out and retries can help there.

I guess I would say: Go for it if you want. Know that I don't think it's a reasonable long term goal to disable all retries. I wouldn't want to mandate we reduce this value across the team (at least for now).

Member
Yeah, I wasn't saying we should mandate it. It was just that https://salesforce.quip.com/B8a9A6ZP6APx was scoring us based on how many intermittent tests we've reported, and I've not reported many, not because I haven't gotten around to it, but because I haven't seen many. I wondered if the retries might be the reason why :-)

Contributor
If you've got the energy for it, I would be interested to see how much of a difference it makes. It's a sound theory and would be interesting to share the results with the program after a period of trying it.

@dzuelke merged commit 5c50dea into main on Oct 9, 2020
@dzuelke deleted the fastertests branch on October 9, 2020 12:54
@edmorley (Member) commented Dec 28, 2020

@dzuelke I've just discovered that there are actually two similarly named gems for splitting test runs for improved concurrency, parallel_tests and parallel_split_test, with the latter supporting splitting at the test rather than file level. Did you know about the latter? (Just wondering if there are issues that prevented using it here, which would have saved needing to split up the files so much)

@dzuelke (Contributor Author) commented Jan 6, 2021

Yeah @edmorley.

The reason I am not using the per-test one is that I need to control the state shared by several tests together: I create and deploy an app, then run many test cases against it, and a lot of the tests are generated by loops etc.

Splitting at the file level makes it easier to control how tests are grouped, which matters because the tests already run into Heroku API rate limits (which are already bumped up for that Hatchet account) at the wrong parallelization level.
