Faster tests #438
Conversation
We can achieve better concurrency with smaller files, as they can be balanced over more processes.
We only need to test the behavior of detection and startup suppression, none of which depends on the PHP version, so we run that once using the latest release rather than once for each PHP series.
The CI test is the slowest (about five minutes). We reserve one process for that, then two for each PHP series (up to seven, for heroku-16), and one spare so parallel_tests can distribute well. Too many processes and we'll quickly run into API rate limits.
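For reference, the parallel run itself is just a parallel_tests invocation with a fixed process count; a minimal sketch of the relevant .travis.yml line, assuming the parallel_rspec wrapper (the exact invocation used here may differ):

script:
  # run the spec files across 16 processes; parallel_tests balances them
  # by the runtimes recorded in parallel_runtime_rspec.log
  - bundle exec parallel_rspec -n 16 test/spec/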
Travis has the CLI pre-installed, but it's not always quite up to date, leading to hundreds of warning messages from the CLI that an update is available. Hatchet itself won't install the CLI if it's already there, so we're doing it by hand here, using the non-Ubuntu installer, both because it overwrites the Travis-provided version in /usr/local/bin (the Ubuntu installer's version in /usr/bin wouldn't take precedence) and because it's much faster. It's also much faster than doing a "heroku update", which downloads half of the Internet in the form of NPM metadata. We're also setting an env var to disable auto-updates for the CLI, since we just did a fresh install of the latest release.
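A minimal sketch of what that could look like in .travis.yml, assuming Heroku's standalone (non-Ubuntu) install script; the exact lines in this PR may differ:

before_install:
  # standalone installer overwrites the Travis-provided CLI in /usr/local/bin
  - curl https://cli-assets.heroku.com/install.sh | sh
env:
  global:
    # the install is brand new, so skip the CLI's update checks
    - HEROKU_DISABLE_AUTOUPDATE=1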
Awesome speed up! :-D
after_script:
  - bundle exec hatchet destroy --all
  - cat parallel_runtime_rspec.log | grep -E '^test/spec/[a-z0-9_/\.-]+\.rb:[0-9]+\.[0-9]+$' | sort
env:
  global:
    - HEROKU_DISABLE_AUTOUPDATE=1
    - HATCHET_RETRIES=3
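parallel_tests balances spec files across processes using the per-file runtimes it records in parallel_runtime_rspec.log; the grep above just filters that log down to well-formed entries so the timings show up in the build output. Each entry is one spec file and its runtime in seconds, roughly like this (file names other than ci_spec.rb and all numbers are hypothetical):

test/spec/ci_spec.rb:312.54
test/spec/php_spec.rb:95.31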
I wonder if we should all try lowering HATCHET_RETRIES (or disabling retries entirely) for a bit, in order to flush out remaining Hatchet/platform flakiness? Otherwise many of the failures go unnoticed in green CI runs (unless we happen to manually read every log). Yes, it would mean manually retrying tests from time to time, but given that the end-to-end test time is much lower now, that's less of a burden.
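If we try it, it should only be a one-line tweak to the env block above; a sketch, assuming Hatchet treats the value as the total number of attempts (so 1 would mean no retries):

env:
  global:
    - HEROKU_DISABLE_AUTOUPDATE=1
    # temporarily surface flaky deploys instead of papering over them
    - HATCHET_RETRIES=1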
Hmh, maybe. @schneems ?
Since I started tracking flappy runs, I've already caught and squashed a number of issues even with the retry. You can see the tracking issue heroku/hatchet#120.
My ultimate goal with reducing flappiness is to help with productivity. It's frustrating to make a tiny doc commit or version bump and wait for tests, only to have them fail and need to be restarted. If you want to drop it temporarily on your repo for a few weeks or a month, feel free to go for it, but there are things like network glitches and other general distributed-computing concerns that can never be truly ruled out, and retries help there.
I guess I would say: go for it if you want. Just know that I don't think disabling all retries is a reasonable long-term goal, and I wouldn't want to mandate reducing this value across the team (at least for now).
Yeah, I wasn't saying we should mandate it. It's just that https://salesforce.quip.com/B8a9A6ZP6APx was scoring us based on how many intermittent tests we've reported, and I've not reported many, not because I haven't gotten around to it, but because I haven't seen many. I wondered if the retries might be the reason why :-)
If you've got the energy for it, I would be interested to see how much of a difference it makes. It's a sound theory and would be interesting to share the results with the program after a period of trying it.
@dzuelke I've just discovered that there are actually two similarly named gems for splitting test runs for improved concurrency, parallel_tests and parallel_split_test, with the latter supporting splitting at the test rather than the file level. Did you know about the latter? (Just wondering if there are issues that prevented using it here, which would have saved needing to split up the files so much.)
Yeah @edmorley. The reason I'm not using the per-test one is that I need to control the state shared by several tests together: as in, I create and deploy an app, then run many test cases against it, and a lot of the tests are generated by loops etc. Splitting at the file level makes it easier to control how tests are grouped, and the tests already run into Heroku API rate limits (which are already bumped up for that Hatchet account) at the wrong parallelization level.
This speeds up tests from the previous ~14 minutes to under 6 minutes by splitting the tests into many smaller files, which we can run at higher concurrency (but not too much, or we run into API rate limits). Concurrency is currently tuned to 16 processes, so that the slowest/longest test case (ci_spec.rb) sets the upper limit on total runtime and everything else fits into the other 15 buckets. No tests have actually changed, apart from being split across more files.

The Heroku CLI install is now done before test startup to overwrite the older Travis-provided version, so there's no need for a ~30-second update, and auto-update is disabled since the fresh install is the latest release and doesn't need to check for updates.
GUS-W-8192252