-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get rid of test front-loading in the duration_based_chunks algo #27
Conversation
Nope, still fails. And if I change the test durations in that test case to all 1 except the last one which gets 100, all of the tests end up in the first group and I get 6 empty groups. |
How would you suggest changing the implementation @alexforencich? I'm not seeing the issue 🙂 |
So I did not run the test correctly (have to actually install the package before you can run pytest). After fixing that problem, the test case in the PR passes. However, it fails if I change it to:
In this case, only the first two groups are utilized and the rest are empty. |
It seems to me like the goal of this algorithm is to preserve the test ordering. So perhaps a better method would be to compute the cumulative sum into a new array, and then partition that on Incidentally, I think you can get slightly better performance out of least_duration by sorting the tests by duration first so they are processed from longest to shortest. However, the best option might be to do something like this: assign each test an ID number based on the original order, then sort based on duration, use the algorithm in least_duration to group the tests, then sort each group based on the ID. This will preserve the test order, but produce more optimal grouping, and there will be no issue with empty groups unless there are more groups than tests (and in that case, what you can do is enable some random test in all of the otherwise-empty groups). |
That almost seems like a feature, not a bug; but more importantly, is that something that was changed in #14? We discussed fixing the change in behavior and the possibility of guaranteeing that every group has a test in #26, but to be clear - this PR just tries to fix the first point. Would the test above behave differently in previous versions? |
Well, if it wasn't causing github actions to crash, I wouldn't care as much. But it is, so it's a bug. Looks like yes, the splitting algorithm was rewritten in #14. Looks like the previous version maybe was maybe duplicating tests in this case (not explicitly, but due to a side-effect of how it was written) so there weren't any empty groups, but this wasn't noticed since it didn't cause pytest to exit prematurely. |
@alexforencich you're right, that would be better. I just stuck with the current solution because that preserved relative order, which I thought was important for some. I'm happy to add this at a later stage. |
Out of curiosity, why do people care about ordering? I really only want my test groups to be as even as possible, because the longest group sets the total time of my pipeline. |
Just imagine you work for a company that has a large suite of tests and the corresponding technical debt. The tests already have order dependency (by accident) but it's too big a problem to fix all at once. |
Yep lack of test isolation is the only thing I can think of 🙂 |
Well, you still might have ordering dependency issues as you're not going to run all of the tests. For relative order, you can restore that by assigning monotonically increasing IDs before doing the partitioning, then sort each group by this ID before running the tests. |
Updated the PR now @alexforencich @jerry-git The
|
Definitely looks a lot better. Still greedy I think, but not sure if there is a nice way of improving that without making it a lot more complicated. At any rate, if you want the most even split, least_duration is probably a better choice. Next step is to add some "safety" code to go ensure that there is at least one test enabled in each group, even if there are fewer tests than groups, by enabling a test in empty groups. But that's not for this PR. |
@jerry-git this is ready to review if you get a chance 🙂 |
Ahoy! Sorry this took a while, I was on vacation mode 🍹 I'll have a look. |
durations["g"] = 1 | ||
durations["h"] = 1 | ||
|
||
items = [item(x) for x in ["a", "b", "c", "d", "e", "f", "g", "h"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be the same if the loop would be over sorted(durations.keys())
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean, sorry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think he's saying you can do items = sorted(durations.keys())
instead of duplicating all of the items.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I'm basically asking if this would be the same:
items = [item(x) for x in sorted(durations.keys())]
IMO it would be better considering readability as then it's explicit that there's one item for each thingy in durations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just copied in the example test from here and made sure we had one example that was extremely front-loaded and one that was extremely back-loaded. I can make it less explicit if you prefer.
(4, 1, "duration_based_chunks", ["test_1", "test_2", "test_3"]), | ||
(4, 2, "duration_based_chunks", ["test_4", "test_5"]), | ||
(4, 3, "duration_based_chunks", ["test_6"]), | ||
(4, 4, "duration_based_chunks", ["test_7", "test_8", "test_9", "test_10"]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this mean that the expected durations for each group will be:
- 1: 3s (1+1+1)
- 2: 2s (1+1)
- 3: 2s (2)
- 4: 8s (2+2+2+2)
If so, it doesn't feel right 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1-5 are 1 second tests, while 5-10 are 2 second tests, so I believe the timings are
- 3 seconds
- 3 seconds
- 2 seconds
- 8 seconds
And yes this is weird, but it's also what you should expect if you want to stop adding tests before going over the threshold of 15/4=3,75 seconds, and keeping absolute order 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a very simple greedy algorithm; it's pretty much unavoidable that you're going to get strange far-from-optimal results like this. The least_durations algorithm is much better and IMO should be the default and this algorithm should simply be removed. But this algorithm does have the benefit of maintaining the sequence of tests, which I suppose could potentially be beneficial in some cases if there are ordering dependencies between tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, here's something to consider potentially: exceed the threshold, but keep track of a running total overage and use that when determining if you're over the threshold. (If you never exceed the threshold, then you're basically guaranteed to end up with the last group being larger than the rest) There are probably a few other heuristics that you could apply here to try to improve things a little bit. For instance, something like this: ignore overage if the slot is empty and unconditionally assign, if it isn't empty then check to see if the resulting overage is less than half of the time required for that test, if so pack it, otherwise bump it to the next group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The least_durations algorithm is much better and IMO should be the default and this algorithm should simply be removed
I have a use case which requires this plugin to support maintaining the order of the tests, which is why we need to keep duration_based_chunks
or something similar. However, I can consider changing the default but that'd probably bite some other users of this plugin as well so need to consider that carefully 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll close this for now then 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, without these changes, is duration_based_chunks still producing empty groups? Or did that get fixed in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes probably. I noticed there are a few conflicts, so perhaps there's been another fix merged, but that's what this was meant to fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, well my point was that even if the splitting isn't as perfect as it could be, at least this fix doesn't result in empty groups, so it should be merged and then possibly improved, instead of being closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing. I've reopened it - feel free to make any additional changes you wish (cc @jerry-git) and merge. I'm on holiday now myself so likely won't be able to apply any code changes for a while.
groups = duration_based_chunks(8, items, durations) | ||
|
||
for i in range(7): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 8 above and range(7) here in the loop below? Could we somehow get rid of the magical numbers and use e.g. variables which names would help the reader to understand what's going on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use for g in groups:
instead?
for i in range(7): | ||
assert groups[i].selected != [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this same as assert all(group.selected for group in groups)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks like it would probably work; maybe just add a comment to indicate that we're making sure there aren't any empty groups. Or maybe explicitly check that the size of each group is nonzero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're making sure there aren't any empty groups ... check that the size of each group is nonzero
IMO all(group.selected for group in groups)
does both and is explicit enough so no comments or further explanations would be needed. Just my opinion ofc 🙂
Closing as stale, feel free to reopen 🙂 |
Should hopefully resolve #26