Investigation: Are we using the right statistics to show improvement in our benchmarks? #688
Comments
I think that a useful invariant would be that if we have two independent changes with "headline" numbers …
I agree with what you are saying, but in practice it does seem like one change could hide in another, e.g. both changes create better cache locality, and when you put them together you don't get that win "twice". I think a looser invariant is that if …
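(The two comments above are cut off in this copy, so the exact invariant isn't visible. As a purely hypothetical illustration of the kind of property being discussed: if the headline number is a geometric mean of per-benchmark speedups, then two changes whose per-benchmark effects genuinely multiply produce a combined headline equal to the product of their individual headlines. A minimal sketch, with every number below made up:)

```python
# Hypothetical illustration: the geometric mean of per-benchmark speedups is
# multiplicative, so if two changes act independently (their per-benchmark
# ratios multiply), the combined headline is the product of the two headlines.
# All numbers below are made up.
import math


def geometric_mean(ratios):
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Per-benchmark speedup ratios (time_before / time_after; > 1 means faster).
change_1 = [1.10, 1.02, 0.98, 1.25]
change_2 = [1.05, 1.15, 1.00, 0.97]

headline_1 = geometric_mean(change_1)
headline_2 = geometric_mean(change_2)

# If the changes really are independent, the per-benchmark ratios multiply.
combined = [a * b for a, b in zip(change_1, change_2)]
headline_combined = geometric_mean(combined)

print(f"{headline_1:.4f} * {headline_2:.4f} = {headline_1 * headline_2:.4f}")
print(f"geometric mean of combined ratios = {headline_combined:.4f}")
# The two products agree exactly; in practice the changes may interact
# (e.g. both improving cache locality), so the measured combination can fall short.
```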
I created longitudinal plots that show 4 different aggregation methods together.

Here is that on 3.14.x with JIT against 3.13.0b3 (main Linux machine only), and here is the same with the classic 3.11.x against 3.10.4 (main Linux machine only). It's nice to see that they are all more-or-less parallel with some offset, and while you can see HPT reducing variation (as designed), the other alternatives aren't uselessly noisy either. It's tempting to use the "overall mean" because it's the most favourable, but that feels like cherry-picking.

We don't quite have all the data to measure Brandt's suggestion. However, we can test the following for each of the methods: for two adjacent commits A and B and a common base C, if B:A > 1, then B:C must be > A:C. The only method for which this doesn't hold is the overall mean.

Lastly, I experimented with bringing the same nuance we have in the benchmarking plots to the longitudinal ones -- it's possible to show violin plots for each of the entries (again 3.11.x vs. 3.10.4). Imagine the x axis is dates -- it's a little tricky to make that work... This plot is interesting because it clearly shows where the "mean" improvement is, but also that there are a significant number of specific use cases where you can do much better than that -- I do sort of find it helpful to see that.

Anyway, there are still more things to look at here -- I just wanted to provide a braindump and get some feedback in the meantime.
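(A minimal sketch of the adjacency check described above, using made-up timings. The geometric mean satisfies the check by construction; an arithmetic mean of per-benchmark ratios is used here simply as an example of an aggregation that can violate it -- whether that is exactly what "overall mean" denotes is an assumption.)

```python
# Sketch of the consistency check: for adjacent commits A and B and a common
# base C, if B:A > 1 then we want B:C > A:C. All timings below are invented.
import math

# Hypothetical timings, in seconds, for two benchmarks.
times_C = [1.00, 1.00]
times_A = [0.50, 2.00]
times_B = [0.55, 1.60]


def ratios(base, head):
    """Per-benchmark speedups of `head` relative to `base` (> 1 means faster)."""
    return [b / h for b, h in zip(base, head)]


def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


for name, agg in [("geometric mean", geometric_mean),
                  ("arithmetic mean", arithmetic_mean)]:
    b_vs_a = agg(ratios(times_A, times_B))
    a_vs_c = agg(ratios(times_C, times_A))
    b_vs_c = agg(ratios(times_C, times_B))
    consistent = (b_vs_a <= 1) or (b_vs_c > a_vs_c)
    print(f"{name}: B:A={b_vs_a:.3f}  A:C={a_vs_c:.3f}  "
          f"B:C={b_vs_c:.3f}  consistent={consistent}")
```

(The reason the geometric mean always passes is that the geometric mean of the B:C ratios factors exactly into the geometric mean of the A:C ratios times the geometric mean of the B:A ratios; arithmetic means of ratios don't factor that way.)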
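(On the "imagine the x axis is dates" point: matplotlib's violinplot accepts explicit positions, which can be date numbers, so a longitudinal violin plot is possible in principle. A rough sketch with entirely synthetic data -- every date and distribution below is invented:)

```python
# Violins-over-time sketch: each "commit date" maps to an array of
# per-benchmark speedup ratios against a common base (synthetic data).
from datetime import date

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
speedups_by_date = {
    date(2024, 1, 1): rng.lognormal(mean=0.05, sigma=0.15, size=100),
    date(2024, 2, 1): rng.lognormal(mean=0.08, sigma=0.15, size=100),
    date(2024, 3, 1): rng.lognormal(mean=0.12, sigma=0.15, size=100),
}

fig, ax = plt.subplots()
# violinplot takes explicit x positions, so commit dates can be converted to
# matplotlib's date numbers and used directly.
positions = mdates.date2num(list(speedups_by_date.keys()))
ax.violinplot(list(speedups_by_date.values()), positions=positions,
              widths=20, showmedians=True)
ax.xaxis_date()  # format the numeric positions back into readable dates
ax.axhline(1.0, color="gray", linewidth=0.5)  # no-change reference line
ax.set_xlabel("commit date")
ax.set_ylabel("speedup vs. base")
plt.show()
```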
Based on a conversation I had with @brandtbucher, I feel it's time to reinvestigate the various methods we use to arrive at an overall improvement number for our benchmarks. To summarize, we currently provide:
There are a few puzzling things:
We are now in a good position, with a lot of data collected over a long period. I should play with the different statistical methods we have to see which are truly the most valuable for the following goals, which may require different solutions (a sketch of a few candidate aggregations follows this list):
a) understand if a change is helpful
b) show how far we've come
c) reduce measurement noise
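(As a starting point for (a) and (b), a comparison could begin from something like the sketch below, which computes a few candidate headline aggregations over the same hypothetical per-benchmark timings. Benchmark names and numbers are made up, and HPT is left out because it is considerably more involved than a single aggregate.)

```python
# Sketch: several candidate "headline" aggregations over the same hypothetical
# per-benchmark timings (base vs. head). All numbers are invented.
import math
import statistics

# Hypothetical (benchmark name, base seconds, head seconds).
results = [
    ("json_loads",    1.20, 1.00),
    ("regex_compile", 0.80, 0.78),
    ("raytrace",      5.00, 4.10),
    ("startup",       0.05, 0.06),
]

speedups = [base / head for _, base, head in results]

geometric_mean = math.exp(sum(map(math.log, speedups)) / len(speedups))
arithmetic_mean = sum(speedups) / len(speedups)
median = statistics.median(speedups)
# Ratio of total wall-clock time; long-running benchmarks dominate this one.
total_time_ratio = sum(b for _, b, _ in results) / sum(h for _, _, h in results)

print(f"geometric mean:   {geometric_mean:.3f}x")
print(f"arithmetic mean:  {arithmetic_mean:.3f}x")
print(f"median:           {median:.3f}x")
print(f"total-time ratio: {total_time_ratio:.3f}x")
```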