[bug] Companies table and developers table have different data for the same period #51

Open
craigbox opened this issue Feb 1, 2024 · 8 comments
Labels: blocked, bug (Something isn't working)

Comments

craigbox commented Feb 1, 2024

@lukaszgryglicki has regenerated our database as of ~15 minutes ago, so this data is as fresh as it comes.

The companies table reports the top 5 contributors to Istio in the last 12 months as:

| Rank | Company | Contributions |
|---:|---|---:|
| 1 | Google LLC | 12738 |
| 2 | Solo.io | 8402 |
| 3 | DaoCloud Network Technology Co. Ltd. | 8229 |
| 4 | International Business Machines Corporation | 7461 |
| 5 | Huawei Technologies Co. Ltd | 6593 |

However, if one exports the data from the Developer activity counts by company view for the same period, the summation is this:

| Rank | Company | Contributions |
|---:|---|---:|
| 1 | Google LLC | 12615 |
| 2 | Solo.io | 8605 |
| 3 | DaoCloud Network Technology Co. Ltd. | 7936 |
| 4 | International Business Machines Corporation | 7509 |
| 5 | Huawei Technologies Co. Ltd | 6605 |

Note how some companies show fewer contributions in the second list, and some have more.

Istio uses this data as part of its governance process, and last week, the order of the top 5 results shown here actually differed depending on which metric you used.

Can you help us understand why these values are different?

craigbox added the bug label Feb 1, 2024

lukaszgryglicki commented

This could be due to HLL (HyperLogLog). I can change the metrics (for Istio only) to use exact distinct counts instead of the approximate counts that HLL gives, but this will require creating custom SQL just for Istio. It can be done in a day or two, but not earlier than about a week or two from now.
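As a rough illustration of why an HLL-based count and an exact distinct count can disagree, here is a minimal HyperLogLog sketch in Python. It is not DevStats's implementation: the function names, the SHA-1-based hash, and the register count (`p=12`) are assumptions chosen only for this example.

```python
import hashlib
import math


def _hash64(value: str) -> int:
    """Stable 64-bit hash of a string (first 8 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


def hll_estimate(values, p: int = 12) -> int:
    """Approximate the number of distinct values using 2**p HLL registers."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        h = _hash64(v)
        idx = h >> (64 - p)                    # top p bits select a register
        rest = h & ((1 << (64 - p)) - 1)       # remaining 64 - p bits
        # 1-based position of the leftmost 1-bit in the remaining bits
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)           # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:               # small-range (linear counting) correction
        raw = m * math.log(m / zeros)
    return round(raw)


# Hypothetical event IDs, each appearing twice, so duplicates must be ignored.
events = [f"event-{i}" for i in range(20000)] * 2

exact = len(set(events))        # what an exact distinct count returns
approx = hll_estimate(events)   # what an HLL-style estimate returns
print(exact, approx, f"{100 * (approx - exact) / exact:+.2f}%")
```

With 4096 registers the typical relative error of HyperLogLog is about 1.04/sqrt(4096), roughly 1.6%, so two dashboards can differ by a few percent if only one of them uses the approximation.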

lukaszgryglicki self-assigned this Feb 1, 2024

lukaszgryglicki commented May 7, 2024

Can you recheck and LMK if this is still needed? I've optimised some metrics recently and they no longer use HLL, so this might be OK already. If not, LMK and I'll iterate on this when I can.

craigbox commented May 16, 2024

Companies table:

| Company | Contributions |
|---|---:|
| Solo.io | 12905 |
| Google LLC | 9679 |
| DaoCloud Network Technology Co. Ltd. | 7427 |
| International Business Machines Corporation | 5980 |
| Huawei Technologies Co. Ltd | 5757 |
| Microsoft Corporation | 3182 |
| Tetrate.io | 2917 |
| Ericsson | 1448 |
| Salesforce.com Inc. | 1303 |
| Red Hat Inc. | 1141 |

Sum of developers table:

| Company | Sum of Contributions |
|---|---:|
| Solo.io | 12513 |
| Google LLC | 9518 |
| DaoCloud Network Technology Co. Ltd. | 7442 |
| International Business Machines Corporation | 5822 |
| Huawei Technologies Co. Ltd | 5375 |
| Microsoft Corporation | 3168 |
| Tetrate.io | 2981 |
| Ericsson | 1408 |
| Salesforce.com Inc. | 1309 |
| Red Hat Inc. | 1124 |

Closer, and currently in the same order/ballpark.

(Edit: initial miscalculation around Google was my error.)
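To put numbers on "same order/ballpark", the two sets of figures from this comment can be compared directly. This is only a sketch: the dictionary and function names are made up for the example, and the values are copied from the two tables in this comment.

```python
# Values copied from the "Companies table" figures in this comment.
companies_table = {
    "Solo.io": 12905,
    "Google LLC": 9679,
    "DaoCloud Network Technology Co. Ltd.": 7427,
    "International Business Machines Corporation": 5980,
    "Huawei Technologies Co. Ltd": 5757,
    "Microsoft Corporation": 3182,
    "Tetrate.io": 2917,
    "Ericsson": 1448,
    "Salesforce.com Inc.": 1303,
    "Red Hat Inc.": 1141,
}

# Values copied from the "Sum of developers table" figures in this comment.
developers_sum = {
    "Solo.io": 12513,
    "Google LLC": 9518,
    "DaoCloud Network Technology Co. Ltd.": 7442,
    "International Business Machines Corporation": 5822,
    "Huawei Technologies Co. Ltd": 5375,
    "Microsoft Corporation": 3168,
    "Tetrate.io": 2981,
    "Ericsson": 1408,
    "Salesforce.com Inc.": 1309,
    "Red Hat Inc.": 1124,
}


def compare(table, summed):
    """Return (company, table value, summed value, delta, delta in %) rows."""
    rows = []
    for company, total in table.items():
        delta = summed[company] - total
        rows.append((company, total, summed[company], delta, 100.0 * delta / total))
    return rows


def ranking(counts):
    """Company names ordered from most to fewest contributions."""
    return sorted(counts, key=counts.get, reverse=True)


rows = compare(companies_table, developers_sum)
for company, total, summed, delta, pct in rows:
    print(f"{company:<45}{total:>7}{summed:>7}{delta:>+6}{pct:>+7.1f}%")

same_order = ranking(companies_table) == ranking(developers_sum)
print("same order:", same_order)
```

On these figures the ranking is identical in both lists, and the largest relative gap is Huawei at about 6.6% (5757 versus 5375).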

lukaszgryglicki commented

I will TAL on Friday or Monday.

craigbox commented May 16, 2024 via email

lukaszgryglicki commented

Hmm, the first link gives the sum of all contributions (this one), while the other gives values per developer and you are summing them manually, right?

I'll check whether both use HLL or both don't. Actually, I will also update them to use exact counts for Istio: HLL was used to save cycles, but it makes more sense in the All CNCF instance, which has a lot of data. Here we can just use the exact-counts approach (as Istio isn't as huge as the All CNCF instance). Let me dive into it. Maybe the query conditions are slightly different on those two dashboards too?

lukaszgryglicki commented

One was using HLL while the other was not. I will sync them now and regenerate the data, then I'll let you know when finished.

Also, please note that statistics across DevStats are not calculated "on the fly" but synced at a given point in time and saved in tables (so the Grafana UI later does just a simple select from those "calculated" tables). If the calculation for "last year" happened at a different time for the two metrics, they can be slightly out of sync, but the difference shouldn't be high. After the manual sync that I'll do now, they should be as close to each other as possible.
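The effect of calculating a "last year" window at two different moments can be shown with a tiny example. The timestamps and sync times below are invented for illustration and do not come from DevStats.

```python
from datetime import datetime, timedelta


def count_last_year(event_times, computed_at):
    """Count events in the 365 days ending at `computed_at`."""
    start = computed_at - timedelta(days=365)
    return sum(1 for t in event_times if start < t <= computed_at)


# Hypothetical contribution timestamps.
events = [
    datetime(2023, 1, 31, 6, 0),   # drops out of the window by the later sync
    datetime(2023, 6, 15, 9, 0),
    datetime(2023, 11, 2, 14, 0),
    datetime(2024, 1, 31, 3, 0),   # happens between the two syncs
    datetime(2024, 1, 31, 8, 0),   # happens between the two syncs
]

sync_a = datetime(2024, 1, 31, 0, 0)    # when one metric was calculated
sync_b = datetime(2024, 1, 31, 12, 0)   # the other metric, 12 hours later

count_a = count_last_year(events, sync_a)
count_b = count_last_year(events, sync_b)
print(count_a, count_b)  # prints "3 4"
```

Both numbers are correct for the moment they were computed. They differ because old events leave the window and new ones enter it between the two calculations.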

lukaszgryglicki commented

I've regenerated the data. I don't have a script to sum all developers to check those values, so PTAL again please. Hope this is OK now.
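A script for that check could look roughly like the sketch below. It assumes the per-developer data has been exported as CSV with `company` and `contributions` columns. Those column names, the function name, and the sample rows are all hypothetical, since the real export format is not shown in this thread.

```python
import csv
import io
from collections import defaultdict


def sum_by_company(csv_text, company_col="company", count_col="contributions"):
    """Sum per-developer contribution counts into one total per company.

    The column names are assumptions; adjust them to match the real export.
    """
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[company_col].strip()] += int(row[count_col])
    return dict(totals)


# Made-up rows standing in for an exported developers table.
sample = """developer,company,contributions
dev-a,Example Corp,120
dev-b,Example Corp,30
dev-c,Another Co,75
"""

totals = sum_by_company(sample)
for company, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{company:<20}{total:>8}")
```

The per-company totals can then be set side by side with the companies table to see how far apart the two metrics are.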
