-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Serialisation/Deserialisation #11
base: master
Are you sure you want to change the base?
Conversation
very cool, I will have some time later this week to look this over. as with the previous PR, I'm not a fan of defending against bad input |
I have changed the assumptions to be based on the input buffer length. This will always be true unless the data is corrupted, in which case the protection is valid and it requires no extra configuration from the user. It's valuable to have this to prevent OOM problems and wasted compute on receiving corrupted data. It's fairly likely to happen within t-digest because of the different encoding between histogram types. If you put an encoded MergingDigest into the AVLTreeDigest decoder it reads an incorrect (usually very high) centroid count and proceed to fail or OOM. If collecting digests from thousands of services written in varying languages it makes this kind of mistake very likely. It would be great to save people from production outages due to crossed wires if we can. |
In order to aggregate tdigests from multiple services it's necessary to serialise and deserialise the data to be sent.
This change implements the serialisation from
AVLTree
in tdunning's original implementation with some added safety features.Both libraries can now process and swap data using the small and large encoding types.
https://github.com/tdunning/t-digest/blob/e8ca8479b7c98deb4bfe381a4e465e33e7063262/core/src/main/java/com/tdunning/math/stats/AVLTreeDigest.java#L340
The raw data used to generate the test data with tdunning's library is provided.