Add Serialisation/Deserialisation #11

domhauton · 2018-12-10T11:46:43Z

In order to aggregate tdigests from multiple services it's necessary to serialise and deserialise the data to be sent.

This change implements the serialisation format of AVLTreeDigest from tdunning's original implementation, with some added safety features.

Both libraries can now process and swap data using the small and large encoding types.

https://github.com/tdunning/t-digest/blob/e8ca8479b7c98deb4bfe381a4e465e33e7063262/core/src/main/java/com/tdunning/math/stats/AVLTreeDigest.java#L340

The raw data used to generate the test data with tdunning's library is provided.

welch · 2018-12-10T19:09:26Z

very cool, I will have some time later this week to look this over.
serial interop with another library is very appealing.

as with the previous PR, I'm not a fan of defending against bad input
when doing so embeds new assumptions, so the overflow protection
will be a hard sell.

domhauton · 2018-12-11T11:55:50Z

I have changed the assumptions to be based on the input buffer length. They will always hold unless the data is corrupted, in which case the protection is valid, and it requires no extra configuration from the user.

It's valuable to have this to prevent OOM problems and wasted compute when receiving corrupted data. That is fairly likely to happen within t-digest because the encoding differs between histogram types. If you put an encoded MergingDigest into the AVLTreeDigest decoder, it reads an incorrect (usually very high) centroid count and proceeds to fail or OOM. When collecting digests from thousands of services written in varying languages, this kind of mistake becomes very likely.

It would be great to save people from production outages due to crossed wires if we can.
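To illustrate the buffer-length check described above, here is a minimal sketch in Python. It is not the code in this PR and does not reproduce tdunning's actual byte layout: the header and centroid formats (`HEADER`, `CENTROID`) and the function name `decode_centroids` are hypothetical, chosen only to show the idea that a claimed centroid count can be rejected whenever it would need more bytes than the buffer contains.

```python
import struct

# Hypothetical layout, for illustration only (not tdunning's real encoding):
# a big-endian int32 centroid count, then `count` records of float64 mean + int32 weight.
HEADER = struct.Struct(">i")
CENTROID = struct.Struct(">di")


def decode_centroids(buf):
    """Decode (mean, weight) pairs, rejecting counts the buffer cannot hold."""
    if len(buf) < HEADER.size:
        raise ValueError("buffer too short for header")
    (count,) = HEADER.unpack_from(buf, 0)
    remaining = len(buf) - HEADER.size
    # The guard: a valid count never needs more bytes than the buffer provides,
    # so no user-supplied limit is required and nothing is allocated up front.
    if count < 0 or count * CENTROID.size > remaining:
        raise ValueError("centroid count %d exceeds buffer capacity" % count)
    return [
        CENTROID.unpack_from(buf, HEADER.size + i * CENTROID.size)
        for i in range(count)
    ]
```

With a check like this, a buffer produced by a different encoder whose first bytes happen to decode as a huge count fails fast with an error instead of attempting a large allocation.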

Dominic Hauton added 13 commits December 5, 2018 15:50

Fix bug where mean is returned as string

9614ed4

Fix for discrete digests

d154408

Fix for discrete digests

0086e14

Fix for discrete digests

ab071fb

Add test for serialisation

a35c1b7

Include test data

396a2f4

Add cap on histogram size to prevent overflow

fbc9dbb

Ready up for PR

ff93505

Ready up for PR

284b31d

Revert other fixes

35f57f8

Update README and bump package version

951005b

Added new dependency to README

6fb7900

Improve auto centroid detection

d2e34f7
