
Implement minimal datatype registry for cross-batch deduplication #1809

Closed
Tracked by #1898
teh-cmc opened this issue Apr 11, 2023 · 2 comments
Labels
🏹 arrow concerning arrow ⛃ re_datastore affects the datastore itself

Comments


teh-cmc commented Apr 11, 2023

Once #1805 is fixed, schema metadata won't be duplicated across multiple cells of the same table anymore, i.e. all the cells in an incoming batch will share the same schema.
That's a great start but it's far from enough: schema metadata will still be duplicated across tables/batches.

We need a way of deduplicating schema information on the server side (and ideally on the clients too, while we're at it), e.g. by hashing DataTypes and exposing them through a global registry, so that the deserializer can deduplicate the data on entry.

Of course, another solution is to have the clients access the central schema registry on the server directly, so that they only ever send hashes to begin with; but that's future work for when we need it.
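The hash-and-intern registry described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions: `DataType`, `DataTypeRegistry`, and `intern` are hypothetical stand-ins for illustration, not the actual Rerun/arrow2 types, and a real implementation would also have to handle hash collisions rather than trust the hash alone.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

// Hypothetical stand-in for an Arrow datatype; the real type lives in arrow2.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum DataType {
    Float64,
    Utf8,
    List(Box<DataType>),
}

/// A minimal sketch of a global datatype registry: datatypes are keyed by
/// their hash, so identical schemas arriving in different batches resolve
/// to a single shared allocation.
#[derive(Default)]
struct DataTypeRegistry {
    types: HashMap<u64, Arc<DataType>>,
}

impl DataTypeRegistry {
    /// Interns a datatype, returning a shared handle; interning the same
    /// datatype twice yields the same underlying allocation.
    fn intern(&mut self, datatype: DataType) -> Arc<DataType> {
        let mut hasher = DefaultHasher::new();
        datatype.hash(&mut hasher);
        let key = hasher.finish();
        // NOTE: a production registry would verify equality on collision.
        Arc::clone(self.types.entry(key).or_insert_with(|| Arc::new(datatype)))
    }
}

fn main() {
    let mut registry = DataTypeRegistry::default();
    let a = registry.intern(DataType::List(Box::new(DataType::Float64)));
    let b = registry.intern(DataType::List(Box::new(DataType::Float64)));
    // Both handles point at the same deduplicated datatype.
    assert!(Arc::ptr_eq(&a, &b));
    println!("deduplicated: {}", Arc::ptr_eq(&a, &b));
}
```

The same `intern` entry point is where a deserializer would hook in: instead of materializing a fresh schema per incoming batch, it would look the hash up in the registry and drop the duplicate on entry.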


teh-cmc commented Jan 23, 2024

With the work done in #4883, this is now irrelevant in the context of scalar time series.

The only very specific case where this would have any noticeable impact is when logging scalars with the batcher completely disabled, so that each scalar allocates a different DataType. Except it wouldn't, since the datatype in that case is flat, and any kind of heap deduplication is therefore pointless; we would have to fundamentally change the definition of DataType so that it takes less stack space to begin with (see also #4883 (comment)).

Where this could potentially have an impact is when logging enum values whose schemas contain a lot of strings... but even then, those strings are stored in the heap part of the datatype, which is already deduplicated.
So you would have to specifically log small enum values with big schemas and do so with the batcher disabled.
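The point that the heap part of a datatype is already deduplicated can be illustrated with a toy sketch. Assumption: `EnumDataType` below is hypothetical; in practice the sharing happens inside re_arrow2's datatype representation, not in user code.

```rust
use std::sync::Arc;

// Hypothetical enum datatype whose variant names (the strings in the schema)
// live behind an Arc, i.e. the "heap part" of the datatype.
#[derive(Clone)]
struct EnumDataType {
    variants: Arc<Vec<String>>,
}

fn main() {
    let variants = Arc::new(vec![
        "Red".to_owned(),
        "Green".to_owned(),
        "Blue".to_owned(),
    ]);
    // Cloning the datatype copies only the Arc pointer; the string payload
    // (the heap part) is shared, i.e. already deduplicated.
    let a = EnumDataType { variants: Arc::clone(&variants) };
    let b = a.clone();
    assert!(Arc::ptr_eq(&a.variants, &b.variants));
    println!("heap part shared: {}", Arc::ptr_eq(&a.variants, &b.variants));
}
```

Only the small, fixed-size stack part is copied per clone, which is why a registry buys little unless the stack representation itself shrinks.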

That's pretty niche and outside the scope of this cycle.

teh-cmc removed their assignment on Jan 23, 2024

teh-cmc commented May 16, 2024

Between re_arrow2 doing heap deduplication and the upcoming work on dense chunks, this is now superfluous.

teh-cmc closed this as not planned on May 16, 2024