
Implement minimal datatype registry for cross-batch deduplication #1809

Closed
Tracked by #1898
teh-cmc opened this issue Apr 11, 2023 · 2 comments
Labels
🏹 arrow concerning arrow ⛃ re_datastore affects the datastore itself

Comments


teh-cmc commented Apr 11, 2023

Once #1805 is fixed, schema metadata won't be duplicated across multiple cells of the same table anymore, i.e. all the cells in an incoming batch will share the same schema.
That's a great start but it's far from enough: schema metadata will still be duplicated across tables/batches.

We need a way of deduplicating schema information on the server side (and ideally on the clients too, while we're at it), e.g. by hashing DataTypes and exposing them through a global registry, so that the deserializer can deduplicate the data on entry.

Of course, another solution is to have the clients access the central schema registry on the server directly, so that they only ever send hashes to begin with; but that's future work for when we need it.
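The hash-and-intern registry described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions: `DataType`, `DataTypeRegistry`, and `intern` are hypothetical stand-ins for illustration, not the actual Rerun/arrow2 types, and a real implementation would also have to handle hash collisions rather than trust the hash alone.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

// Hypothetical stand-in for an Arrow datatype; the real type lives in arrow2.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum DataType {
    Float64,
    Utf8,
    List(Box<DataType>),
}

/// A minimal sketch of a global datatype registry: datatypes are keyed by
/// their hash, so identical schemas arriving in different batches resolve
/// to a single shared allocation.
#[derive(Default)]
struct DataTypeRegistry {
    types: HashMap<u64, Arc<DataType>>,
}

impl DataTypeRegistry {
    /// Interns a datatype, returning a shared handle; interning the same
    /// datatype twice yields the same underlying allocation.
    fn intern(&mut self, datatype: DataType) -> Arc<DataType> {
        let mut hasher = DefaultHasher::new();
        datatype.hash(&mut hasher);
        let key = hasher.finish();
        // NOTE: a production registry would verify equality on collision.
        Arc::clone(self.types.entry(key).or_insert_with(|| Arc::new(datatype)))
    }
}

fn main() {
    let mut registry = DataTypeRegistry::default();
    let a = registry.intern(DataType::List(Box::new(DataType::Float64)));
    let b = registry.intern(DataType::List(Box::new(DataType::Float64)));
    // Both handles point at the same deduplicated datatype.
    assert!(Arc::ptr_eq(&a, &b));
    println!("deduplicated: {}", Arc::ptr_eq(&a, &b));
}
```

The same `intern` entry point is where a deserializer would hook in: instead of materializing a fresh schema per incoming batch, it would look the hash up in the registry and drop the duplicate on entry.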


teh-cmc commented Jan 23, 2024

With the work done in #4883, this is now irrelevant in the context of scalar time series.

The only very specific case where this would have any noticeable impact is when logging scalars with the batcher completely disabled, so that each scalar allocates a different DataType. Except it wouldn't, since the datatype in that case is flat, and any kind of heap deduplication is therefore pointless; we would have to fundamentally change the definition of DataType so that it takes less stack space to begin with (see also #4883 (comment)).

Where this could potentially have an impact is when logging enum values whose schemas contain a lot of strings... but even then, those strings are stored in the heap part of the datatype, which is already deduplicated.
So you would have to specifically log small enum values with big schemas and do so with the batcher disabled.
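The point that the heap part of a datatype is already deduplicated can be illustrated with a toy sketch. Assumption: `EnumDataType` below is hypothetical; in practice the sharing happens inside re_arrow2's datatype representation, not in user code.

```rust
use std::sync::Arc;

// Hypothetical enum datatype whose variant names (the strings in the schema)
// live behind an Arc, i.e. the "heap part" of the datatype.
#[derive(Clone)]
struct EnumDataType {
    variants: Arc<Vec<String>>,
}

fn main() {
    let variants = Arc::new(vec![
        "Red".to_owned(),
        "Green".to_owned(),
        "Blue".to_owned(),
    ]);
    // Cloning the datatype copies only the Arc pointer; the string payload
    // (the heap part) is shared, i.e. already deduplicated.
    let a = EnumDataType { variants: Arc::clone(&variants) };
    let b = a.clone();
    assert!(Arc::ptr_eq(&a.variants, &b.variants));
    println!("heap part shared: {}", Arc::ptr_eq(&a.variants, &b.variants));
}
```

Only the small, fixed-size stack part is copied per clone, which is why a registry buys little unless the stack representation itself shrinks.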

That's pretty niche and outside the scope of this cycle.

teh-cmc removed their assignment on Jan 23, 2024

teh-cmc commented May 16, 2024

Between re_arrow2 doing heap deduplication and the upcoming work on dense chunks, this is now superfluous.

teh-cmc closed this as not planned on May 16, 2024