[Tracking] Improvements to polars-row #19903

Open
3 of 9 tasks
coastalwhite opened this issue Nov 21, 2024 · 0 comments
Collaborator

coastalwhite commented Nov 21, 2024

The polars-row subcrate provides the row encoding used by Polars. It is currently in a very bare state and should be improved to open up better options for sorting, joins, and the new streaming engine.

Here is a list of improvements that I would like to look into:

  • Fully implement nested encoding / decoding (refactor: Implement nested row encoding / decoding #19874)
  • Improve pl.List encoding with continuation tokens instead of variable length encoding. This allows empty child encoding, removes the need for an intermediate buffer, and massively reduces the amount of space needed by the row encoding. I do need to verify that it fully works 😅. (perf: More efficient row encoding for pl.List #19907)
  • Properly implement Dictionary encoding (to be used by pl.Enum and pl.Categorical), using the method described by the arrow-row people. We need to investigate how to roundtrip this efficiently (e.g. with a bidirectional HashMap).
  • Implement optimizations for BinaryView, similar to Dictionary, for when the cardinality of views is low. We need to find a way to estimate this cardinality. It might also be worth considering the average length of a view: for instance, this is probably not worth it if all views are inlinable anyway.
  • Implement optimizations for strings to not use the variable encoding but use properties of UTF-8 instead. (perf: Reduce the size of row encoding UTF-8 #19911)
  • Implement an optimization for Column encoding so that ScalarColumn and PartedColumn encoding becomes much cheaper.
  • Reduce the size of Boolean to at most one byte. (perf: Half the size of Booleans in row encoding #19927)
  • Introduce 2 loop systems for validity. Remove quadratic reserve!
  • Look into variable-length encoding for integers (perf: Add a VarInt encoding for the row encoding #19929).
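
To make the continuation-token idea for pl.List a bit more concrete, here is a minimal sketch of how I picture it. It is not the actual polars-row code: the token values `LIST_CONTINUE` and `LIST_END`, the function names, and the choice of non-null `i32` children are all assumptions for illustration, and null handling and descending order are left out. Each element is prefixed with a continuation token and the list is closed with an end token, so the bytes can be written straight into the output row without an intermediate buffer, and an empty list is just the end token.

```rust
// Hypothetical token values, chosen so that "end of list" sorts before
// "another element follows" under a plain byte-wise comparison.
const LIST_END: u8 = 0x00;
const LIST_CONTINUE: u8 = 0x01;

/// Order-preserving encoding of one i32: flip the sign bit and write the
/// result big-endian, so byte-wise comparison matches numeric comparison.
fn encode_i32(v: i32, out: &mut Vec<u8>) {
    out.extend_from_slice(&((v as u32) ^ 0x8000_0000).to_be_bytes());
}

/// Encode a list of non-null i32 values directly into `out`.
/// Every element is preceded by a continuation token; the list is
/// terminated by a single end token.
fn encode_list(values: &[i32], out: &mut Vec<u8>) {
    for &v in values {
        out.push(LIST_CONTINUE);
        encode_i32(v, out);
    }
    out.push(LIST_END);
}

/// Decode one list from the front of `row`.
/// Returns the values and the remaining bytes, or `None` on malformed input.
fn decode_list(row: &[u8]) -> Option<(Vec<i32>, &[u8])> {
    let mut values = Vec::new();
    let mut rest = row;
    loop {
        let (&token, tail) = rest.split_first()?;
        match token {
            LIST_END => return Some((values, tail)),
            LIST_CONTINUE => {
                if tail.len() < 4 {
                    return None;
                }
                let (b, tail) = tail.split_at(4);
                let raw = u32::from_be_bytes([b[0], b[1], b[2], b[3]]);
                // Undo the sign-bit flip applied by `encode_i32`.
                values.push((raw ^ 0x8000_0000) as i32);
                rest = tail;
            }
            _ => return None,
        }
    }
}
```

With this layout, comparing two encoded rows byte by byte gives lexicographic list order: a list that is a strict prefix of another sorts first, because its end token is smaller than the other list's continuation token at the same position.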