New Binary/String type #13459

Closed
ritchie46 opened this issue Jan 5, 2024 · 7 comments
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars

ritchie46 (Member) commented Jan 5, 2024

The goal is to replace the current Arrow (Large)String type with a string type that allows a union between an inlined small string and an offset to a string that is allocated somewhere else.
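For readers unfamiliar with this kind of layout, here is a minimal sketch of what such a union could look like. It assumes the 16-byte view encoding that Arrow uses for its Utf8View/BinaryView types: a 4-byte length, followed either by up to 12 inline bytes, or by a 4-byte prefix plus a buffer index and an offset. The helper names `encode_view` and `decode_view` are made up for illustration and are not Polars or Arrow APIs.

```python
import struct

# Assumption: strings of up to 12 bytes fit inline in a 16-byte view.
INLINE_MAX = 12


def encode_view(value: bytes, buffer_index: int = 0, offset: int = 0) -> bytes:
    """Pack one 16-byte view for `value` (hypothetical helper)."""
    n = len(value)
    if n <= INLINE_MAX:
        # Short string: length + the bytes themselves, zero-padded to 12 bytes.
        return struct.pack("<I12s", n, value)
    # Long string: length + 4-byte prefix + which buffer + where in that buffer.
    return struct.pack("<I4sII", n, value[:4], buffer_index, offset)


def decode_view(view: bytes, buffers: list) -> bytes:
    """Resolve a 16-byte view back to its bytes (hypothetical helper)."""
    (n,) = struct.unpack_from("<I", view, 0)
    if n <= INLINE_MAX:
        return view[4:4 + n]
    _prefix, buffer_index, offset = struct.unpack_from("<4sII", view, 4)
    return buffers[buffer_index][offset:offset + n]
```

Both branches produce exactly 16 bytes, so every element has a fixed size regardless of how long the string is.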

This would prevent the terrible performance we have when filtering/gathering large string data, as that forces a copy of all bytes. Second, this type also allows string interning: duplicates can be stored only once in the buffer, and we can then point to that string multiple times.

Relevant arrow discussion here: https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
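To make the filter/gather argument concrete, here is a toy model contrasting the two layouts. It is only a sketch: `OffsetArray` and `ViewArray` are hypothetical classes, the inline small-string case is left out, and real implementations work on typed buffers rather than Python lists.

```python
from dataclasses import dataclass


@dataclass
class OffsetArray:
    """(Large)String-style: one contiguous data buffer plus n + 1 offsets."""
    data: bytes
    offsets: list

    def gather(self, indices):
        # Values must be contiguous in the result, so every selected byte is copied.
        parts = [self.data[self.offsets[i]:self.offsets[i + 1]] for i in indices]
        offsets = [0]
        for part in parts:
            offsets.append(offsets[-1] + len(part))
        return OffsetArray(b"".join(parts), offsets)


@dataclass
class ViewArray:
    """View-style: each element is an (offset, length) pair into a shared buffer."""
    buffer: bytes
    views: list

    def gather(self, indices):
        # Only the small views are copied; the buffer is reused as-is.
        return ViewArray(self.buffer, [self.views[i] for i in indices])

    def get(self, i):
        offset, length = self.views[i]
        return self.buffer[offset:offset + length]


big = b"a" * 10_000
offset_arr = OffsetArray(big + b"bb", [0, 10_000, 10_002])
view_arr = ViewArray(big + b"bb", [(0, 10_000), (10_000, 2)])

taken_offsets = offset_arr.gather([0, 0, 0])  # data buffer grows to 30_000 bytes
taken_views = view_arr.gather([0, 0, 0])      # three views pointing at the same bytes
```

In the view version, the three gathered elements all point at the single stored copy of the large string, which is the interning effect described above.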

@ritchie46 ritchie46 added this to the 1.0.0 milestone Jan 5, 2024
@stinodego stinodego added the accepted Ready for implementation label Jan 5, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jan 5, 2024
@stinodego stinodego moved this from Ready to In progress in Backlog Jan 5, 2024
@ritchie46 ritchie46 added python Related to Python Polars rust Related to Rust Polars enhancement New feature or an improvement of an existing feature performance Performance issues or improvements labels Jan 6, 2024
@ritchie46 ritchie46 moved this from In progress to Done in Backlog Jan 22, 2024

Steiniche commented Feb 5, 2024

Hi @ritchie46, I believe this issue can be closed, as the functionality has been merged and released.

ritchie46 (Member, Author) commented

Yeap, thanks


adriangb commented Aug 8, 2024

Is this available in the Python API? I don't see any references to the type.

ritchie46 (Member, Author) commented

It is available. Our string column type is backed by this. Polars keeps it simple: we have one string type, and you use it automatically.


adriangb commented Aug 8, 2024

If I understand correctly, this type can/should intern large strings, providing some level of "dictionary encoding".
On Polars 1.4.1 I tried comparing it to a Categorical column and got a much smaller size for the Categorical column:

import polars as pl

large_string = 'a' * 10_000
data = [large_string] * 100_000

df = pl.DataFrame({'x': data}, schema={'x': pl.String})
print(df.estimated_size())  # 1000000000

df = pl.DataFrame({'x': data}, schema={'x': pl.Categorical})
print(df.estimated_size())  # 410000

Is this expected? Am I misunderstanding the type?

ritchie46 (Member, Author) commented

Yes, we can. There is a cost to checking for duplicates (interning) upon construction, though. Currently we only do that if you extend internally, e.g. via Series.new_from_index, I believe.

Later we will also look into checking for duplicates during construction until a limited size is reached.
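One possible reading of "interning checking until a limited size is reached" is sketched below. This is an assumption about the idea, not how Polars implements or plans to implement it; `build_views` and `intern_limit` are hypothetical names. The builder remembers distinct values in a lookup table only until the remembered bytes reach a budget, so the cost of the check stays bounded.

```python
def build_views(values, intern_limit=1 << 20):
    """Build (buffer, views), reusing earlier copies of values seen before.

    Hypothetical sketch: distinct values are remembered only while their
    total size stays within `intern_limit` bytes.
    """
    buffer = bytearray()
    views = []      # one (offset, length) pair per element
    seen = {}       # value -> offset of its first copy in the buffer
    tracked = 0     # bytes of distinct values remembered so far
    for value in values:
        offset = seen.get(value)  # hashing here is the per-value cost of the check
        if offset is None:
            offset = len(buffer)
            buffer += value
            if tracked + len(value) <= intern_limit:
                seen[value] = offset
                tracked += len(value)
        views.append((offset, len(value)))
    return bytes(buffer), views


values = [b"a" * 10_000] * 100
interned_buffer, interned_views = build_views(values)
plain_buffer, plain_views = build_views(values, intern_limit=0)
```

With the default budget the buffer holds the large string once; with `intern_limit=0` nothing is remembered and every element gets its own copy.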


adriangb commented Aug 8, 2024

Indeed:

import polars as pl

large_string = 'a' * 10_000
data = [large_string] * 100_000

df = pl.DataFrame({'x': data}, schema={'x': pl.String})
assert df.shape[0] == 100_000
print(df.estimated_size())  # 1000000000

df = pl.DataFrame({'x': data}, schema={'x': pl.Categorical})
assert df.shape[0] == 100_000
print(df.estimated_size())  # 410000

s = pl.Series('x', [large_string], pl.String)
s = s.new_from_index(0, len(data))
df = pl.DataFrame({'x': s})
assert df.shape[0] == 100_000
print(df.estimated_size())  # 10000

It's even better than a dictionary / categorical column!
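The three reported sizes can be reproduced with simple arithmetic. This is a reading of the numbers above, not a statement of how estimated_size is implemented: it assumes the Categorical column stores one 4-byte code per row plus one copy of the string, and that the reported String sizes count only the string bytes.

```python
n_rows = 100_000
str_len = 10_000  # bytes in 'a' * 10_000 (ASCII, so one byte per character)

# String column built from a Python list: every row carries its own copy.
string_no_interning = n_rows * str_len

# Categorical column (assumed): one u32 code per row plus one copy of the string.
categorical = n_rows * 4 + str_len

# String column built with new_from_index: a single shared copy of the bytes.
string_interned = str_len

print(string_no_interning, categorical, string_interned)
```

If that reading is right, the per-row views of the String column (which would be 16 bytes each under the Arrow view layout) do not show up in these figures, so the comparison with Categorical is somewhat more favorable than the real memory footprint.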

Agreed, it would be nice to do this heuristically when constructing, or at least to be able to force it on when I know my data has a lot of duplicated large strings.
