Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Ergonomic casts from Categorical to Enum #20645

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Jan 9, 2025

ref: #19868 (comment).

Allow for stable ergonomic casts from Categorical to Enum.

If no categories are defined, we read the Categorical rev_map and normalise to a lexically sorted Enum (which ensures that if the same categories appear in a different order they will return the same Enum).

Example

import polars as pl

df = pl.DataFrame({
  "id": [0, 1, 2, 3],
  "foo": pl.Series(["aaa", "bbb", "aaa", "bbb"], dtype=pl.Categorical),
  "bar": pl.Series(["zzz", "xxx", "xxx", "yyy"], dtype=pl.Categorical),
  "baz": pl.Series(["xxx", "yyy", "zzz", "xxx"], dtype=pl.Categorical),
})

Before:
(note that we have to sort the categories to ensure a stable Enum)

df.cast({
  "foo": pl.Enum(sorted(df["foo"].cat.get_categories())),
  "bar": pl.Enum(sorted(df["bar"].cat.get_categories())),
  "baz": pl.Enum(sorted(df["baz"].cat.get_categories())),
})

After:
(no boilerplate, and the resulting enums will be stable/sorted)

df.cast({pl.Categorical: pl.Enum})

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jan 9, 2025
@ritchie46
Copy link
Member

I am not sure about this. The fields in an enum belong to the data type. In a categorical they are inferred at runtime.

Copy link

codecov bot commented Jan 9, 2025

Codecov Report

Attention: Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.

Project coverage is 79.01%. Comparing base (6cd9988) to head (90521e2).

Files with missing lines Patch % Lines
...-core/src/chunked_array/logical/categorical/mod.rs 92.85% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #20645   +/-   ##
=======================================
  Coverage   79.01%   79.01%           
=======================================
  Files        1557     1557           
  Lines      220552   220560    +8     
  Branches     2514     2514           
=======================================
+ Hits       174274   174284   +10     
+ Misses      45705    45703    -2     
  Partials      573      573           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 9, 2025

I am not sure about this. The fields in an enum belong to the data type. In a categorical they are inferred at runtime.

Yup; it's an ergonomic improvement to something that may not actually be a good idea, heh. I'll park in Draft so you can have a think about it; not a big PR - feel free to close if you prefer not to merge ;)

@alexander-beedie alexander-beedie marked this pull request as draft January 9, 2025 15:32
@coastalwhite
Copy link
Collaborator

I don't think "lexically sorted enum"s exist. That will panic in many places.

@mcrumiller
Copy link
Contributor

FYI there's a little relevant discussion in #14952.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 9, 2025

I don't think "lexically sorted enum"s exist. That will panic in many places.

As in, the Enum gets created with the values having been sorted (as the same Categorical values coming from several sources could have a very different physical ordering, but -if we were to merge this- we'd want the Enum equivalent to resolve consistently). I could have worded that better ;)

@s-banach
Copy link
Contributor

s-banach commented Jan 9, 2025

I just wanted to freeze an existing categorical encoding, e.g. the output of cut(). Definitely didn’t imagine it would change the ordering.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants