Make DataFrame.any_rowwise top-level, rename to _horizontal #324
Conversation
```python
df = df.filter(
    ns.any_horizontal(*[df.col(col_name) > 0 for col_name in df.column_names]),
)
```
with #326, this could be simplified to
```python
df = df.filter(ns.any_horizontal(*[col > 0 for col in df.columns_iter()]))
```
```diff
@@ -290,3 +290,55 @@ def date(year: int, month: int, day: int) -> Scalar:
     ... )
     >>> df.filter(mask)
     """
+
+
+def any_horizontal(*columns: Column, skip_nulls: bool | Scalar = True) -> Column:
```
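For context, a minimal sketch of what a backend implementation of `any_horizontal` could look like, folding the inputs with element-wise OR (the helper name `_any_horizontal_impl` is made up for illustration; this is not the actual implementation):

```python
from functools import reduce


def _any_horizontal_impl(*columns):
    # Fold the boolean columns together with element-wise OR, so each
    # row of the result is True if any input column is True in that row.
    if not columns:
        raise ValueError("at least one column is required")
    return reduce(lambda acc, col: acc | col, columns)
```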
Does `skip_nulls` need to accept a `Scalar` here? I figure it's only parameters that feed into data paths, as opposed to ones that change control flow, where we'd want to accept `Scalar`s. I.e. could a lazy framework meaningfully handle it if a lazy scalar was given here without forcing materialization?
in theory, yes:
```python
df: DataFrame
pdx = df.__dataframe_namespace__()
pdx.any_horizontal(*[col > 0 for col in df.columns_iter()], skip_nulls=df.col('a').get_value(0))
```
in practice, the above example is so contrived that I'm happy to just keep it as `skip_nulls: bool` here 😄
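For the record, the reason control-flow parameters sit badly with lazy scalars: they decide *which* expression graph gets built, so the engine would have to compute the scalar eagerly just to pick a branch. A hypothetical sketch (`fill_null` is assumed to behave like the standard's null-filling method):

```python
def any_horizontal(*columns, skip_nulls=True):
    # Branching on skip_nulls happens at graph-construction time, so a
    # lazy scalar passed here would have to be materialized immediately.
    if skip_nulls:
        columns = [col.fill_null(False) for col in columns]
    ...
```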
Could Polars lazy meaningfully handle passing that "LazyScalar" into the `skip_nulls` parameter without forcing computation?
at the moment `sum` doesn't have `skip_nulls`. There are places in the API that do handle what look like lazy scalars*, like:
```python
In [47]: df = pl.LazyFrame({'a': [date(2020, 1, 1), date(2020, 1, 2)], 'b': ['1d', '2d']})

In [48]: df.with_columns(pl.col('a').dt.offset_by(pl.col('b').gather(0)))
Out[48]: <LazyFrame [2 cols, {"a": Date, "b": Utf8}] at 0x7F7F36B92410>

In [49]: df.with_columns(pl.col('a').dt.offset_by(pl.col('b').gather(0))).collect()
Out[49]:
shape: (2, 2)
┌────────────┬─────┐
│ a          ┆ b   │
│ ---        ┆ --- │
│ date       ┆ str │
╞════════════╪═════╡
│ 2020-01-02 ┆ 1d  │
│ 2020-01-03 ┆ 2d  │
└────────────┴─────┘
```
But realistically, there'll probably never be any need for this, so it won't get added. So this reinforces your suggestion to just allow booleans here, and keep it simple.
*technically it's a 1-row column which can broadcast - there's no scalar class in Polars itself
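For anyone unfamiliar with that broadcasting behaviour, a small self-contained example (plain public Polars API; the column name is arbitrary):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})
# pl.col("a").max() evaluates to a 1-row column, which broadcasts
# against the full-length column "a" in the division below.
print(df.with_columns((pl.col("a") / pl.col("a").max()).alias("scaled")))
```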
thanks for your review!
The current `any_rowwise` function isn't particularly useful for filtering: `df.filter((df > 0).any_rowwise())` is undefined, because the parent dataframe of `(df > 0).any_rowwise()` is `df > 0`, not `df`.

Solution: introduce `namespace.any_rowwise` and `namespace.all_rowwise`. Also, rename to `_horizontal` for familiarity to Polars users (I'm not aware of other names - in Spark I think you're meant to do something like ...).

Then, you can write:
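```python
# the same expression as in the diff shown earlier in this thread
df = df.filter(
    ns.any_horizontal(*[df.col(col_name) > 0 for col_name in df.column_names]),
)
```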
Maybe, in a separate PR, we could add something like `ns.all()` for more easily doing some operation on all columns. But for now, this addresses the issue.

This came up here: https://github.com/skrub-data/skrub/pull/827/files#r1397599891
The same applies to `sorted_indices` and `unique_indices`.

As a design rule, I'd suggest not introducing `DataFrame` methods which return `Column`s - such methods should be top-level functions which accept `Column`s.
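To make the rule concrete, here's a small sketch contrasting the two styles (using only names that already appear in this PR; illustrative, not normative):

```python
# Method style (discouraged): the mask's parent dataframe is `df > 0`,
# not `df`, so using it to filter `df` is undefined:
#     mask = (df > 0).any_rowwise()
#     df.filter(mask)  # undefined: mask belongs to a different parent

# Top-level style (this PR): the inputs are columns of `df` itself,
# so the resulting mask is valid for filtering `df`:
ns = df.__dataframe_namespace__()
mask = ns.any_horizontal(*[df.col(name) > 0 for name in df.column_names])
df = df.filter(mask)
```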