Make DataFrame.any_rowwise top-level, rename to _horizontal #324

Merged: 8 commits, Nov 22, 2023

Conversation


@MarcoGorelli MarcoGorelli commented Nov 17, 2023

The current any_rowwise function isn't particularly useful for filtering:

df: DataFrame
df.filter((df > 0).any_rowwise())

is undefined, because the parent dataframe of (df > 0).any_rowwise() is df > 0, not df.

Solution: introduce namespace.any_rowwise and namespace.all_rowwise. Also, rename to _horizontal for familiarity for Polars users (I'm not aware of other names; in Spark, I think you're meant to do something like

from functools import reduce
predicate = reduce(lambda a, b: a | b, [df[x] != 0 for x in cols])

)
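
For reference, the reduce-based pattern above can be made runnable with plain pandas as a stand-in (the dataframe, its data, and the cols list here are hypothetical):

```python
from functools import reduce

import pandas as pd

# Hypothetical data; cols is the set of columns to test.
df = pd.DataFrame({"a": [1, 0, 2], "b": [0, 0, 3]})
cols = ["a", "b"]

# Elementwise OR across the per-column predicates, as in the
# Spark-style reduce pattern above.
predicate = reduce(lambda a, b: a | b, [df[x] != 0 for x in cols])
print(predicate.tolist())  # [True, False, True]
```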

Then, you can write

    >>> df: DataFrame
    >>> ns = df.__dataframe_namespace__()
    >>> mask = ns.all_horizontal(
    ...     *[df.col(col_name) > 0 for col_name in df.column_names()]
    ... )
    >>> df = df.filter(mask)
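
As a rough sketch of what such a top-level function could look like (using pandas Series as stand-in Columns; this all_horizontal is a hypothetical reimplementation for illustration, not the standard's):

```python
from functools import reduce

import pandas as pd


def all_horizontal(*columns: pd.Series) -> pd.Series:
    # Elementwise AND across all the given boolean columns.
    return reduce(lambda a, b: a & b, columns)


df = pd.DataFrame({"x": [1, -1, 2], "y": [3, 4, -5]})
mask = all_horizontal(*[df[c] > 0 for c in df.columns])
print(df[mask]["x"].tolist())  # [1]
```

The key point is that all_horizontal takes Columns as arguments, so it has no hidden dependency on a single parent dataframe.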

Maybe, in a separate PR, we could add something like ns.all() for more easily applying an operation to all columns. But for now, this addresses the issue.

This came up here https://github.com/skrub-data/skrub/pull/827/files#r1397599891


The same applies to sorted_indices and unique_indices.

As a design rule, I'd suggest not introducing DataFrame methods which return Columns - such methods should be top-level functions which accept Columns

@MarcoGorelli MarcoGorelli requested review from kkraus14, rgommers, shwina and cbourjau and removed request for kkraus14 November 20, 2023 10:46
Comment on lines +25 to +27
df = df.filter(
ns.any_horizontal(*[df.col(col_name) > 0 for col_name in df.column_names]),
)
Contributor Author:

with #326, this could be simplified to

Suggested change
df = df.filter(
ns.any_horizontal(*[df.col(col_name) > 0 for col_name in df.column_names]),
)
df = df.filter(ns.any_horizontal(*[col > 0 for col in df.columns_iter()]))

@@ -290,3 +290,55 @@ def date(year: int, month: int, day: int) -> Scalar:
... )
>>> df.filter(mask)
"""


def any_horizontal(*columns: Column, skip_nulls: bool | Scalar = True) -> Column:
Collaborator:

Does skip_nulls need to accept a Scalar here? I figure it's only parameters that feed into data paths as opposed to things that change control flow where we'd want to accept Scalars?

I.E. could a lazy framework meaningfully handle it if a lazy scalar was given here without forcing materialization?

Contributor Author:

in theory, yes:

df: DataFrame
pdx = df.__dataframe_namespace__()
pdx.any_horizontal(*[col > 0 for col in df.columns_iter()], skip_nulls=df.col('a').get_value(0))

in practice, the above example is so contrived, I'm happy to just keep it as skip_nulls: bool here 😄

Collaborator:

Could Polars lazy meaningfully handle passing that "LazyScalar" into the skip_nulls parameter without forcing computation?

Contributor Author:

at the moment sum doesn't have skip_nulls

there are places in the api that do handle what look like lazy scalars*, like

In [47]: df = pl.LazyFrame({'a': [date(2020, 1, 1), date(2020, 1, 2)], 'b': ['1d', '2d']})

In [48]: df.with_columns(pl.col('a').dt.offset_by(pl.col('b').gather(0)))
Out[48]: <LazyFrame [2 cols, {"a": Date, "b": Utf8}] at 0x7F7F36B92410>

In [49]: df.with_columns(pl.col('a').dt.offset_by(pl.col('b').gather(0))).collect()
Out[49]:
shape: (2, 2)
┌────────────┬─────┐
│ a          ┆ b   │
│ ---        ┆ --- │
│ date       ┆ str │
╞════════════╪═════╡
│ 2020-01-02 ┆ 1d  │
│ 2020-01-03 ┆ 2d  │
└────────────┴─────┘

But realistically, there will probably never be any need for this, so it won't get added. So this reinforces your suggestion to just allow booleans here and keep it simple.

*technically it's a 1-row column which can broadcast, there's no scalar class in Polars itself
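
To make the skip_nulls question concrete, here is a hypothetical sketch using pandas nullable booleans as stand-in Columns; the null-handling semantics shown in the comments are an assumption for illustration, not the standard's definition:

```python
from functools import reduce

import pandas as pd


def any_horizontal(*columns: pd.Series, skip_nulls: bool = True) -> pd.Series:
    # Assumed semantics: with skip_nulls=True, nulls count as False;
    # with skip_nulls=False, nulls propagate through pandas' Kleene-logic OR.
    cols = [c.fillna(False) for c in columns] if skip_nulls else columns
    return reduce(lambda a, b: a | b, cols)


a = pd.Series([True, None, None], dtype="boolean")
b = pd.Series([False, True, None], dtype="boolean")
print(any_horizontal(a, b).tolist())  # [True, True, False]
```

A plain bool parameter like this changes control flow only, which is why a lazy Scalar would add complication without a clear use case.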

@MarcoGorelli

thanks for your review!

@MarcoGorelli MarcoGorelli merged commit 27d5fc4 into data-apis:main Nov 22, 2023
4 checks passed