Make DataFrame.any_rowwise top-level, rename to _horizontal #324

Merged: 8 commits, Nov 22, 2023

Conversation


@MarcoGorelli MarcoGorelli commented Nov 17, 2023

The current any_rowwise function isn't particularly useful for filtering:

df: DataFrame
df.filter((df > 0).any_rowwise())

is undefined, because the parent dataframe of (df > 0).any_rowwise() is df > 0, not df.

Solution: introduce namespace.any_rowwise and namespace.all_rowwise. Also, rename to _horizontal for familiarity for Polars users (I'm not aware of other names; in Spark, I think you're meant to do something like

from functools import reduce
predicate = reduce(lambda a, b: a | b, [df[x] != 0 for x in cols])

)
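
For reference, the reduce-based pattern above can be made runnable with plain pandas as a stand-in (the dataframe, its data, and the cols list here are hypothetical):

```python
from functools import reduce

import pandas as pd

# Hypothetical data; cols is the set of columns to test.
df = pd.DataFrame({"a": [1, 0, 2], "b": [0, 0, 3]})
cols = ["a", "b"]

# Elementwise OR across the per-column predicates, as in the
# Spark-style reduce pattern above.
predicate = reduce(lambda a, b: a | b, [df[x] != 0 for x in cols])
print(predicate.tolist())  # [True, False, True]
```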

Then, you can write

    >>> df: DataFrame
    >>> ns = df.__dataframe_namespace__()
    >>> mask = ns.all_horizontal(
    ...     *[df.col(col_name) > 0 for col_name in df.column_names()]
    ... )
    >>> df = df.filter(mask)
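
As a rough sketch of what such a top-level function could look like (using pandas Series as stand-in Columns; this all_horizontal is a hypothetical reimplementation for illustration, not the standard's):

```python
from functools import reduce

import pandas as pd


def all_horizontal(*columns: pd.Series) -> pd.Series:
    # Elementwise AND across all the given boolean columns.
    return reduce(lambda a, b: a & b, columns)


df = pd.DataFrame({"x": [1, -1, 2], "y": [3, 4, -5]})
mask = all_horizontal(*[df[c] > 0 for c in df.columns])
print(df[mask]["x"].tolist())  # [1]
```

The key point is that all_horizontal takes Columns as arguments, so it has no hidden dependency on a single parent dataframe.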

Maybe, in a separate PR, we could add something like ns.all() for more easily applying an operation to all columns. But for now, this addresses the issue.

This came up here https://github.com/skrub-data/skrub/pull/827/files#r1397599891


The same applies to sorted_indices and unique_indices.

As a design rule, I'd suggest not introducing DataFrame methods which return Columns - such methods should be top-level functions which accept Columns

@MarcoGorelli MarcoGorelli requested review from kkraus14, rgommers, shwina and cbourjau and removed request for kkraus14 November 20, 2023 10:46
Comment on lines +25 to +27
df = df.filter(
ns.any_horizontal(*[df.col(col_name) > 0 for col_name in df.column_names]),
)
Contributor Author:

with #326, this could be simplified to

Suggested change
df = df.filter(
ns.any_horizontal(*[df.col(col_name) > 0 for col_name in df.column_names]),
)
df = df.filter(ns.any_horizontal(*[col > 0 for col in df.columns_iter()]))

@@ -290,3 +290,55 @@ def date(year: int, month: int, day: int) -> Scalar:
... )
>>> df.filter(mask)
"""


def any_horizontal(*columns: Column, skip_nulls: bool | Scalar = True) -> Column:
Collaborator:

Does skip_nulls need to accept a Scalar here? I figure it's only parameters that feed into data paths as opposed to things that change control flow where we'd want to accept Scalars?

I.E. could a lazy framework meaningfully handle it if a lazy scalar was given here without forcing materialization?

Contributor Author:

in theory, yes:

df: DataFrame
pdx = df.__dataframe_namespace__()
pdx.any_horizontal(*[col > 0 for col in df.columns_iter()], skip_nulls=df.col('a').get_value(0))

in practice, the above example is so contrived, I'm happy to just keep it as skip_nulls: bool here 😄

Collaborator:

Could Polars lazy meaningfully handle passing that "LazyScalar" into the skip_nulls parameter without forcing computation?

Contributor Author:

at the moment sum doesn't have skip_nulls

there are places in the api that do handle what look like lazy scalars*, like

In [47]: df = pl.LazyFrame({'a': [date(2020, 1, 1), date(2020, 1, 2)], 'b': ['1d', '2d']})

In [48]: df.with_columns(pl.col('a').dt.offset_by(pl.col('b').gather(0)))
Out[48]: <LazyFrame [2 cols, {"a": Date, "b": Utf8}] at 0x7F7F36B92410>

In [49]: df.with_columns(pl.col('a').dt.offset_by(pl.col('b').gather(0))).collect()
Out[49]:
shape: (2, 2)
┌────────────┬─────┐
│ a          ┆ b   │
│ ---        ┆ --- │
│ date       ┆ str │
╞════════════╪═════╡
│ 2020-01-02 ┆ 1d  │
│ 2020-01-03 ┆ 2d  │
└────────────┴─────┘

But realistically, there will probably never be any need for this, so it won't get added. So this reinforces your suggestion to just allow booleans here and keep it simple.

*technically it's a 1-row column which can broadcast, there's no scalar class in Polars itself
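
To make the skip_nulls question concrete, here is a hypothetical sketch using pandas nullable booleans as stand-in Columns; the null-handling semantics shown in the comments are an assumption for illustration, not the standard's definition:

```python
from functools import reduce

import pandas as pd


def any_horizontal(*columns: pd.Series, skip_nulls: bool = True) -> pd.Series:
    # Assumed semantics: with skip_nulls=True, nulls count as False;
    # with skip_nulls=False, nulls propagate through pandas' Kleene-logic OR.
    cols = [c.fillna(False) for c in columns] if skip_nulls else columns
    return reduce(lambda a, b: a | b, cols)


a = pd.Series([True, None, None], dtype="boolean")
b = pd.Series([False, True, None], dtype="boolean")
print(any_horizontal(a, b).tolist())  # [True, True, False]
```

A plain bool parameter like this changes control flow only, which is why a lazy Scalar would add complication without a clear use case.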

@MarcoGorelli

thanks for your review!

@MarcoGorelli MarcoGorelli merged commit 27d5fc4 into data-apis:main Nov 22, 2023
4 checks passed