[WIP] InterpolationJoiner dataframe api #827
Conversation
In the utilities we added to skrub._dataframe, it seems our oldest supported pandas version does not support the dataframe api?
I bumped the pandas version just to see if the CI runs, but having the dataframe api requires pandas 2.1.0 (release notes), which dates from August 2023
@MarcoGorelli in case you have the time I'm sure you would have advice for better use of the dataframe API in this one!
ooh, seeing you try this out has made my day! got some things I need to finish now but I'll take a careful look and see what we need to change upstream (I'm sure something will come up 😄 )
ooh, seeing you try this out has made my day!
I'm sooo happy about this PR, Marco! I love the way the support for polars is building in skrub
Hey @jeromedockes, some first comments.
Overall, I'm quite concerned by the complexity this PR introduces. Sure, using the dataframe-api-compat will be challenging for readability, but we need to be extra careful not to introduce even more challenging code to read and understand.
If the dataframe-api-compat is not mature enough yet, it might be wiser to wait rather than to have to change a lot of logic three months later. LMKWYT.
@pytest.fixture
def df(px):
I'm -1 for having a fixture and a function name that also looks like the standard dataframe variable df. I'm confused when I read this.
ok but note fixture names are the names of parameters in the test functions, so in test functions df will be a dataframe (and we never explicitly call a fixture). what name would you prefer, a longer name like example_dataframe?
If that's not too much of a constraint, then yes, I would prefer dataframe or example_dataframe, for instance.
import enum


class Selector(enum.Enum):
I'm in favor of avoiding enums and having strings instead for simplicity.
strings are already used for column names, so using a different type helps distinguish the two. we could also have more complex selectors that support set operations as in polars, but I thought it would be better to start with something simpler. we could also have two functions, select and select_dtypes, which both accept strings, although we might want support for selection by dtype in SelectCols transformers and we probably don't want a separate SelectColsByDtype transformer. how do you suggest we handle selection by dtypes?
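To make the trade-off concrete, here is a minimal sketch of the enum-based selection being discussed; the member names and the pandas-flavoured dispatch are assumptions for illustration, not the PR's actual code.

import enum

import pandas as pd


class Selector(enum.Enum):
    # illustrative members; the PR may define a different set
    ALL = enum.auto()
    NUMERIC = enum.auto()
    CATEGORICAL = enum.auto()


def select(dataframe, columns):
    # a Selector member means "select by dtype"; anything else is treated as
    # a list of column names, so dtype selection cannot be confused with a
    # column that happens to be named "NUMERIC"
    if isinstance(columns, Selector):
        if columns is Selector.NUMERIC:
            return dataframe.select_dtypes("number")
        if columns is Selector.CATEGORICAL:
            return dataframe.select_dtypes("category")
        return dataframe
    return dataframe[list(columns)]


df = pd.DataFrame({"name": ["a", "b"], "value": [1.5, 2.0]})
numeric_part = select(df, Selector.NUMERIC)  # only the "value" column
by_name = select(df, ["name"])               # only the "name" column

With strings reserved for column names, a call like select(df, Selector.NUMERIC) reads unambiguously, which is the distinction argued for above.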
 def select(dataframe, columns):
-    return dataframe.select(columns)
+    return dataframe.select(_check_selector(columns))
I'm confused. Why is there a _check_selector function in _polars but not in _pandas?
this helper wouldn't really be appropriate in pandas, because in pandas, depending on the selector, we need to select with either [] or select_dtypes. in polars we can just select() the output of _check_selector.
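For illustration, a hedged sketch of what a _check_selector-style translation could look like on the polars side, assuming the Selector enum discussed above; these are not the PR's actual definitions.

import polars.selectors as cs


def _check_selector(columns):
    # a Selector member is translated into a polars selector object; a plain
    # list of column names is passed through unchanged, because polars'
    # select() accepts both
    if columns is Selector.NUMERIC:
        return cs.numeric()
    if columns is Selector.CATEGORICAL:
        return cs.categorical()
    return columns

In pandas, by contrast, Selector.NUMERIC has to be routed to select_dtypes("number") while plain names go through [], so there is no single value such a helper could return; that is the asymmetry described above.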
def any_rowwise(dataframe):
    return _collect(dataframe.select(pl.any_horizontal(pl.all()))).get_column("any")
This is not readable.
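For reference, the same polars operation from the snippet just above could be unpacked into named intermediate steps (a sketch only; _collect is the PR's own helper, assumed here to collect a possibly lazy frame):

import polars as pl


def any_rowwise(dataframe):
    # one boolean column, True where any value in the row is True
    row_has_true = pl.any_horizontal(pl.all())
    collected = _collect(dataframe.select(row_has_true))  # _collect: helper from the PR
    return collected.get_column("any")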
CATEGORICAL = enum.auto()


def std(obj):
std is already the name of the standard deviation in all scientific modules. We must find another name for this function.
 assignments = []
-regression_table = aux_table.select_dtypes("number")
+regression_table = ns.select(aux_table.dataframe, ns.Selector.NUMERIC)
As said earlier, a string (e.g., "numeric") would be nicer, IMHO.
the dataframe-api way to do this is
pdx = aux_table.__dataframe_namespace__()
aux_table.select(
    *[
        col.name
        for col in aux_table.columns_iter()
        if pdx.is_dtype(col.dtype, "numeric")
    ]
)
 estimator = clone(estimator)
-kept_rows = target_table.notnull().all(axis=1).to_numpy()
+ns = skrubns(target_table.dataframe)
+kept_rows = ~(std(ns.any_rowwise(target_table.is_null().dataframe)).to_array())
This is hard to read, honestly.
Attempt to make this simpler:
ns, _ = get_df_namespace(target_table)
target_table = to_standard_df(target_table)
kept_rows = ns.any_rowwise(
    target_table.is_null().dataframe
)
kept_rows = ~to_standard_df(kept_rows).to_array()
WDYT?
Hopefully any_rowwise will be implemented in a future version of dataframe-api-compat and this snippet will be simplified to:
kept_rows = ~target_table.is_null().any_rowwise().to_array()
Also, @MarcoGorelli, having all_rowwise and notnull methods in dataframe-api-compat would help!
You can already use namespace.any_rowwise in dataframe-api-compat. For example, here's a function which keeps all rows where any element is greater than 0:
In [1]: import pandas as pd
In [2]: df_pd = pd.DataFrame({"a": [-1, 1, 3], "b": [-2, -1, 8]})
In [3]:
...: def my_dataframe_agnostic_function(df):
...: df = df.__dataframe_consortium_standard__()
...: pdx = df.__dataframe_namespace__()
...: df = df.filter(pdx.any_horizontal(*[col>0 for col in df.columns_iter()]))
...: return df.dataframe
...:
In [4]: my_dataframe_agnostic_function(df_pd)
Out[4]:
   a  b
0  1 -1
1  3  8
I need to get this changed in the standard, as DataFrame.any_rowwise doesn't really work if columns are backed by expressions (there's a reason it's a top-level function in Polars). Might be nice to have a way to do *[df.col(col_name) > 0 for col_name in df.column_names] more conveniently though. In Polars there's pl.all() - need to think of something.
btw, I think here you just need DataFrame.drop_nulls? that's in both the Standard and in dataframe-api-compat now
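For context, a minimal sketch of that suggestion, assuming pandas >= 2.1 with dataframe-api-compat installed (the same entry point as in the example earlier in this thread):

import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, None]})
df_std = df.__dataframe_consortium_standard__()  # standard-compliant wrapper
df_std = df_std.drop_nulls()                     # drop rows containing any null
result = df_std.dataframe                        # back to the native pandas frame

As the follow-up comments note, this drops rows from the dataframe only; it does not produce a boolean mask that can also be applied to the numpy target y.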
WDYT?
sure, I can add names for the intermediate steps. probably not the same name kept_rows for the kept rows and its complement, so maybe kept_rows = ~discarded_rows or something like that
for example, here's a function which keeps all rows where any element is greater than 0:
when I run this I get AttributeError: 'Namespace' object has no attribute 'any_rowwise'; do I need to install the development version of one of the packages? I did notice the error message about using namespace.any_rowwise but couldn't find it in the API specification
here I would like to build a boolean mask that I also need to apply to y, which is a numpy array, so any_rowwise seemed like a good option
we only got this merged into the standard yesterday (data-apis/dataframe-api#324). it's called any_horizontal - if you install the latest dataframe-api-compat you should have it
 key_values = key_values[kept_rows]
-Y = target_table.to_numpy()[kept_rows]
+target_table = target_table.persist()
+Y = target_table.to_array(None)[kept_rows]
For readability
-Y = target_table.to_array(None)[kept_rows]
+Y = target_table.to_array(dtype=None)[kept_rows]
I also need to change this in the standard; I think the output dtype should always be inferrable from the dtypes of the columns
if Y_values.ndim == 1:
    Y_values = Y_values[:, None]
cols = [
    api_ns.column_from_1d_array(y.astype(type(y[0])), name=c, dtype=dt)
This is hard to read.
api_ns.column_from_sequence(
    itertools.repeat(None, res["shape"][0]), name=c, dtype=dt
)
for c, dt in res["schema"].items()
For readability
-for c, dt in res["schema"].items()
+for name, dtype in res["schema"].items()
FWIW I'm aiming to tag the first non-beta version by February (data-apis/dataframe-api#319). Til then, I'm extremely happy if people experiment with it, but I would caution against putting a lot of work into using it
after all we won't be using the dataframe API for this, so it will be easier to just start a new branch
this changes the InterpolationJoiner to rely on the dataframe api (and some utilities added to skrub._dataframe) so that it works with polars

the tests are now parametrized with a fixture px that becomes pandas and polars
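For illustration, one way such a module-switching fixture could be written (a sketch; the actual fixture in the PR may differ):

import pytest


@pytest.fixture(params=["pandas", "polars"])
def px(request):
    # yields the pandas module and then the polars module, skipping the
    # test when the requested backend is not installed
    return pytest.importorskip(request.param)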