[WIP] InterpolationJoiner dataframe api #827
Conversation
In the utilities we added to skrub._dataframe, it seems our oldest supported pandas version does not support the dataframe api?
I bumped the pandas version just to see if the CI runs, but having the dataframe api requires pandas 2.1.0 (release notes), which dates from August 2023
@MarcoGorelli in case you have the time I'm sure you would have advice for better use of the dataframe API in this one!
ooh, seeing you try this out has made my day! got some things I need to finish now but I'll take a careful look and see what we need to change upstream (I'm sure something will come up 😄 )
ooh, seeing you try this out has made my day!
I'm sooo happy about this PR, Marco! I love the way the support for polars is building in skrub
Hey @jeromedockes, some first comments.
Overall, I'm quite concerned by the complexity this PR introduces. Sure, using the dataframe-api-compat will be challenging for readability, but we need to be extra careful not to introduce even more challenging code to read and understand.
If the dataframe-api-compat is not mature enough yet, it might be wiser to wait rather than to have to change a lot of logic three months later. LMKWYT.
@pytest.fixture
def df(px):
I'm -1 for having a fixture and a function name that also looks like the standard dataframe variable df. I'm confused when I read this.
ok but note fixture names are the names of parameters in the test functions, so in test functions df will be a dataframe (and we never explicitly call a fixture). what name would you prefer, a longer name like example_dataframe?
If that's not too much of a constraint, then yes, I would prefer dataframe or example_dataframe, for instance.
import enum


class Selector(enum.Enum):
I'm in favor of avoiding enums and having strings instead for simplicity.
strings are already used for column names, so using a different type helps distinguish the two. we could also have more complex selectors that support set operations as in polars, but I thought it would be better to start with something simpler. we could also have two functions, select and select_dtypes, which both accept strings, although we might want support for selection by dtype in SelectCols transformers and we probably don't want a separate SelectColsByDtype transformer. how do you suggest we handle selection by dtypes?
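To make the trade-off concrete, here is a minimal sketch of the enum-based selection being discussed; the member names and the pandas-flavoured dispatch are assumptions for illustration, not the PR's actual code.

import enum

import pandas as pd


class Selector(enum.Enum):
    # illustrative members; the PR may define a different set
    ALL = enum.auto()
    NUMERIC = enum.auto()
    CATEGORICAL = enum.auto()


def select(dataframe, columns):
    # a Selector member means "select by dtype"; anything else is treated as
    # a list of column names, so dtype selection cannot be confused with a
    # column that happens to be named "NUMERIC"
    if isinstance(columns, Selector):
        if columns is Selector.NUMERIC:
            return dataframe.select_dtypes("number")
        if columns is Selector.CATEGORICAL:
            return dataframe.select_dtypes("category")
        return dataframe
    return dataframe[list(columns)]


df = pd.DataFrame({"name": ["a", "b"], "value": [1.5, 2.0]})
numeric_part = select(df, Selector.NUMERIC)  # only the "value" column
by_name = select(df, ["name"])               # only the "name" column

With strings reserved for column names, a call like select(df, Selector.NUMERIC) reads unambiguously, which is the distinction argued for above.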
 def select(dataframe, columns):
-    return dataframe.select(columns)
+    return dataframe.select(_check_selector(columns))
I'm confused. Why is there a _check_selector function in _polars but not in _pandas?
this helper wouldn't really be appropriate in pandas, because in pandas, depending on the selector, we need to select with either [] or select_dtypes. in polars we can just select() the output of _check_selector.
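For illustration, a hedged sketch of what a _check_selector-style translation could look like on the polars side, assuming the Selector enum discussed above; these are not the PR's actual definitions.

import polars.selectors as cs


def _check_selector(columns):
    # a Selector member is translated into a polars selector object; a plain
    # list of column names is passed through unchanged, because polars'
    # select() accepts both
    if columns is Selector.NUMERIC:
        return cs.numeric()
    if columns is Selector.CATEGORICAL:
        return cs.categorical()
    return columns

In pandas, by contrast, Selector.NUMERIC has to be routed to select_dtypes("number") while plain names go through [], so there is no single value such a helper could return; that is the asymmetry described above.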
def any_rowwise(dataframe):
    return _collect(dataframe.select(pl.any_horizontal(pl.all()))).get_column("any")
This is not readable.
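For reference, the same polars operation from the snippet just above could be unpacked into named intermediate steps (a sketch only; _collect is the PR's own helper, assumed here to collect a possibly lazy frame):

import polars as pl


def any_rowwise(dataframe):
    # one boolean column, True where any value in the row is True
    row_has_true = pl.any_horizontal(pl.all())
    collected = _collect(dataframe.select(row_has_true))  # _collect: helper from the PR
    return collected.get_column("any")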
CATEGORICAL = enum.auto()


def std(obj):
std is already the name of the standard deviation in all scientific modules. We must find another name for this function.
 assignments = []
-regression_table = aux_table.select_dtypes("number")
+regression_table = ns.select(aux_table.dataframe, ns.Selector.NUMERIC)
As said earlier, a string (e.g., "numeric") would be nicer, IMHO.
the dataframe-api way to do this is
pdx = aux_table.__dataframe_namespace__()
aux_table.select(
    *[
        col.name
        for col in aux_table.columns_iter()
        if pdx.is_dtype(col.dtype, "numeric")
    ]
)
 estimator = clone(estimator)
-kept_rows = target_table.notnull().all(axis=1).to_numpy()
+ns = skrubns(target_table.dataframe)
+kept_rows = ~(std(ns.any_rowwise(target_table.is_null().dataframe)).to_array())
This is hard to read, honestly.
Attempt to make this simpler:
ns, _ = get_df_namespace(target_table)
target_table = to_standard_df(target_table)
kept_rows = ns.any_rowwise(
    target_table.is_null().dataframe
)
kept_rows = ~to_standard_df(kept_rows).to_array()
WDYT?
Hopefully any_rowwise will be implemented in a future version of dataframe-api-compat and this snippet will be simplified to:
kept_rows = ~target_table.is_null().any_rowwise().to_array()
Also, @MarcoGorelli, having all_rowwise and notnull methods in dataframe-api-compat would help!
You can already use namespace.any_rowwise in dataframe-api-compat. For example, here's a function which keeps all rows where any element is greater than 0:
In [1]: import pandas as pd
In [2]: df_pd = pd.DataFrame({"a": [-1, 1, 3], "b": [-2, -1, 8]})
In [3]:
...: def my_dataframe_agnostic_function(df):
...: df = df.__dataframe_consortium_standard__()
...: pdx = df.__dataframe_namespace__()
...: df = df.filter(pdx.any_horizontal(*[col>0 for col in df.columns_iter()]))
...: return df.dataframe
...:
In [4]: my_dataframe_agnostic_function(df_pd)
Out[4]:
   a  b
0  1 -1
1  3  8
I need to get this changed in the standard, as DataFrame.any_rowwise doesn't really work if columns are backed by expressions (there's a reason it's a top-level function in Polars). Might be nice to have a way to do *[df.col(col_name) > 0 for col_name in df.column_names] more conveniently though. In Polars there's pl.all() - need to think of something.
btw, I think here you just need DataFrame.drop_nulls? that's in both the Standard and in dataframe-api-compat now
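For context, a minimal sketch of that suggestion, assuming pandas >= 2.1 with dataframe-api-compat installed (the same entry point as in the example earlier in this thread):

import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, None]})
df_std = df.__dataframe_consortium_standard__()  # standard-compliant wrapper
df_std = df_std.drop_nulls()                     # drop rows containing any null
result = df_std.dataframe                        # back to the native pandas frame

As the follow-up comments note, this drops rows from the dataframe only; it does not produce a boolean mask that can also be applied to the numpy target y.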
WDYT?
sure, I can add names for the intermediate steps. probably not the same name kept_rows for the kept rows and its complement, so maybe kept_rows = ~discarded_rows or something like that
for example, here's a function which keeps all rows where any element is greater than 0:
when I run this I get AttributeError: 'Namespace' object has no attribute 'any_rowwise'; do I need to install the development version of one of the packages? I did notice the error message about using namespace.any_rowwise but couldn't find it in the API specification
here I would like to build a boolean mask that I also need to apply to y, which is a numpy array, so any_rowwise seemed like a good option
we only got this merged into the standard yesterday (data-apis/dataframe-api#324). it's called any_horizontal - if you install the latest dataframe-api-compat you should have it
 key_values = key_values[kept_rows]
-Y = target_table.to_numpy()[kept_rows]
+target_table = target_table.persist()
+Y = target_table.to_array(None)[kept_rows]
For readability
-Y = target_table.to_array(None)[kept_rows]
+Y = target_table.to_array(dtype=None)[kept_rows]
I also need to change this in the standard; I think the output dtype should always be inferrable from the dtypes of the columns
if Y_values.ndim == 1:
    Y_values = Y_values[:, None]
cols = [
    api_ns.column_from_1d_array(y.astype(type(y[0])), name=c, dtype=dt)
This is hard to read.
api_ns.column_from_sequence(
    itertools.repeat(None, res["shape"][0]), name=c, dtype=dt
)
for c, dt in res["schema"].items()
For readability
-for c, dt in res["schema"].items()
+for name, dtype in res["schema"].items()
FWIW I'm aiming to tag the first non-beta version by February (data-apis/dataframe-api#319). Til then, I'm extremely happy if people experiment with it, but I would caution against putting a lot of work into using it
after all we won't be using the dataframe API for this, so it will be easier to just start a new branch
this changes the InterpolationJoiner to rely on the dataframe api (and some utilities added to skrub._dataframe) so that it works with polars

the tests are now parametrized with a fixture px that becomes pandas and polars
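For illustration, one way such a module-switching fixture could be written (a sketch; the actual fixture in the PR may differ):

import pytest


@pytest.fixture(params=["pandas", "polars"])
def px(request):
    # yields the pandas module and then the polars module, skipping the
    # test when the requested backend is not installed
    return pytest.importorskip(request.param)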