[WIP] Interpolationjoiner dataframe api #827
Changes from all commits
912d545
cd3e942
3c120b6
9e56305
b0b7c89
adda6e4
668b849
6729905
9736d0d
418f3f3
b5f2c56
b7b0a59
c8b24f3
a9ac96d
2e6e19c
1be7187
c09f577
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
import pandas | ||
import pytest | ||
|
||
DATAFRAME_MODULES = [pandas] | ||
try: | ||
import polars | ||
|
||
DATAFRAME_MODULES.append(polars) | ||
except ImportError: | ||
pass | ||
|
||
|
||
@pytest.fixture(params=DATAFRAME_MODULES) | ||
def px(request): | ||
return request.param |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
from ._common import Selector, std, stdns | ||
from ._namespace import get_df_namespace, skrubns | ||
|
||
__all__ = ["get_df_namespace", "skrubns", "std", "stdns", "Selector"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
import enum | ||
|
||
|
||
class Selector(enum.Enum): | ||
Review comment: I'm in favor of avoiding enums and using strings instead, for simplicity.
Reply: Strings are already used for column names, so using a different type helps distinguish the two. We could also have more complex selectors that support set operations, as in Polars, but I thought it would be better to start with something simpler. We could also have 2 functions |
||
ALL = enum.auto() | ||
NONE = enum.auto() | ||
NUMERIC = enum.auto() | ||
CATEGORICAL = enum.auto() | ||
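Note on the enum-versus-string thread above: the point about keeping selectors distinct from column names can be illustrated with a small standalone sketch. The `resolve` helper and the `available` list are hypothetical and exist only for this illustration; they are not part of the PR, and the dtype-based selectors are deliberately left out.

```python
import enum


class Selector(enum.Enum):
    ALL = enum.auto()
    NONE = enum.auto()
    NUMERIC = enum.auto()
    CATEGORICAL = enum.auto()


def resolve(columns, available):
    """Hypothetical helper: turn a selector or a list of names into names."""
    if columns is Selector.ALL:
        return list(available)
    if columns is Selector.NONE:
        return []
    if isinstance(columns, Selector):
        # NUMERIC / CATEGORICAL need dtype information, which this sketch omits.
        raise NotImplementedError(f"selector not sketched here: {columns}")
    # Anything else is treated as an iterable of column names.
    return [name for name in columns if name in available]


# A column that happens to be called "ALL" does not collide with Selector.ALL.
available = ["ID", "ALL", "temp"]
by_selector = resolve(Selector.ALL, available)
by_name = resolve(["ALL"], available)
```

With plain strings, `"ALL"` would be ambiguous between "every column" and "the column named ALL"; with the enum the two cases have different types.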
|
||
|
||
def std(obj): | ||
|
||
try: | ||
return obj.__dataframe_consortium_standard__() | ||
except AttributeError: | ||
return obj.__column_consortium_standard__() | ||
|
||
|
||
def stdns(obj): | ||
Review comment: Ditto for this name |
||
try: | ||
return obj.__dataframe_consortium_standard__().__dataframe_namespace__() | ||
except AttributeError: | ||
return obj.__column_consortium_standard__().__column_namespace__() |
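The fallback used by `std` and `stdns` above can be tried without pandas or Polars installed. The sketch below assumes two stand-in classes, `FakeFrame` and `FakeColumn`, that are invented for this illustration and only mimic the two hook names that appear in the diff; the returned strings are placeholders, not real standard-compliant objects.

```python
class FakeFrame:
    """Stand-in for a dataframe exposing the dataframe hook (illustration only)."""

    def __dataframe_consortium_standard__(self):
        return "standard-dataframe"


class FakeColumn:
    """Stand-in for a series exposing only the column hook (illustration only)."""

    def __column_consortium_standard__(self):
        return "standard-column"


def std(obj):
    # Same dispatch as in the diff: try the dataframe hook first,
    # then fall back to the column hook if the attribute is missing.
    try:
        return obj.__dataframe_consortium_standard__()
    except AttributeError:
        return obj.__column_consortium_standard__()


frame_result = std(FakeFrame())
column_result = std(FakeColumn())

# An object with neither hook ends in an AttributeError from the fallback call.
try:
    std(object())
    unsupported_raised = False
except AttributeError:
    unsupported_raised = True
```

One thing to keep in mind with this pattern: an `AttributeError` raised inside the dataframe hook itself would also trigger the fallback, so an explicit `hasattr` check may be worth considering.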
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -97,3 +97,8 @@ def get_df_namespace(*dfs): | |
"Only Pandas or Polars dataframes are currently supported, " | ||
f"got {modules=!r}." | ||
) | ||
|
||
|
||
def skrubns(*dataframes): | ||
Review comment: Is this function worth its cognitive load? I fear it will make our code less readable for small gains.
Reply: Fair enough, I can remove it. The gain is that we almost never need the second item returned by
Reply: OK, that makes sense. We could instead split the |
||
ns, _ = get_df_namespace(*dataframes) | ||
return ns |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,6 +13,23 @@ | |
|
||
from skrub._utils import atleast_1d_or_none | ||
|
||
from ._common import Selector | ||
|
||
__all__ = [ | ||
"POLARS_SETUP", | ||
"make_dataframe", | ||
"make_series", | ||
"aggregate", | ||
"join", | ||
"split_num_categ_cols", | ||
"select", | ||
"drop", | ||
"Selector", | ||
"concat_horizontal", | ||
"any_rowwise", | ||
"to_pandas", | ||
] | ||
|
||
|
||
def make_dataframe(X, index=None): | ||
"""Convert a dictionary of columns into a Polars dataframe. | ||
|
@@ -263,5 +280,47 @@ | |
return num_cols, categ_cols | ||
|
||
|
||
def _check_selector(columns): | ||
if not isinstance(columns, Selector): | ||
return columns | ||
if columns is Selector.ALL: | ||
return cs.all() | ||
elif columns is Selector.NONE: | ||
return [] | ||
elif columns is Selector.NUMERIC: | ||
return cs.numeric() | ||
elif columns is Selector.CATEGORICAL: | ||
return cs.string(include_categorical=True) | ||
# we have covered all items in the enumeration | ||
assert False | ||
|
||
|
||
def select(dataframe, columns): | ||
return dataframe.select(_check_selector(columns)) | ||
Review comment: I'm confused. Why is there a
Reply: This helper wouldn't really be appropriate in pandas because, depending on the selector, in pandas we need to select with either |
||
|
||
|
||
def drop(dataframe, columns): | ||
return dataframe.drop(_check_selector(columns)) | ||
|
||
|
||
def any_rowwise(dataframe): | ||
return _collect(dataframe.select(pl.any_horizontal(pl.all()))).get_column("any") | ||
Review comment: This is not readable. |
||
|
||
|
||
def concat_horizontal(dataframe, *other_dataframes): | ||
return pl.concat( | ||
[_collect(dataframe)] + [_collect(df) for df in other_dataframes], | ||
how="horizontal", | ||
) | ||
|
||
|
||
def _collect(dataframe): | ||
if hasattr(dataframe, "collect"): | ||
dataframe = dataframe.collect() | ||
return dataframe | ||
|
||
|
||
def to_pandas(dataframe): | ||
if hasattr(dataframe, "collect"): | ||
dataframe = dataframe.collect() | ||
return dataframe.to_pandas() |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
import pytest | ||
|
||
from skrub._dataframe import Selector, skrubns | ||
|
||
|
||
@pytest.fixture | ||
def df(px): | ||
Review comment: I'm -1 on having a fixture and a function name that also looks like the standard dataframe variable
Reply: OK, but note that fixture names are the names of parameters in the test functions, so in test functions
Reply: If that's not too much of a constraint, then yes, I would prefer |
||
return px.DataFrame( | ||
{"ID": [2, 3, 7], "name": ["ab", "cd", "01"], "temp": [20.3, 40.9, 11.5]} | ||
) | ||
|
||
|
||
def test_select(df): | ||
ns = skrubns(df) | ||
assert list(ns.select(df, []).columns) == [] | ||
assert list(ns.select(df, ["name"]).columns) == ["name"] | ||
assert list(ns.select(df, Selector.ALL).columns) == list(df.columns) | ||
assert list(ns.select(df, Selector.NONE).columns) == [] | ||
assert list(ns.select(df, Selector.NUMERIC).columns) == ["ID", "temp"] | ||
assert list(ns.select(df, Selector.CATEGORICAL).columns) == ["name"] | ||
|
||
|
||
def test_drop(df): | ||
ns = skrubns(df) | ||
assert list(ns.drop(df, []).columns) == list(df.columns) | ||
assert list(ns.drop(df, ["name"]).columns) == ["ID", "temp"] | ||
assert list(ns.drop(df, Selector.ALL).columns) == [] | ||
assert list(ns.drop(df, Selector.NONE).columns) == list(df.columns) | ||
assert list(ns.drop(df, Selector.NUMERIC).columns) == ["name"] | ||
assert list(ns.drop(df, Selector.CATEGORICAL).columns) == ["ID", "temp"] | ||
|
||
|
||
def test_concat_horizontal(df): | ||
ns = skrubns(df) | ||
df1 = ( | ||
df.__dataframe_consortium_standard__() | ||
.rename_columns({c: f"{c}_1" for c in df.columns}) | ||
.dataframe | ||
) | ||
out = ns.concat_horizontal(df) | ||
assert list(out.columns) == list(df.columns) | ||
out = ns.concat_horizontal(df, df1) | ||
assert list(out.columns) == list(df.columns) + list(df1.columns) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
from skrub._dataframe import skrubns, std, stdns | ||
|
||
|
||
def test_std(px): | ||
df = px.DataFrame({"A": [1, 2]}) | ||
assert hasattr(std(df), "dataframe") | ||
assert hasattr(stdns(df), "dataframe_from_columns") | ||
ns = skrubns(df) | ||
s = ns.make_series([1, 2], name="A") | ||
assert hasattr(std(s), "column") | ||
assert hasattr(stdns(s), "dataframe_from_columns") |
Review comment: Why this bump for scikit-learn? Is it related to pandas?
Reply: Yes. For the dataframe API we need a recent pandas, and because of breaking changes in recent pandas versions we need a recent scikit-learn version. I think this means we may have to wait a bit before using the dataframe API.
Reply: You can use an earlier version, but you'll have to write your own helper function to opt in to it, like:
Reply: Cool, thanks! That will definitely help us start using the dataframe API sooner. I think it's not a problem to require a recent version of the dataframe-api-compat package, whereas we should support a few older releases of pandas and scikit-learn.