Woodwork 0.18.0 compatibility #3700
Conversation
Codecov Report
```
@@           Coverage Diff           @@
##            main   #3700     +/-  ##
=======================================
+ Coverage    99.7%   99.7%    +0.1%
=======================================
  Files         339     339
  Lines       34307   34431     +124
=======================================
+ Hits        34179   34304     +125
+ Misses        128     127       -1
```
```diff
-        else:
-            y_ww = y
-        return self.fit(X_ww, y_ww).transform(X_ww, y_ww)
+        return self.fit(X, y).transform(X, y)
```
Both `fit` and `transform` already have `infer_feature_types` calls, so these felt redundant.
```diff
@@ -311,5 +312,8 @@ def transform(self, X, y=None):
         if cleaned_y is not None:
             cleaned_y = cleaned_y["target"]
+            cleaned_y = ww.init_series(cleaned_y)
+
+        cleaned_x.ww.init()
```
No prior ww initialization existed for `cleaned_X` or `cleaned_y`.
Are there any columns whose data types we know for certain at this point, so we can pass them to `init` and reduce the amount of type inference?
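For illustration, a minimal sketch of that idea with made-up column names: `ww.init` accepts a `logical_types` mapping, and any column listed there skips Woodwork's type inference.

```python
import pandas as pd
import woodwork as ww  # noqa: F401 -- registers the .ww accessor on pandas

# Hypothetical frame standing in for cleaned_x; "target" and "flag" are
# assumed names. Columns named in logical_types are not inferred.
cleaned_x = pd.DataFrame({"target": [1.0, 2.0], "flag": [True, False]})
cleaned_x.ww.init(logical_types={"target": "Double", "flag": "Boolean"})
print(cleaned_x.ww.types)
```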
```diff
         y_t = ww.init_series(
             y_t,
-            logical_type=y_ww.ww.logical_type,
+            logical_type=new_logical_type,
```
`y` no longer needs to be `IntegerNullable` post-imputation.
```diff
@@ -37,7 +37,7 @@ outputs:
     - requirements-parser >=0.2.0
     - shap >=0.40.0
     - texttable >=1.6.2
-    - woodwork >=0.16.2, < 0.17.0
+    - woodwork >=0.18.0
```
Supporting a minimum of `woodwork==0.18.0`; I wanted to make sure the `replace_nan` optimizations were supported at a minimum, but feel free to push back on this.
```diff
@@ -170,6 +170,7 @@ def fit(self, X, y=None):
             ValueError: If y was not passed in.
         """
         if X is not None:
+            X = downcast_nullable_types(X)
```
Downcasting because ARIMA doesn't support nullables
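For context, a minimal sketch of what the downcast buys us (assumed behavior of `downcast_nullable_types`, not the actual helper):

```python
import pandas as pd

# pandas extension dtypes like Int64 carry pd.NA, which statsmodels' ARIMA
# can't consume; downcasting swaps them for plain numpy dtypes.
X = pd.DataFrame({"feat": pd.array([1, 2, None], dtype="Int64")})
X_downcast = X.astype({"feat": "float64"})  # NaN-capable numpy dtype
print(X_downcast.dtypes)  # feat    float64
```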
```diff
@@ -133,7 +134,7 @@ def fit(self, X, y=None):
                 cat_cols.remove(col)
                 bool_cols.append(col)

-        nan_ratio = X.ww.describe().loc["nan_count"] / X.shape[0]
+        nan_ratio = X.isna().sum() / X.shape[0]
```
`ww.describe` is a comprehensive function that calculates a wide array of statistics, but here we're only interested in the nan count it computes along the way. Changing this led to a large `fit` time improvement.

The only difference in this implementation is that NaN `LatLong` values aren't supported, which doesn't seem like a big factor to me, especially since we don't have an inference function for latlongs yet.
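A self-contained illustration of the pandas-only computation (example data made up):

```python
import pandas as pd

X = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", None, None]})
nan_ratio = X.isna().sum() / X.shape[0]
print(nan_ratio)  # a: 0.333..., b: 0.666... -- per-column NaN fraction
```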
```python
X_no_all_null.ww[numeric_col] = init_series(
    imputed[numeric_col],
    logical_type="Double",
)
```
Initializing each of these imputed columns as a series with a `Double` logical type (because they belong to `X_numeric.columns`) led to another large `fit` time improvement.
```diff
-        else:
-            y_ww = y
-        return self.fit(X_ww, y_ww).transform(X_ww, y_ww)
+        return self.fit(X, y).transform(X, y)
```
`infer_feature_types` and checks for `y` were being made in `fit` and `transform` already, so this felt superfluous.
```python
            for null_col in X.ww.select(IntegerNullable).columns
        },
    )
    X.ww.init(schema=X.ww.schema)
```
Keeping `X.ww.init(schema=X.ww.schema)` inside the conditional led to a large improvement in `fit` time.
```diff
-    X = infer_feature_types(X)
+    if not isinstance(X, pd.DataFrame):
+        X = infer_feature_types(X)
     X = X.select_dtypes(exclude=["datetime"])
```
Saves a call to `infer_feature_types`, since the call is also made in `fit` and `transform`.
Leaving a few questions/suggestions for now, will come back to finish review later!
evalml/pipelines/components/transformers/samplers/base_sampler.py (outdated comment; resolved)
Looks pretty good, just had a few questions/comments but I don't think anything's blocking. Not looking forward to handling the merge conflicts with my hardening PR, haha
```python
# TODO: Added this in because numpy's unique() does not support pandas.NA
try:
    self._classes_ = list(ww.init_series(np.unique(y)))
except TypeError as e:
    if "boolean value of NA is ambiguous" in str(e):
        self._classes_ = y.unique()
```
Does `y.unique()` work in all cases? I'd rather keep things simpler if possible.
I think this was added just to replace `np.unique` in cases of `pd.NA`; otherwise the behaviour should be the same. @chukarsten, was that the intention?
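For reference, a small repro of the failure mode being caught here (assuming a nullable-integer target):

```python
import numpy as np
import pandas as pd

y = pd.Series([1, 0, pd.NA], dtype="Int64")
print(y.unique())  # [1, 0, <NA>] -- pandas handles pd.NA fine
np.unique(y)       # raises TypeError: boolean value of NA is ambiguous,
                   # because sorting compares elements against pd.NA
```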
```python
X = downcast_int_nullable_to_double(X)
X = X.fillna(X.mean())
X, y = self._manage_woodwork(X, y)
```
We don't have a guarantee here that X has Woodwork information until after the call to `_manage_woodwork`. Was this call added to work with the `fillna` or something later on?
Yes, this call was added to work with `fillna`, but also more generally because ARIMA won't accept `Int64` dtypes.
Gotcha. It may be safer to move the call to `_manage_woodwork` above this regardless, if I'm not missing something.
I'm hesitant about moving it into `_manage_woodwork` because that affects every instance where it's called. But I believe we're in the process of unifying nullable and non-nullable types, so this won't be an issue soon.
Sorry, I just meant calling `_manage_woodwork` before calling `downcast_int_nullable_to_double` within this file, not changing the `_manage_woodwork` function itself! Apologies if that was unclear.
```diff
-two_distinct_with_nulls_y_ww.ww.init()
+two_distinct_with_nulls_y_ww = ww.init_series(two_distinct_with_nulls_y_ww)
```
What's the difference between these two functions? Why use one vs the other?
This is because `series.ww.init()` doesn't actually transform the underlying logical type: `WoodworkColumnAccessor` is being called, not `WoodworkTableAccessor`, since this is a `Series` and not a `DataFrame`.

Per here, "In cases where the LogicalType requires the Series dtype to change, a helper function ww.init_series must be used. This function will return a new Series object with Woodwork initialized and the dtype of the series changed to match the physical type of the LogicalType."

Since the series is originally `float` due to the null present, it needs to have its physical type properly inferred and changed to match the `IntegerNullable` dtype of `Int64`.
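A minimal illustration of the distinction, with a made-up series:

```python
import pandas as pd
import woodwork as ww

s = pd.Series([1.0, 2.0, None])  # float64 because of the null
s2 = ww.init_series(s, logical_type="IntegerNullable")
print(s2.dtype)  # Int64 -- init_series returns a new series whose physical
                 # type matches the logical type
# s.ww.init(logical_type="IntegerNullable") would raise instead, since the
# column accessor cannot change the underlying dtype.
```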
```python
    return X


def downcast_int_nullable_to_double(X):
```
Why add a separate function here? Isn't this just a subset of the `downcast_nullable_types` behavior?
These two functions provide different use cases currently; I've filed an issue here!
```python
X_df = X_df.astype(
    {"int col": float},
)  # Convert to float as the imputer will do this as we're requesting the mean
```
We shouldn't need this casting; median is fine staying as an integer type!
I believe `median` can end up imputing a float between integer values, so to play it safe it's being cast as such in the `SimpleImputer` for both `mean` and `median`.
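For example, the median of an even-length integer column is interpolated:

```python
import pandas as pd

print(pd.Series([1, 2, 3, 4]).median())  # 2.5 -- not representable as int
```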
LGTM - amazing work @ParthivNaresh
```python
try:
    np_series = np_series.astype("int64")
except TypeError:
    np_series = ww_series.astype(float).to_numpy()
```
Is this used in case there are NaNs?
I think so, but @chukarsten might know better.
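For reference, a likely repro of that failure path (assumed, since the full context isn't shown):

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, pd.NA], dtype=object)
arr.astype("int64")  # TypeError: pd.NA cannot be coerced to an int,
                     # hence the fallback through astype(float)
```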
```python
new_ltypes_for_bool_nullable_cols = {
    col: "Boolean" for col in X_bool_nullable_cols
}
X_int_nullable_cols = X_schema._filter_cols(include=["IntegerNullable"])
```
Should we just use the downcasting function here?
It seems that there's some issue with booleans being cast to floats when using `downcast_nullable_types`. I filed this so we can get some consistency in behaviour using a single downcast function.
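For illustration, a sketch of a single unified downcast step using Woodwork's public `set_types` API (the column setup here is hypothetical):

```python
import pandas as pd
import woodwork as ww  # noqa: F401 -- registers the .ww accessor

X = pd.DataFrame({"flag": pd.array([True, False], dtype="boolean")})
X.ww.init()  # inferred as BooleanNullable
# Remap nullable logical types in one place; with no nulls present the
# BooleanNullable -> Boolean downcast is a plain dtype conversion.
X.ww.set_types(logical_types={"flag": "Boolean"})
print(X["flag"].dtype)  # bool -- numpy dtype, no float casting involved
```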
Force-pushed from 5ce24ce to 1d41462.
Fixes #3652.