Woodwork 0.18.0 compatibility #3700

Merged: 1 commit into main from woodwork_compat_0_18_0 on Sep 13, 2022
Conversation

ParthivNaresh (Contributor) commented Sep 7, 2022:

Fixes #3652.

codecov bot commented Sep 7, 2022:

Codecov Report

Merging #3700 (1d41462) into main (e93f513) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3700     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        339     339             
  Lines      34307   34431    +124     
=======================================
+ Hits       34179   34304    +125     
+ Misses       128     127      -1     
Impacted Files Coverage Δ
.../tests/component_tests/test_datetime_featurizer.py 100.0% <ø> (ø)
.../component_tests/test_drop_nan_rows_transformer.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_utils.py 99.1% <ø> (-<0.1%) ⬇️
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <ø> (ø)
...valml/tests/pipeline_tests/test_component_graph.py 99.9% <ø> (ø)
evalml/utils/__init__.py 100.0% <ø> (ø)
evalml/model_understanding/metrics.py 100.0% <100.0%> (ø)
evalml/pipelines/classification_pipeline.py 100.0% <100.0%> (ø)
...omponents/estimators/regressors/arima_regressor.py 100.0% <100.0%> (ø)
...onents/estimators/regressors/catboost_regressor.py 100.0% <100.0%> (ø)
... and 26 more


else:
y_ww = y
return self.fit(X_ww, y_ww).transform(X_ww, y_ww)
return self.fit(X, y).transform(X, y)
ParthivNaresh (Contributor Author):
Both fit and transform already have infer_feature_types calls, so these felt redundant.
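(For context, a minimal sketch of the resulting pattern; names follow the snippet above.)

def fit_transform(self, X, y=None):
    # fit and transform each call infer_feature_types themselves,
    # so fit_transform can delegate without converting first.
    return self.fit(X, y).transform(X, y)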

@@ -311,5 +312,8 @@ def transform(self, X, y=None):

if cleaned_y is not None:
cleaned_y = cleaned_y["target"]
cleaned_y = ww.init_series(cleaned_y)

cleaned_x.ww.init()
ParthivNaresh (Contributor Author):
No prior ww initialization existed for cleaned_X and cleaned_y.

Contributor:
Are there any columns whose data types we know for certain at this point, so we can pass them to init and reduce the amount of type inference?
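(For reference, Woodwork's table init accepts explicit logical types to skip inference for known columns; a minimal sketch with hypothetical column names.)

import pandas as pd
import woodwork as ww

df = pd.DataFrame({"age": [25, 31, None], "name": ["a", "b", "c"]})

# Columns listed in logical_types skip inference entirely;
# the remaining columns are still inferred.
df.ww.init(logical_types={"age": "IntegerNullable"})
print(df.ww.types)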

y_t = ww.init_series(
y_t,
logical_type=y_ww.ww.logical_type,
logical_type=new_logical_type,
ParthivNaresh (Contributor Author):
y no longer needs to be IntegerNullable post-imputation.

@@ -37,7 +37,7 @@ outputs:
- requirements-parser >=0.2.0
- shap >=0.40.0
- texttable >=1.6.2
- woodwork >=0.16.2, < 0.17.0
- woodwork >=0.18.0
ParthivNaresh (Contributor Author):
Supporting a minimum of woodwork==0.18.0; I wanted to make sure the replace_nan optimizations were supported at a minimum, but feel free to push back on this.

@@ -170,6 +170,7 @@ def fit(self, X, y=None):
ValueError: If y was not passed in.
"""
if X is not None:
X = downcast_nullable_types(X)
ParthivNaresh (Contributor Author):
Downcasting because ARIMA doesn't support nullable types.
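(As a rough illustration only, not the PR's implementation: a downcast helper along these lines converts pandas nullable dtypes into numpy-backed ones that ARIMA's backend accepts.)

import pandas as pd

def downcast_nullable_types_sketch(X: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical sketch: Int64/boolean nullable columns can't be
    # handed to statsmodels ARIMA, so cast them to float64
    # (pd.NA becomes NaN).
    X = X.copy()
    for col in X.columns:
        if isinstance(X[col].dtype, (pd.Int64Dtype, pd.BooleanDtype)):
            X[col] = X[col].astype("float64")
    return X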

@@ -133,7 +134,7 @@ def fit(self, X, y=None):
cat_cols.remove(col)
bool_cols.append(col)

nan_ratio = X.ww.describe().loc["nan_count"] / X.shape[0]
nan_ratio = X.isna().sum() / X.shape[0]
ParthivNaresh (Contributor Author) commented Sep 8, 2022:

ww.describe is a comprehensive function that calculates a wide array of statistics, but here we're only interested in the nan count, which Woodwork computes with that same call. Changing this led to a large fit-time improvement.

The only difference in this implementation is that NaN LatLong values aren't supported, which doesn't seem like a big factor to me, especially since we don't have an inference function for LatLongs yet.
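(A quick pandas illustration of the cheaper computation.)

import numpy as np
import pandas as pd

X = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [1, 2, 3, 4]})

# Per-column fraction of missing values, without paying for the
# full ww.describe() statistics pass.
nan_ratio = X.isna().sum() / X.shape[0]
print(nan_ratio)  # a: 0.5, b: 0.0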

X_no_all_null.ww[numeric_col] = init_series(
imputed[numeric_col],
logical_type="Double",
)
ParthivNaresh (Contributor Author):
Initializing each of these imputed columns as a series with a Double logical type (because they belong to X_numeric.columns) led to another large fit-time improvement.
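(For illustration: assigning an imputed column back with an explicit Double logical type means Woodwork skips type inference for it; the column values here are made up.)

import pandas as pd
from woodwork import init_series

imputed_col = pd.Series([1.0, 2.0, 2.5])

# Explicit logical type, so no inference pass is needed.
col = init_series(imputed_col, logical_type="Double")
print(col.ww.logical_type)  # Double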

else:
y_ww = y
return self.fit(X_ww, y_ww).transform(X_ww, y_ww)
return self.fit(X, y).transform(X, y)
ParthivNaresh (Contributor Author):
infer_feature_types and checks for y were already being made in fit and transform, so this felt superfluous.

for null_col in X.ww.select(IntegerNullable).columns
},
)
X.ww.init(schema=X.ww.schema)
ParthivNaresh (Contributor Author):
Keeping X.ww.init(schema=X.ww.schema) inside the conditional led to a large improvement in fit time
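(A sketch of the guarded pattern, using Woodwork's set_types as a stand-in for the PR's X.ww.init(schema=X.ww.schema) call; the point is that the costly schema update only runs when nullable columns were actually found.)

import pandas as pd
import woodwork as ww

X = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64"), "b": [1.0, 2.0, 3.0]})
X.ww.init()

int_nullable_cols = list(X.ww.select("IntegerNullable").columns)
if int_nullable_cols:
    # Only this branch pays for the schema update; frames with no
    # nullable integer columns skip it entirely.
    X.ww.set_types(logical_types={col: "Double" for col in int_nullable_cols})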

X = infer_feature_types(X)
if not isinstance(X, pd.DataFrame):
X = infer_feature_types(X)
X = X.select_dtypes(exclude=["datetime"])
ParthivNaresh (Contributor Author):
Saves a call to infer_feature_types since the call is also made in fit and transform

bchen1116 (Contributor) left a comment:

Leaving a few questions/suggestions for now, will come back to finish review later!

Resolved review threads:
evalml/pipelines/classification_pipeline.py
evalml/tests/automl_tests/parallel_tests/test_cf_engine.py (outdated)
evalml/tests/automl_tests/test_automl.py (outdated)
eccabay (Contributor) left a comment:

Looks pretty good, just had a few questions/comments but I don't think anything's blocking. Not looking forward to handling the merge conflicts with my hardening PR, haha

Comment on lines +72 to +78
# TODO: Added this in because numpy's unique() does not support pandas.NA
try:
self._classes_ = list(ww.init_series(np.unique(y)))
except TypeError as e:
if "boolean value of NA is ambiguous" in str(e):
self._classes_ = y.unique()

eccabay (Contributor) commented Sep 9, 2022:

Does y.unique() work in all cases? I'd rather keep things simpler if possible.

ParthivNaresh (Contributor Author):
I think this was added just to replace np.unique in cases of pd.NA; otherwise the behaviour should be the same. @chukarsten, was that the intention?
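(A small reproduction of the failure mode being worked around, at the pandas/numpy versions of the time.)

import numpy as np
import pandas as pd

y = pd.Series([True, False, pd.NA], dtype="boolean")

try:
    np.unique(y)
except TypeError as err:
    # np.unique sorts its input, and ordering comparisons against
    # pd.NA raise "boolean value of NA is ambiguous".
    print(err)

print(y.unique())  # [True, False, <NA>] -- pandas handles NA directly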

Comment on lines +177 to 179
X = downcast_int_nullable_to_double(X)
X = X.fillna(X.mean())
X, y = self._manage_woodwork(X, y)
Contributor:
We don't have a guarantee here that X has woodwork information until after the call to manage_woodwork. Was this call added to work with the fillna or something later on?

ParthivNaresh (Contributor Author):
Yes, this call was added to work with fillna, but also more generally because ARIMA won't accept Int64 dtypes.

Contributor:
Gotcha. It may be safer to move the call to manage_woodwork above this regardless, if I'm not missing something.

ParthivNaresh (Contributor Author):
I'm hesitant about moving it to manage_woodwork because that affects every instance where it's called. But I believe we're in the process of unifying nullable and non-nullable types, so this won't be an issue soon.

Contributor:
Sorry sorry, I just meant calling _manage_woodwork before calling downcast_int_nullable_to_double within this file, not within the _manage_woodwork function itself! Apologies if that was unclear.

Resolved review thread: evalml/tests/component_tests/test_time_series_imputer.py (outdated)
two_distinct_with_nulls_y_ww.ww.init()
two_distinct_with_nulls_y_ww = ww.init_series(two_distinct_with_nulls_y_ww)
Contributor:
What's the difference between these two functions? Why use one vs the other?

ParthivNaresh (Contributor Author) commented Sep 12, 2022:

This is because series.ww.init() doesn't actually transform the underlying logical type, since WoodworkColumnAccessor is being called rather than WoodworkTableAccessor (this is a series, not a DataFrame).

Per here: “In cases where the LogicalType requires the Series dtype to change, a helper function ww.init_series must be used. This function will return a new Series object with Woodwork initialized and the dtype of the series changed to match the physical type of the LogicalType.”

Since the series is originally float due to the null present, it needs its physical type changed to match IntegerNullable's physical dtype of Int64.
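(A minimal demonstration of the distinction.)

import pandas as pd
import woodwork as ww

# Physical type is float64, because of the null.
y = pd.Series([1.0, 2.0, None])

# The column accessor's init() can't change the underlying dtype;
# ww.init_series returns a NEW series whose dtype matches the
# requested logical type.
y_int = ww.init_series(y, logical_type="IntegerNullable")
print(y.dtype)      # float64
print(y_int.dtype)  # Int64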

Resolved review threads:
evalml/tests/integration_tests/test_nullable_types.py (outdated)
evalml/utils/woodwork_utils.py (outdated)
return X


def downcast_int_nullable_to_double(X):
Contributor:
Why add a separate function here? Isn't this just a subset of the downcast_nullable_types behavior?

ParthivNaresh (Contributor Author):
These two functions serve different use cases currently; I've filed an issue here!

Comment on lines +561 to +563
X_df = X_df.astype(
{"int col": float},
) # Convert to float as the imputer will do this as we're requesting the mean
Contributor:
We shouldn't need this casting; median is fine staying as an integer type!

ParthivNaresh (Contributor Author):
I believe median can end up imputing a float between integer values, so to play it safe, it's cast as a float in the SimpleImputer for both mean and median.
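(A one-liner shows why: with an even number of observations, the median interpolates between the two middle values.)

import pandas as pd

print(pd.Series([1, 2, 3, 4]).median())  # 2.5 -- not representable as an integer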

jeremyliweishih (Collaborator) left a comment:

LGTM - amazing work @ParthivNaresh

try:
np_series = np_series.astype("int64")
except TypeError:
np_series = ww_series.astype(float).to_numpy()
Collaborator:
is this used in case there are NaNs?

ParthivNaresh (Contributor Author):
I think so, but @chukarsten might know better.
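(A quick check of the behavior in question, mirroring the variable names above: casting an object array containing pd.NA to int64 raises, so the float fallback covers missing values.)

import pandas as pd

ww_series = pd.Series([1, 2, pd.NA], dtype="Int64")
np_series = ww_series.to_numpy()  # object array holding pd.NA

try:
    np_series = np_series.astype("int64")
except TypeError:
    # pd.NA has no integer conversion; going through float turns
    # the missing value into NaN instead.
    np_series = ww_series.astype(float).to_numpy()

print(np_series)  # [ 1.  2. nan]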

new_ltypes_for_bool_nullable_cols = {
col: "Boolean" for col in X_bool_nullable_cols
}
X_int_nullable_cols = X_schema._filter_cols(include=["IntegerNullable"])
Collaborator:
should we just use the downcasting function here?

ParthivNaresh (Contributor Author):
It seems that there's some issue with booleans being cast to floats when using downcast_nullable_types. I filed this so we can get some consistency in behaviour using a single downcast function.

ParthivNaresh merged commit 034068a into main on Sep 13, 2022.
ParthivNaresh deleted the woodwork_compat_0_18_0 branch on September 13, 2022 at 13:27.
chukarsten mentioned this pull request on Sep 20, 2022.
Successfully merging this pull request may close these issues.

Re-enable Support for Dask/Concurrent Futures Engines