Woodwork 0.18.0 compatibility #3700

Merged: 1 commit into main from woodwork_compat_0_18_0 on Sep 13, 2022
Conversation

ParthivNaresh (Contributor) commented Sep 7, 2022:

Fixes #3652.

codecov bot commented Sep 7, 2022:

Codecov Report

Merging #3700 (1d41462) into main (e93f513) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3700     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        339     339             
  Lines      34307   34431    +124     
=======================================
+ Hits       34179   34304    +125     
+ Misses       128     127      -1     
Impacted Files Coverage Δ
.../tests/component_tests/test_datetime_featurizer.py 100.0% <ø> (ø)
.../component_tests/test_drop_nan_rows_transformer.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_utils.py 99.1% <ø> (-<0.1%) ⬇️
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <ø> (ø)
...valml/tests/pipeline_tests/test_component_graph.py 99.9% <ø> (ø)
evalml/utils/__init__.py 100.0% <ø> (ø)
evalml/model_understanding/metrics.py 100.0% <100.0%> (ø)
evalml/pipelines/classification_pipeline.py 100.0% <100.0%> (ø)
...omponents/estimators/regressors/arima_regressor.py 100.0% <100.0%> (ø)
...onents/estimators/regressors/catboost_regressor.py 100.0% <100.0%> (ø)
... and 26 more


else:
y_ww = y
return self.fit(X_ww, y_ww).transform(X_ww, y_ww)
return self.fit(X, y).transform(X, y)
ParthivNaresh (Contributor Author):
Both fit and transform already have infer_feature_types calls, so these felt redundant.
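(For context, a minimal sketch of the resulting pattern; names follow the snippet above.)

def fit_transform(self, X, y=None):
    # fit and transform each call infer_feature_types themselves,
    # so fit_transform can delegate without converting first.
    return self.fit(X, y).transform(X, y)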

@@ -311,5 +312,8 @@ def transform(self, X, y=None):

if cleaned_y is not None:
cleaned_y = cleaned_y["target"]
cleaned_y = ww.init_series(cleaned_y)

cleaned_x.ww.init()
ParthivNaresh (Contributor Author):
No prior ww initialization existed for cleaned_X and cleaned_y.

Contributor:
Are there any columns whose data types we know for certain at this point, so we can pass them to init and reduce the amount of type inference?
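(For reference, Woodwork's table init accepts explicit logical types to skip inference for known columns; a minimal sketch with hypothetical column names.)

import pandas as pd
import woodwork as ww

df = pd.DataFrame({"age": [25, 31, None], "name": ["a", "b", "c"]})

# Columns listed in logical_types skip inference entirely;
# the remaining columns are still inferred.
df.ww.init(logical_types={"age": "IntegerNullable"})
print(df.ww.types)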

y_t = ww.init_series(
y_t,
logical_type=y_ww.ww.logical_type,
logical_type=new_logical_type,
ParthivNaresh (Contributor Author):
y no longer needs to be IntegerNullable post-imputation.

@@ -37,7 +37,7 @@ outputs:
- requirements-parser >=0.2.0
- shap >=0.40.0
- texttable >=1.6.2
- woodwork >=0.16.2, < 0.17.0
- woodwork >=0.18.0
ParthivNaresh (Contributor Author):
Supporting a minimum of woodwork==0.18.0; I wanted to make sure the replace_nan optimizations were supported at a minimum, but feel free to push back on this.

@@ -170,6 +170,7 @@ def fit(self, X, y=None):
ValueError: If y was not passed in.
"""
if X is not None:
X = downcast_nullable_types(X)
ParthivNaresh (Contributor Author):
Downcasting because ARIMA doesn't support nullable types.
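(As a rough illustration only, not the PR's implementation: a downcast helper along these lines converts pandas nullable dtypes into numpy-backed ones that ARIMA's backend accepts.)

import pandas as pd

def downcast_nullable_types_sketch(X: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical sketch: Int64/boolean nullable columns can't be
    # handed to statsmodels ARIMA, so cast them to float64
    # (pd.NA becomes NaN).
    X = X.copy()
    for col in X.columns:
        if isinstance(X[col].dtype, (pd.Int64Dtype, pd.BooleanDtype)):
            X[col] = X[col].astype("float64")
    return X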

@@ -133,7 +134,7 @@ def fit(self, X, y=None):
cat_cols.remove(col)
bool_cols.append(col)

nan_ratio = X.ww.describe().loc["nan_count"] / X.shape[0]
nan_ratio = X.isna().sum() / X.shape[0]
ParthivNaresh (Contributor Author) commented Sep 8, 2022:

ww.describe is a comprehensive function that calculates a wide array of statistics, but here we're only interested in the nan count, which Woodwork computes with that same call. Changing this led to a large fit-time improvement.

The only difference in this implementation is that NaN LatLong values aren't supported, which doesn't seem like a big factor to me, especially since we don't have an inference function for LatLongs yet.
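(A quick pandas illustration of the cheaper computation.)

import numpy as np
import pandas as pd

X = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [1, 2, 3, 4]})

# Per-column fraction of missing values, without paying for the
# full ww.describe() statistics pass.
nan_ratio = X.isna().sum() / X.shape[0]
print(nan_ratio)  # a: 0.5, b: 0.0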

X_no_all_null.ww[numeric_col] = init_series(
imputed[numeric_col],
logical_type="Double",
)
ParthivNaresh (Contributor Author):
Initializing each of these imputed columns as a series with a Double logical type (because they belong to X_numeric.columns) led to another large fit-time improvement.
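(For illustration: assigning an imputed column back with an explicit Double logical type means Woodwork skips type inference for it; the column values here are made up.)

import pandas as pd
from woodwork import init_series

imputed_col = pd.Series([1.0, 2.0, 2.5])

# Explicit logical type, so no inference pass is needed.
col = init_series(imputed_col, logical_type="Double")
print(col.ww.logical_type)  # Double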

else:
y_ww = y
return self.fit(X_ww, y_ww).transform(X_ww, y_ww)
return self.fit(X, y).transform(X, y)
ParthivNaresh (Contributor Author):
infer_feature_types and checks for y were already being made in fit and transform, so this felt superfluous.

for null_col in X.ww.select(IntegerNullable).columns
},
)
X.ww.init(schema=X.ww.schema)
ParthivNaresh (Contributor Author):
Keeping X.ww.init(schema=X.ww.schema) inside the conditional led to a large improvement in fit time
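(A sketch of the guarded pattern, using Woodwork's set_types as a stand-in for the PR's X.ww.init(schema=X.ww.schema) call; the point is that the costly schema update only runs when nullable columns were actually found.)

import pandas as pd
import woodwork as ww

X = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64"), "b": [1.0, 2.0, 3.0]})
X.ww.init()

int_nullable_cols = list(X.ww.select("IntegerNullable").columns)
if int_nullable_cols:
    # Only this branch pays for the schema update; frames with no
    # nullable integer columns skip it entirely.
    X.ww.set_types(logical_types={col: "Double" for col in int_nullable_cols})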

X = infer_feature_types(X)
if not isinstance(X, pd.DataFrame):
X = infer_feature_types(X)
X = X.select_dtypes(exclude=["datetime"])
ParthivNaresh (Contributor Author):
Saves a call to infer_feature_types since the call is also made in fit and transform

bchen1116 (Contributor) left a comment:

Leaving a few questions/suggestions for now, will come back to finish review later!

Resolved review threads:
evalml/pipelines/classification_pipeline.py
evalml/tests/automl_tests/parallel_tests/test_cf_engine.py (outdated)
evalml/tests/automl_tests/test_automl.py (outdated)
eccabay (Contributor) left a comment:

Looks pretty good, just had a few questions/comments but I don't think anything's blocking. Not looking forward to handling the merge conflicts with my hardening PR, haha

Comment on lines +72 to +78
# TODO: Added this in because numpy's unique() does not support pandas.NA
try:
self._classes_ = list(ww.init_series(np.unique(y)))
except TypeError as e:
if "boolean value of NA is ambiguous" in str(e):
self._classes_ = y.unique()

eccabay (Contributor) commented Sep 9, 2022:

Does y.unique() work in all cases? I'd rather keep things simpler if possible.

ParthivNaresh (Contributor Author):
I think this was added just to replace np.unique in cases of pd.NA; otherwise the behaviour should be the same. @chukarsten, was that the intention?
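(A small reproduction of the failure mode being worked around, at the pandas/numpy versions of the time.)

import numpy as np
import pandas as pd

y = pd.Series([True, False, pd.NA], dtype="boolean")

try:
    np.unique(y)
except TypeError as err:
    # np.unique sorts its input, and ordering comparisons against
    # pd.NA raise "boolean value of NA is ambiguous".
    print(err)

print(y.unique())  # [True, False, <NA>] -- pandas handles NA directly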

Comment on lines +177 to 179
X = downcast_int_nullable_to_double(X)
X = X.fillna(X.mean())
X, y = self._manage_woodwork(X, y)
Contributor:
We don't have a guarantee here that X has woodwork information until after the call to manage_woodwork. Was this call added to work with the fillna or something later on?

ParthivNaresh (Contributor Author):
Yes, this call was added to work with fillna, but also more generally because ARIMA won't accept Int64 dtypes.

Contributor:
Gotcha. It may be safer to move the call to manage_woodwork above this regardless, if I'm not missing something.

ParthivNaresh (Contributor Author):
I'm hesitant about moving it to manage_woodwork because that affects every instance where it's called. But I believe we're in the process of unifying nullable and non-nullable types, so this won't be an issue soon.

Contributor:
Sorry sorry, I just meant calling _manage_woodwork before calling downcast_int_nullable_to_double within this file, not within the _manage_woodwork function itself! Apologies if that was unclear.

Resolved review thread: evalml/tests/component_tests/test_time_series_imputer.py (outdated)
two_distinct_with_nulls_y_ww.ww.init()
two_distinct_with_nulls_y_ww = ww.init_series(two_distinct_with_nulls_y_ww)
Contributor:
What's the difference between these two functions? Why use one vs the other?

ParthivNaresh (Contributor Author) commented Sep 12, 2022:

This is because series.ww.init() doesn't actually transform the underlying logical type, since WoodworkColumnAccessor is being called rather than WoodworkTableAccessor (this is a series, not a DataFrame).

Per here: “In cases where the LogicalType requires the Series dtype to change, a helper function ww.init_series must be used. This function will return a new Series object with Woodwork initialized and the dtype of the series changed to match the physical type of the LogicalType.”

Since the series is originally float due to the null present, it needs its physical type changed to match IntegerNullable's physical dtype of Int64.
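(A minimal demonstration of the distinction.)

import pandas as pd
import woodwork as ww

# Physical type is float64, because of the null.
y = pd.Series([1.0, 2.0, None])

# The column accessor's init() can't change the underlying dtype;
# ww.init_series returns a NEW series whose dtype matches the
# requested logical type.
y_int = ww.init_series(y, logical_type="IntegerNullable")
print(y.dtype)      # float64
print(y_int.dtype)  # Int64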

Resolved review threads:
evalml/tests/integration_tests/test_nullable_types.py (outdated)
evalml/utils/woodwork_utils.py (outdated)
return X


def downcast_int_nullable_to_double(X):
Contributor:
Why add a separate function here? Isn't this just a subset of the downcast_nullable_types behavior?

ParthivNaresh (Contributor Author):
These two functions serve different use cases currently; I've filed an issue here!

Comment on lines +561 to +563
X_df = X_df.astype(
{"int col": float},
) # Convert to float as the imputer will do this as we're requesting the mean
Contributor:
We shouldn't need this casting; median is fine staying as an integer type!

ParthivNaresh (Contributor Author):
I believe median can end up imputing a float between integer values, so to play it safe, it's cast as a float in the SimpleImputer for both mean and median.
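(A one-liner shows why: with an even number of observations, the median interpolates between the two middle values.)

import pandas as pd

print(pd.Series([1, 2, 3, 4]).median())  # 2.5 -- not representable as an integer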

jeremyliweishih (Collaborator) left a comment:

LGTM - amazing work @ParthivNaresh

try:
np_series = np_series.astype("int64")
except TypeError:
np_series = ww_series.astype(float).to_numpy()
Collaborator:
is this used in case there are NaNs?

ParthivNaresh (Contributor Author):
I think so, but @chukarsten might know better.
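(A quick check of the behavior in question, mirroring the variable names above: casting an object array containing pd.NA to int64 raises, so the float fallback covers missing values.)

import pandas as pd

ww_series = pd.Series([1, 2, pd.NA], dtype="Int64")
np_series = ww_series.to_numpy()  # object array holding pd.NA

try:
    np_series = np_series.astype("int64")
except TypeError:
    # pd.NA has no integer conversion; going through float turns
    # the missing value into NaN instead.
    np_series = ww_series.astype(float).to_numpy()

print(np_series)  # [ 1.  2. nan]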

new_ltypes_for_bool_nullable_cols = {
col: "Boolean" for col in X_bool_nullable_cols
}
X_int_nullable_cols = X_schema._filter_cols(include=["IntegerNullable"])
Collaborator:
should we just use the downcasting function here?

ParthivNaresh (Contributor Author):
It seems that there's some issue with booleans being cast to floats when using downcast_nullable_types. I filed this so we can get some consistency in behaviour using a single downcast function.

ParthivNaresh merged commit 034068a into main on Sep 13, 2022.
ParthivNaresh deleted the woodwork_compat_0_18_0 branch on September 13, 2022 at 13:27.
chukarsten mentioned this pull request on Sep 20, 2022.
Successfully merging this pull request may close these issues.

Re-enable Support for Dask/Concurrent Futures Engines