Woodwork 0.17.2 compatibility :( #3626
Conversation
a253784 to a3aa645
Codecov Report
@@ Coverage Diff @@
## main #3626 +/- ##
=======================================
- Coverage 99.7% 99.7% -0.0%
=======================================
Files 335 335
Lines 33750 33839 +89
=======================================
+ Hits 33627 33714 +87
- Misses 123 125 +2
…ble types temporary change to ClassificationPipeline.
…_nullable_types_builds_pipelines.
…uter to, column by column, use the woodwork accessor to set the schema.
…x/evalml into latest-dep-update-d117391
Awesome work man, this was truly next level finessing. I think some outstanding questions/issues are:
- Why is the `ClassImbalanceDataCheck` potentially casting `2.0` to `2`?
- Issue has been filed to keep pseudo floats like `2.0, 12.0, -4.0`, etc. as `Double` instead of `Integer` or `IntegerNullable`.
- Issue has been filed for support of multiple column assignments.
- Issue has been filed for no exception when casting to a `Boolean`.
- Issue has been filed to match `ww.init()` behaviour across DataFrames and Series.
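The pseudo-float concern can be reproduced with plain pandas, independent of Woodwork (this is a sketch of the underlying dtype re-inference, not the `ClassImbalanceDataCheck` code itself): re-inference promotes integral-valued floats to a nullable integer dtype.

```python
import pandas as pd

# Floats that happen to be whole numbers get re-inferred as nullable
# integers, which is the "2.0 cast to 2" behaviour questioned above.
s = pd.Series([2.0, 12.0, -4.0])
converted = s.convert_dtypes()
print(converted.dtype)  # Int64, not Float64
```

This is pandas' documented `convert_dtypes` behaviour: when `convert_integer` is enabled and the floats cast faithfully to integers, the column becomes an integer extension type.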
@@ -161,7 +161,8 @@ def transform(self, X, y=None):
        if self._numeric_cols is not None and len(self._numeric_cols) > 0:
            X_numeric = X.ww[self._numeric_cols.tolist()]
            imputed = self._numeric_imputer.transform(X_numeric)
-           X_no_all_null[X_numeric.columns] = imputed
+           for numeric_col in X_numeric.columns:
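The diff replaces one bulk multi-column assignment with a per-column loop. A minimal sketch of the pattern (variable names are illustrative, not copied from the PR):

```python
import pandas as pd

X_no_all_null = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
imputed = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Assign one column at a time instead of
# X_no_all_null[imputed.columns] = imputed; per-column assignment
# sidesteps the multiple-column-assignment issue filed against Woodwork.
for col in imputed.columns:
    X_no_all_null[col] = imputed[col]

print(X_no_all_null.dtypes.tolist())
```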
I've filed it
            imputed.bfill(inplace=True)  # Fill in the first value, if missing
            X_not_all_null[X_interpolate.columns] = imputed
            X_not_all_null.ww.init(schema=X_schema)
Reinitializes the dataframe with the original schema, excluding `IntegerNullable` and `BooleanNullable` types, so that they can be reinferred post imputation.
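The interpolate-then-backfill step above can be sketched with plain pandas (toy data; the real code operates on the Woodwork-initialized frame):

```python
import pandas as pd

# Interpolation leaves a leading value missing because there is no
# earlier point to interpolate from; a backfill fixes exactly that gap.
s = pd.Series([None, 2.0, None, 6.0])
imputed = s.interpolate()  # first value is still NaN
imputed = imputed.bfill()  # fill in the first value, if missing
print(imputed.tolist())  # [2.0, 2.0, 4.0, 6.0]
```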
            y_imputed.bfill(inplace=True)
            y_imputed.ww.init(schema=y.ww.schema)

            X_not_all_null.ww.init(schema=X_schema)
Covered as part of `test_numeric_only_input` and `test_imputer_bool_dtype_object`.
@@ -311,5 +312,8 @@ def transform(self, X, y=None):

        if cleaned_y is not None:
            cleaned_y = cleaned_y["target"]
            cleaned_y = ww.init_series(cleaned_y)

        cleaned_x.ww.init()
Introduction of nulls makes initialization necessary here
@@ -380,3 +381,27 @@ def make_balancing_dictionary(y, sampling_ratio):
            # this class is already larger than the ratio, don't change
            class_dic[index] = value_counts[index]
    return class_dic


def downcast_int_nullable_to_double(X):
A function that helps with some components not accepting an `IntegerArray`, or being unable to cast values from a `float` to an `int`.
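A hypothetical sketch of what such a downcast can look like (the actual helper in the PR may differ): nullable `Int64` columns become plain `float64`, with `pd.NA` mapped to `NaN`, so components that can't handle `IntegerArray` still work.

```python
import pandas as pd

def downcast_int_nullable_to_double(X):
    """Cast nullable Int64 columns to float64 (illustrative version)."""
    X = X.copy()
    for col in X.columns:
        if str(X[col].dtype) == "Int64":
            # IntegerArray.astype maps pd.NA to np.nan
            X[col] = X[col].astype("float64")
    return X

df = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64")})
out = downcast_int_nullable_to_double(df)
print(out["a"].dtype)  # float64
```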
@@ -88,14 +90,14 @@ def test_numeric_only_input(imputer_test_data):
    expected = pd.DataFrame(
        {
            "int col": [0, 1, 2, 0, 3] * 4,
-           "float col": [0.0, 1.0, 0.0, -2.0, 5.0] * 4,
+           "float col": [0.1, 1.0, 0.0, -2.0, 5.0] * 4,
Filed this
@@ -1982,7 +1982,7 @@ def imputer_test_data():
        ),
        "int col": [0, 1, 2, 0, 3] * 4,
        "object col": ["b", "b", "a", "c", "d"] * 4,
-       "float col": [0.0, 1.0, 0.0, -2.0, 5.0] * 4,
+       "float col": [0.1, 1.0, 0.0, -2.0, 5.0] * 4,
Filed this
…evalml into ww_0.17.0_compatibility
LGTM! Just left some small efficiency and clarification questions.
@@ -4016,7 +4016,7 @@ def test_automl_baseline_pipeline_predictions_and_scores_time_series(problem_typ
    expected_predictions = pd.Series(expected_predictions, name="target_delay_1")

    preds = baseline.predict(X_validation, None, X_train, y_train)
-   pd.testing.assert_series_equal(expected_predictions, preds)
+   pd.testing.assert_series_equal(expected_predictions, preds, check_dtype=False)
This worries me slightly - are there any scenarios where this would cause us issues down the road?
evalml/pipelines/components/transformers/samplers/base_sampler.py (outdated; resolved)
@@ -288,6 +288,7 @@ def transform(self, X, y=None):
        delayed_features = self._compute_delays(X_ww, y)
        rolling_means = self._compute_rolling_transforms(X_ww, y, original_features)
        features = ww.concat_columns([delayed_features, rolling_means])
+       features.ww.init()
Can we reuse any part of the initial schema or use what we know about the dtypes of these features here to reduce the amount of type reinference this might introduce?
@@ -66,7 +66,16 @@ def fit(self, X, y):
        )

        self._fit(X, y)
        self._classes_ = list(ww.init_series(np.unique(y)))

        # TODO: Added this in because numpy's unique() does not support pandas.NA
If there's a workaround for this error, why do we start off by attempting to use numpy? Are there downsides to just using `y.unique()` in all cases instead?
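The failure mode behind the TODO can be demonstrated in isolation (a sketch, not the estimator code): `np.unique` sorts its input, and comparisons involving `pandas.NA` cannot be coerced to a boolean, while `Series.unique` does not sort and handles NA fine.

```python
import numpy as np
import pandas as pd

y = pd.Series([1, 2, 2, pd.NA], dtype="Int64")

# np.unique sorts the values; ordering comparisons against pandas.NA
# raise because the boolean value of NA is ambiguous.
try:
    np.unique(y.to_numpy())
except TypeError as err:
    print("np.unique failed:", err)

# Series.unique works and keeps <NA> as a distinct value.
uniques = y.unique()
print(len(uniques))  # 3 -> 1, 2, <NA>
```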
@chukarsten this looks good to me! Apologies for all the "did we file an issue" comments. Just wanted to make sure we're keeping track of removing these fixes once the appropriate WW changes are in. 😄
@@ -66,7 +66,16 @@ def fit(self, X, y):
        )

        self._fit(X, y)
        self._classes_ = list(ww.init_series(np.unique(y)))

        # TODO: Added this in because numpy's unique() does not support pandas.NA
do we have an issue filed to resolve this?
@@ -165,7 +166,10 @@ def transform(self, X, y=None):

        if self._interpolate_cols is not None:
            X_interpolate = X.ww[self._interpolate_cols]
            imputed = X_interpolate.interpolate()
            # TODO: Revert when pandas introduces Float64 dtype
do we have an issue filed to track this?
@@ -4016,7 +4016,7 @@ def test_automl_baseline_pipeline_predictions_and_scores_time_series(problem_typ
    expected_predictions = pd.Series(expected_predictions, name="target_delay_1")

    preds = baseline.predict(X_validation, None, X_train, y_train)
-   pd.testing.assert_series_equal(expected_predictions, preds)
+   pd.testing.assert_series_equal(expected_predictions, preds, check_dtype=False)
A little confused here - is `preds` coming out as integers, and if so, why?
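What `check_dtype=False` changes can be shown with toy series (illustrative data; the real test compares time-series baseline predictions): values must still match, but an int64-vs-float64 mismatch no longer fails the assertion.

```python
import pandas as pd

expected = pd.Series([1.0, 2.0, 3.0], name="target_delay_1")
preds = pd.Series([1, 2, 3], name="target_delay_1")  # e.g. inferred as int64

# Without check_dtype=False this raises because float64 != int64;
# with it, only the values and the name are compared.
pd.testing.assert_series_equal(expected, preds, check_dtype=False)
print("values equal despite dtype difference")
```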
@@ -328,6 +343,7 @@ def test_delayed_feature_extractor_numpy(mock_roll, delayed_features_data):
            "target_delay_11": y_answer.shift(11),
        },
    )
+   answer_only_y.ww.init()
Was the new `ww.init` call in `TimeSeriesFeaturizer` in response to this? If not, should we file an issue to cover this?
* Revert "Woodwork 0.17.2 compatibility :( (#3626)". This reverts commit 12ec98e.
* Updated the release and pandas version.

Co-authored-by: Karsten Chu <[email protected]>
All the work required to get EvalML compatible with Woodwork 0.17.2.