Woodwork 0.17.2 compatibility :( #3626

Merged 109 commits on Aug 10, 2022
Commits
c41d1de
Initial.
chukarsten Jul 24, 2022
3b18083
Fixed test_datetime_featurizer_with_inconsistent_date_format.
chukarsten Jul 25, 2022
97157e0
Fixed test_drop_nan_rows_transformer.py test_drop_null_columns_transf…
chukarsten Jul 25, 2022
6dfd577
Fixed test_drop_null_columns_transformer.py.
chukarsten Jul 25, 2022
5bbf65d
Fixed test_drop_null_columns_transformer.py
chukarsten Jul 25, 2022
70d038c
Fixed test_imputer.py
chukarsten Jul 25, 2022
43ece37
Fixed test_fit_transform_drop_all_nan_columns
chukarsten Jul 25, 2022
638705d
Updated to Woodwork 0.17.0
chukarsten Jul 25, 2022
4e47f78
Fixed test_per_column_imputer.py
chukarsten Jul 25, 2022
a3aa645
Fixed imputers.
chukarsten Jul 27, 2022
7cdeb1d
Fixed test_automl.py
chukarsten Jul 27, 2022
b50ad59
Adjusted test_imputer.py to match the imputer test data.
chukarsten Jul 27, 2022
c0a7566
Adjusted test_imputer.py to match the imputer test data.
chukarsten Jul 27, 2022
ca24080
Fixed test_simple_imputer.py again.
chukarsten Jul 27, 2022
a55ec4f
Release.
chukarsten Jul 27, 2022
e294000
Latest deps and conda.
chukarsten Jul 27, 2022
c69ec06
Fixed test_target_imputer.py
chukarsten Jul 27, 2022
263b9a0
Fixed test_time_series_featurizer.py
chukarsten Jul 28, 2022
54ec0cc
Fixed the time_series_imputer some more.
chukarsten Jul 28, 2022
625681b
Fixed test_class_imabalance_data_check
chukarsten Jul 28, 2022
bbb6820
Fixed test_data_checks.
chukarsten Jul 29, 2022
2a92d38
Temporarily addressed test_no_variance_data_check.
chukarsten Jul 29, 2022
79cb5c7
Fixed test_data_checks_and_actions_integration which involved a nulla…
chukarsten Jul 29, 2022
3c1ebee
Fixed test_data_checks_and_actions_integration.
chukarsten Jul 29, 2022
3bc8191
Removed the differentiation between pandas and woodwork tests in test…
chukarsten Jul 29, 2022
a6db357
Fixed the test_email_url_whatever...involved having to modify the Imp…
chukarsten Aug 1, 2022
573fc6f
Modified the target_imputer.py to properly transform the target dtypes.
chukarsten Aug 2, 2022
e6363aa
Fixed test_graph_roc_curve_nans
chukarsten Aug 2, 2022
35435f8
Merge branch 'main' into ww_0.17.0_compatibility
ParthivNaresh Aug 3, 2022
dc61825
time series imputer changes
ParthivNaresh Aug 3, 2022
b661aab
Merge branch 'main' into ww_0.17.0_compatibility
ParthivNaresh Aug 3, 2022
d261e71
expand exclusion ltypes for standard scaler
ParthivNaresh Aug 3, 2022
b95b3d6
Merge branch 'ww_0.17.0_compatibility' of https://github.com/alteryx/…
ParthivNaresh Aug 3, 2022
33b8a9e
Changes to the standard_scaler and base_sampler to fix test_can_run_a…
chukarsten Aug 4, 2022
ea97418
Merge branch 'ww_0.17.0_compatibility' of https://github.com/alteryx/…
chukarsten Aug 4, 2022
7852c11
Update test_nullable_types.py
chukarsten Aug 4, 2022
3ebbafd
lint fix
ParthivNaresh Aug 4, 2022
0341777
lint fixes and no variance test
ParthivNaresh Aug 4, 2022
50ccf33
Relint
chukarsten Aug 4, 2022
8ebcb48
Disabled some parallel tests to jibe with Woodwork 0.17.x. These nee…
chukarsten Aug 4, 2022
798d2e7
Fixed test_datetime_featurizer_with_inconsistent_date_format.
chukarsten Jul 25, 2022
ee88e89
Fixed test_drop_nan_rows_transformer.py test_drop_null_columns_transf…
chukarsten Jul 25, 2022
dd21c9a
Fixed test_drop_null_columns_transformer.py.
chukarsten Jul 25, 2022
e9aec48
Fixed test_drop_null_columns_transformer.py
chukarsten Jul 25, 2022
0b0cd59
Fixed test_imputer.py
chukarsten Jul 25, 2022
dde9c77
Fixed test_fit_transform_drop_all_nan_columns
chukarsten Jul 25, 2022
b94bb37
Updated to Woodwork 0.17.0
chukarsten Jul 25, 2022
63c3168
Fixed test_per_column_imputer.py
chukarsten Jul 25, 2022
7a8ba71
Fixed imputers.
chukarsten Jul 27, 2022
664e4f9
Fixed test_automl.py
chukarsten Jul 27, 2022
9acd1bc
Adjusted test_imputer.py to match the imputer test data.
chukarsten Jul 27, 2022
9db71c0
Adjusted test_imputer.py to match the imputer test data.
chukarsten Jul 27, 2022
7d78f9f
Fixed test_simple_imputer.py again.
chukarsten Jul 27, 2022
dda2b05
Release.
chukarsten Jul 27, 2022
2db0718
Latest deps and conda.
chukarsten Jul 27, 2022
2a1b8b8
Fixed test_target_imputer.py
chukarsten Jul 27, 2022
ec47fc2
Fixed test_time_series_featurizer.py
chukarsten Jul 28, 2022
426ed33
Fixed the time_series_imputer some more.
chukarsten Jul 28, 2022
21d987d
Fixed test_class_imabalance_data_check
chukarsten Jul 28, 2022
3cde5c4
Fixed test_data_checks.
chukarsten Jul 29, 2022
7b81eb3
Temporarily addressed test_no_variance_data_check.
chukarsten Jul 29, 2022
187d7ff
Fixed test_data_checks_and_actions_integration which involved a nulla…
chukarsten Jul 29, 2022
9c64c4c
Fixed test_data_checks_and_actions_integration.
chukarsten Jul 29, 2022
9b5d73c
Removed the differentiation between pandas and woodwork tests in test…
chukarsten Jul 29, 2022
3d6cc07
Fixed the test_email_url_whatever...involved having to modify the Imp…
chukarsten Aug 1, 2022
b42dff0
Modified the target_imputer.py to properly transform the target dtypes.
chukarsten Aug 2, 2022
c23443e
Fixed test_graph_roc_curve_nans
chukarsten Aug 2, 2022
8227596
Changes to the standard_scaler and base_sampler to fix test_can_run_a…
chukarsten Aug 4, 2022
cd9ff14
time series imputer changes
ParthivNaresh Aug 3, 2022
2812e99
expand exclusion ltypes for standard scaler
ParthivNaresh Aug 3, 2022
e920710
Update test_nullable_types.py
chukarsten Aug 4, 2022
e029d15
Relint
chukarsten Aug 4, 2022
4f8eb17
Disabled some parallel tests to jibe with Woodwork 0.17.x. These nee…
chukarsten Aug 4, 2022
654ca9b
Lint.
chukarsten Aug 4, 2022
a8039a6
Merge branch 'ww_0.17.0_compatibility' of https://github.com/alteryx/…
chukarsten Aug 4, 2022
1ec5aee
Lint.
chukarsten Aug 4, 2022
7f27e49
Latest depts.
chukarsten Aug 4, 2022
8dbf0d5
Merge branch 'main' into ww_0.17.0_compatibility
chukarsten Aug 5, 2022
3c734b0
Lint and other mistakes.
chukarsten Aug 5, 2022
c5fc0b5
Marked all test_automl_dask as pytest.mark.xfail.
chukarsten Aug 5, 2022
3904a3c
Ugh
chukarsten Aug 5, 2022
295e701
I'm so tired.
chukarsten Aug 5, 2022
6c5bd5d
xfailed a few more parallel tests.
chukarsten Aug 5, 2022
6132cb2
Changed pytest marks.
chukarsten Aug 5, 2022
71a367c
Update latest dependencies
github-actions[bot] Aug 5, 2022
649bc33
xfailed another parallel test.
chukarsten Aug 5, 2022
82ae057
changes to estimators for lack of Int64 support
ParthivNaresh Aug 5, 2022
c325c39
Merge branch 'ww_0.17.0_compatibility' of https://github.com/alteryx/…
ParthivNaresh Aug 5, 2022
4579127
Merge branch 'ww_0.17.0_compatibility' of https://github.com/alteryx/…
chukarsten Aug 5, 2022
7325a4d
some fixes
ParthivNaresh Aug 5, 2022
1b7434e
Merge branch 'ww_0.17.0_compatibility' of https://github.com/alteryx/…
ParthivNaresh Aug 5, 2022
b95207a
fix t_sne tests
ParthivNaresh Aug 5, 2022
969eba3
Update latest dependencies
github-actions[bot] Aug 5, 2022
dd1346b
dep scikit
ParthivNaresh Aug 5, 2022
860ba3a
test fix
ParthivNaresh Aug 5, 2022
ce7af81
Merge branch 'latest-dep-update-d117391' of https://github.com/altery…
ParthivNaresh Aug 5, 2022
e3649ad
Merge branch 'latest-dep-update-d117391' into ww_0.17.0_compatibility
ParthivNaresh Aug 5, 2022
dff1c4a
fix docs
ParthivNaresh Aug 5, 2022
f324962
update min woodwork
ParthivNaresh Aug 5, 2022
49f2d5d
update woodwork version
ParthivNaresh Aug 5, 2022
2412b2c
test coverage
ParthivNaresh Aug 5, 2022
64a63ee
lint
ParthivNaresh Aug 5, 2022
022e4f7
Merge branch 'main' into ww_0.17.0_compatibility
ParthivNaresh Aug 5, 2022
8962f70
Merge branch 'main' into ww_0.17.0_compatibility
ParthivNaresh Aug 5, 2022
d8bd307
update release notes and ts imputer test
ParthivNaresh Aug 7, 2022
c86479c
Merge branch 'ww_0.17.0_compatibility' of https://github.com/alteryx/…
ParthivNaresh Aug 7, 2022
e7d6ff0
Swapped ww init with infer_feature_types.
chukarsten Aug 8, 2022
57366ea
Updated base_sampler to pass the current schema forward.
chukarsten Aug 8, 2022
69156a1
Merge branch 'main' into ww_0.17.0_compatibility
chukarsten Aug 9, 2022
2 changes: 1 addition & 1 deletion .github/meta.yaml
@@ -37,7 +37,7 @@ outputs:
- requirements-parser >=0.2.0
- shap >=0.40.0
- texttable >=1.6.2
- woodwork >=0.16.2, < 0.17.0
- woodwork >=0.17.0
- featuretools>=1.7.0
- nlp-primitives>=2.1.0,!=2.6.0
- python >=3.8.*
2 changes: 1 addition & 1 deletion core-requirements.txt
@@ -11,7 +11,7 @@ requirements-parser>=0.2.0
shap>=0.40.0
statsmodels>=0.12.2
texttable>=1.6.2
woodwork>=0.16.2, < 0.17.0
woodwork>=0.17.0
dask>=2021.10.0
nlp-primitives>=2.1.0,!=2.6.0
featuretools>=1.7.0
18 changes: 13 additions & 5 deletions docs/source/release_notes.rst
@@ -1,6 +1,19 @@
Release Notes
-------------
**Future Releases**
* Enhancements
* Updated to run with Woodwork >= 0.17.0 :pr:`3626`
* Fixes
* Changes
* Documentation Changes
* Testing Changes

.. warning::

**Breaking Changes**


**v0.55.0 July. 24, 2022**
* Enhancements
* Increased the amount of logical type information passed to Woodwork when calling ``ww.init()`` in transformers :pr:`3604`
* Added ability to log how long each batch and pipeline take in ``automl.search()`` :pr:`3577`
@@ -13,16 +26,11 @@ Release Notes
* Bump minimum scikit-optimize version to 0.9.0 `:pr:`3614`
* Changes
* Add pre-commit hooks for linting :pr:`3608`
* Documentation Changes
* Testing Changes
* Pinned GraphViz version for Windows CI Test :pr:`3596`
* Removed ``pytest.mark.skip_if_39`` pytest marker :pr:`3602` :pr:`3607`
* Refactored test cases that iterate over all components to use ``pytest.mark.parametrise`` and changed the corresponding ``if...continue`` blocks to ``pytest.mark.xfail`` :pr:`3622`

.. warning::

**Breaking Changes**


**v0.54.0 June. 23, 2022**
* Fixes
2 changes: 1 addition & 1 deletion evalml/__init__.py
@@ -23,4 +23,4 @@
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

__version__ = "0.54.0"
__version__ = "0.55.0"
11 changes: 10 additions & 1 deletion evalml/pipelines/classification_pipeline.py
@@ -66,7 +66,16 @@ def fit(self, X, y):
)

self._fit(X, y)
self._classes_ = list(ww.init_series(np.unique(y)))

# TODO: Added this in because numpy's unique() does not support pandas.NA
Contributor Author

This is a temporary addition due to lack of nullable types support within numpy.

Collaborator

Do we have an issue filed to resolve this?

Contributor

If there's a workaround for this error, why do we start off by attempting to use numpy? Are there downsides to just using y.unique() in all cases instead?

Contributor Author

This is tied to this: #3649

try:
self._classes_ = list(ww.init_series(np.unique(y)))
except TypeError as e:
if "boolean value of NA is ambiguous" in str(e):
self._classes_ = y.unique()
except Exception as e:
raise e

return self

def _encode_targets(self, y):
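A rough illustration of the question raised in the thread above, not the code in this PR: np.unique returns its result sorted, while Series.unique keeps values in order of first appearance, so swapping one for the other can change the class ordering unless the result is sorted explicitly. The sketch below is pandas-only and assumes null labels should be dropped; `unique_classes` is a hypothetical helper name, not an evalml function.

```python
import pandas as pd


def unique_classes(y):
    """Return the sorted, distinct, non-null labels of y.

    Hypothetical helper for illustration only; not part of evalml.
    """
    y = pd.Series(y)
    # Series.unique() copes with pd.NA but keeps first-appearance order,
    # so drop nulls first and sort to mimic np.unique's ordering.
    return sorted(y.dropna().unique())


# Assumed example target using the nullable Int64 dtype.
nullable_target = pd.Series([3, 1, pd.NA, 1, 2], dtype="Int64")
classes = unique_classes(nullable_target)
```

Whether dropping nulls is the right behavior for a classification target is a design choice this sketch does not settle.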
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
"""Component that imputes missing data according to a specified timeseries-specific imputation strategy."""
import pandas as pd
import woodwork as ww

from evalml.pipelines.components.transformers import Transformer
from evalml.utils import infer_feature_types
@@ -165,7 +166,10 @@ def transform(self, X, y=None):

if self._interpolate_cols is not None:
X_interpolate = X.ww[self._interpolate_cols]
imputed = X_interpolate.interpolate()
# TODO: Revert when pandas introduces Float64 dtype
Contributor Author

Pandas is working on a Float64 datatype to go hand in hand with the Int64 nullable integers and nullable booleans. When that becomes a thing, we can get rid of this, as Woodwork will probably infer Float64 as it does the other nullable types.

Collaborator

Do we have an issue filed to track this?

imputed = X_interpolate.astype(
float,
).interpolate() # Cast to float because Int64 not handled
Contributor Author

The problem here is that pandas' interpolate won't run on the new nullable integer dtype. They are tracking this, and I commented on the relevant issue: pandas-dev/pandas#40252.

imputed.bfill(inplace=True) # Fill in the first value, if missing
X_not_all_null[X_interpolate.columns] = imputed

@@ -178,10 +182,9 @@ def transform(self, X, y=None):
y_imputed = y.bfill()
y_imputed.pad(inplace=True)
elif self._impute_target == "interpolate":
y_imputed = y.interpolate()
# TODO: Revert when pandas introduces Float64 dtype
y_imputed = y.astype(float).interpolate()
y_imputed.bfill(inplace=True)
y_imputed.ww.init(schema=y.ww.schema)

X_not_all_null.ww.init(schema=X_schema)
Contributor Author

Had to get rid of this because, after the cast to float and the interpolation, it was trying to overwrite the new float dtype with the original Int64 dtype. We might need to add some testing for this...

Contributor

Covered as part of test_numeric_only_input and test_imputer_bool_dtype_object

y_imputed = ww.init_series(y_imputed)

return X_not_all_null, y_imputed
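The float round trip used in this hunk can be sketched in isolation with plain pandas. This is only an illustration of the idea under assumed sample data; `interpolate_nullable` is a hypothetical name and the real component also handles schemas and column selection.

```python
import pandas as pd


def interpolate_nullable(series):
    """Linearly interpolate a nullable-integer Series via a float cast.

    Hypothetical helper for illustration only; not part of evalml.
    """
    # Casting Int64 to float turns pd.NA into np.nan, which interpolate() handles.
    as_float = series.astype(float)
    filled = as_float.interpolate()
    # The default linear interpolation leaves a leading gap unfilled,
    # so back-fill the first value if it is missing.
    return filled.bfill()


# Assumed example target with a missing first value and an interior gap.
y = pd.Series([pd.NA, 2, pd.NA, 4], dtype="Int64")
y_imputed = interpolate_nullable(y)
```

Note that the result is float64 rather than Int64, which is why re-applying the original schema afterwards would conflict with the new dtype.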
Original file line number Diff line number Diff line change
@@ -288,6 +288,7 @@ def transform(self, X, y=None):
delayed_features = self._compute_delays(X_ww, y)
rolling_means = self._compute_rolling_transforms(X_ww, y, original_features)
features = ww.concat_columns([delayed_features, rolling_means])
features.ww.init()
Contributor Author

What was happening here was that the delayed_features were half np.NaN and half pd.NA. Re-init'ing standardized the columns.

Contributor

Can we reuse any part of the initial schema or use what we know about the dtypes of these features here to reduce the amount of type reinference this might introduce?

return features.ww.drop(original_features)

def fit_transform(self, X, y=None):
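To make the mixed-null situation from the comment above concrete, here is a small pandas-only sketch. It swaps in an explicit cast for the Woodwork re-initialization used in the PR, so it is a stand-in for the idea rather than the actual fix; the sample values are assumptions.

```python
import numpy as np
import pandas as pd

# An object column holding both flavors of "missing": np.nan and pd.NA.
mixed = pd.Series([1, np.nan, pd.NA, 4], dtype="object")

# Casting to the nullable integer dtype maps both markers to pd.NA,
# leaving a single, consistent missing-value representation.
unified = mixed.astype("Int64")
```

An explicit cast like this is one possible answer to the reviewer's question about avoiding full type re-inference when the target dtype is already known.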
2 changes: 1 addition & 1 deletion evalml/tests/automl_tests/test_automl.py
@@ -4016,7 +4016,7 @@ def test_automl_baseline_pipeline_predictions_and_scores_time_series(problem_typ
expected_predictions = pd.Series(expected_predictions, name="target_delay_1")

preds = baseline.predict(X_validation, None, X_train, y_train)
pd.testing.assert_series_equal(expected_predictions, preds)
pd.testing.assert_series_equal(expected_predictions, preds, check_dtype=False)
Contributor Author

The failure here was during a time series regression problem, where the predictions came out as integers, rather than floats. Given that it's a time series regression problem, I would expect the predictions to be floats.

Contributor

This worries me slightly - are there any scenarios where this would cause us issues down the road?

Collaborator

A little confused here: is preds coming out as integers, and if so, why?

if is_classification(problem_type):
pd.testing.assert_frame_equal(
expected_predictions_proba,
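For readers following the dtype discussion above, this sketch shows what check_dtype=False changes in pandas' own comparison helper. The series values are made up for illustration and are not the test's actual data.

```python
import pandas as pd

# Assumed sample data: same values, float expectation vs. integer predictions.
expected = pd.Series([1.0, 2.0, 3.0], name="target_delay_1")
preds = pd.Series([1, 2, 3], name="target_delay_1")

# The default comparison also checks dtypes, so float64 vs. int64 fails.
try:
    pd.testing.assert_series_equal(expected, preds)
    strict_passed = True
except AssertionError:
    strict_passed = False

# With check_dtype=False only the values, index and name are compared.
pd.testing.assert_series_equal(expected, preds, check_dtype=False)
relaxed_passed = True
```

The relaxed check keeps the test green but no longer catches an unexpected integer result, which is the concern raised by the reviewers.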
13 changes: 10 additions & 3 deletions evalml/tests/component_tests/test_datetime_featurizer.py
@@ -288,10 +288,17 @@ def test_datetime_featurizer_with_inconsistent_date_format():
answer = pd.DataFrame(
{
"numerical": [0] * len(dates),
"date col_year": [2021.0] * 18 + [np.nan] * 2,
"date col_month": [9.0] * 18 + [np.nan] * 2,
"date col_year": [2021] * 18 + [pd.NA] * 2,
Contributor Author

The new pandas nullable types return pd.NA

"date col_month": [9] * 18 + [pd.NA] * 2,
"date col_day_of_week": expected_dow,
"date col_hour": [0.0] * 18 + [np.nan] * 2,
"date col_hour": [0] * 18 + [pd.NA] * 2,
},
).astype(
dtype={
"date col_year": "Int64",
"date col_month": "Int64",
"date col_day_of_week": "Int64",
"date col_hour": "Int64",
},
)
pd.testing.assert_frame_equal(answer, expected)
Original file line number Diff line number Diff line change
@@ -16,8 +16,7 @@ def test_drop_rows_transformer():
X_expected = pd.DataFrame(
{"a column": [3], "another col": [6]},
index=[2],
dtype=np.float64,
)
).astype("Int64")
Contributor Author

Instead of returning the numpy version of nan, which is a float, woodwork inference returns the new Int64 nullable integer type.

drop_rows_transformer = DropNaNRowsTransformer()
drop_rows_transformer.fit(X)
transformed_X, _ = drop_rows_transformer.transform(X)
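The dtype change described in the comment above can be reproduced with pandas alone. The input frame here is an assumption chosen to match the expected row in the test, and the explicit astype stands in for what Woodwork inference does automatically.

```python
import numpy as np
import pandas as pd

# Assumed input: integer-valued columns with one missing value each.
X = pd.DataFrame({"a column": [np.nan, 2, 3], "another col": [4, np.nan, 6]})

# Plain pandas stores np.nan in float columns, so both columns are float64.
default_dtypes = [str(dtype) for dtype in X.dtypes]

# After dropping rows with nulls, the nullable Int64 dtype keeps the
# remaining values as integers instead of floats.
as_nullable = X.dropna().astype("Int64")
```

This is why the expected frame in the test moved from np.float64 to Int64.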
16 changes: 13 additions & 3 deletions evalml/tests/component_tests/test_drop_null_columns_transformer.py
@@ -45,7 +45,7 @@ def test_drop_null_transformer_transform_default_pct_null_threshold():
X = pd.DataFrame(
{"lots_of_null": [None, None, None, None, 5], "no_null": [1, 2, 3, 4, 5]},
)
X_expected = X.astype({"lots_of_null": "float64", "no_null": "int64"})
X_expected = X.astype({"lots_of_null": "Int64", "no_null": "int64"})
drop_null_transformer.fit(X)
X_t = drop_null_transformer.transform(X)
assert_frame_equal(X_expected, X_t)
@@ -94,7 +94,12 @@ def test_drop_null_transformer_transform_boundary_pct_null_threshold():
drop_null_transformer = DropNullColumns(pct_null_threshold=1.0)
drop_null_transformer.fit(X)
X_t = drop_null_transformer.transform(X)
assert_frame_equal(X_t, X.drop(["all_null"], axis=1))
assert_frame_equal(
X_t,
X.drop(columns=["all_null"]).astype(
{"some_null": "Int64", "lots_of_null": "Int64"},
),
)
# check that X is untouched
assert X.equals(
pd.DataFrame(
@@ -112,7 +117,7 @@ def test_drop_null_transformer_fit_transform():
X = pd.DataFrame(
{"lots_of_null": [None, None, None, None, 5], "no_null": [1, 2, 3, 4, 5]},
)
X_expected = X.astype({"lots_of_null": "float64", "no_null": "int64"})
X_expected = X.astype({"lots_of_null": "Int64", "no_null": "int64"})
X_t = drop_null_transformer.fit_transform(X)
assert_frame_equal(X_expected, X_t)

@@ -152,6 +157,11 @@ def test_drop_null_transformer_fit_transform():
"lots_of_null": [None, None, None, None, 5],
"some_null": [None, 0, 3, 4, 5],
},
).astype(
{
"lots_of_null": "Int64",
"some_null": "Int64",
},
)
drop_null_transformer = DropNullColumns(pct_null_threshold=1.0)
X_t = drop_null_transformer.fit_transform(X)
30 changes: 18 additions & 12 deletions evalml/tests/component_tests/test_imputer.py
@@ -8,9 +8,11 @@
from pandas.testing import assert_frame_equal
from woodwork.logical_types import (
Boolean,
BooleanNullable,
Categorical,
Double,
Integer,
IntegerNullable,
NaturalLanguage,
)

@@ -88,14 +90,14 @@ def test_numeric_only_input(imputer_test_data):
expected = pd.DataFrame(
{
"int col": [0, 1, 2, 0, 3] * 4,
"float col": [0.0, 1.0, 0.0, -2.0, 5.0] * 4,
"float col": [0.1, 1.0, 0.0, -2.0, 5.0] * 4,
Contributor Author

Both these changes resulted from woodwork (or pandas) reading the column and interpreting it as actually an integer, despite the decimals included. I know we've played fast and loose with this before, as I see many changes to tests in old PRs, like the one to upgrade to woodwork 0.31.0, that throw a decimal place in a column of integers. Might need to file an issue against WW to change this behavior and add testing.

Contributor

Filed this

"int with nan": [0.5, 1.0, 0.0, 0.0, 1.0] * 4,
"float with nan": [0.0, 1.0, 0, -1.0, 0.0] * 4,
"float with nan": [0.3, 1.0, 0.15, -1.0, 0.0] * 4,
},
)
assert_frame_equal(transformed, expected, check_dtype=False)

imputer = Imputer()
imputer = Imputer(numeric_impute_strategy="median")
transformed = imputer.fit_transform(X, y)
assert_frame_equal(transformed, expected, check_dtype=False)

@@ -158,14 +160,14 @@ def test_categorical_and_numeric_input(imputer_test_data):
),
"int col": [0, 1, 2, 0, 3] * 4,
"object col": pd.Series(["b", "b", "a", "c", "d"] * 4, dtype="category"),
"float col": [0.0, 1.0, 0.0, -2.0, 5.0] * 4,
"float col": [0.1, 1.0, 0.0, -2.0, 5.0] * 4,
"bool col": [True, False, False, True, True] * 4,
"categorical with nan": pd.Series(
["0", "1", "0", "0", "3"] * 4,
dtype="category",
),
"int with nan": [0.5, 1.0, 0.0, 0.0, 1.0] * 4,
"float with nan": [0.0, 1.0, 0, -1.0, 0.0] * 4,
"float with nan": [0.3, 1.0, 0.075, -1.0, 0.0] * 4,
"object with nan": pd.Series(
["b", "b", "b", "c", "b"] * 4,
dtype="category",
@@ -313,7 +315,7 @@ def test_imputer_fill_value(imputer_test_data):
["fill", "1", "0", "0", "3"] * 4,
dtype="category",
),
"float with nan": [0.0, 1.0, -1, -1.0, 0.0] * 4,
"float with nan": [0.3, 1.0, -1, -1.0, 0.0] * 4,
"object with nan": pd.Series(
["b", "b", "fill", "c", "fill"] * 4,
dtype="category",
@@ -512,7 +514,7 @@ def test_imputer_all_bool_return_original(data_type, make_data_type):
def test_imputer_bool_dtype_object(data_type, make_data_type):
X = pd.DataFrame([True, np.nan, False, np.nan, True] * 4)
y = pd.Series([1, 0, 0, 1, 0] * 4)
X_expected_arr = pd.DataFrame([True, True, False, True, True] * 4, dtype="category")
X_expected_arr = pd.DataFrame([True, True, False, True, True] * 4, dtype="boolean")
Contributor Author

This is the new woodwork behavior we would expect - the returning of the new boolean dtype, which is basically the standard bool but nullable.

X = make_data_type(data_type, X)
y = make_data_type(data_type, y)
imputer = Imputer()
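A pandas-only sketch of what a most-frequent fill looks like on the nullable boolean dtype. This is not evalml's Imputer, only an illustration of why the expected dtype in the test is now "boolean" instead of "category"; the data mirrors the test input above.

```python
import pandas as pd

# Same shape as the test input, stored with the nullable "boolean" dtype.
X = pd.Series([True, pd.NA, False, pd.NA, True] * 4, dtype="boolean")

# Find the most frequent non-null value using a plain bool view of the data.
most_frequent = bool(X.dropna().astype(bool).mode().iloc[0])

# Filling a "boolean" column keeps the nullable dtype; there is no
# fallback to object or category as there is with np.nan in a bool column.
imputed = X.fillna(most_frequent)
```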
@@ -537,7 +539,7 @@ def test_imputer_multitype_with_one_bool(data_type, make_data_type):
{
"bool with nan": pd.Series(
[True, False, False, False, False] * 4,
dtype="category",
dtype="boolean",
),
"bool no nan": pd.Series(
[False, False, False, False, True] * 4,
@@ -563,7 +565,9 @@ def test_imputer_int_preserved():
transformed,
pd.DataFrame(pd.Series([1, 2, 11, 14 / 3])),
)
assert {k: type(v) for k, v in transformed.ww.logical_types.items()} == {0: Double}
assert {k: type(v) for k, v in transformed.ww.logical_types.items()} == {
0: IntegerNullable,
Contributor Author

This is just how the new woodwork inference manifests in terms of woodwork types. Before, None values became numpy.nan; now they become pd.NA and, depending on the other values in the feature, the column gets the woodwork logical type IntegerNullable or BooleanNullable.

}

X = pd.DataFrame(pd.Series([1, 2, 3, np.nan]))
imputer = Imputer(numeric_impute_strategy="mean")
@@ -573,7 +577,9 @@ def test_imputer_int_preserved():
pd.DataFrame(pd.Series([1, 2, 3, 2])),
check_dtype=False,
)
assert {k: type(v) for k, v in transformed.ww.logical_types.items()} == {0: Double}
assert {k: type(v) for k, v in transformed.ww.logical_types.items()} == {
0: IntegerNullable,
}

X = pd.DataFrame(pd.Series([1, 2, 3, 4], dtype="int"))
imputer = Imputer(numeric_impute_strategy="mean")
@@ -595,9 +601,9 @@ def test_imputer_bool_preserved(test_case, null_type):
]
X = pd.DataFrame(pd.Series([True, False, True, null_type] * 4))
expected = pd.DataFrame(
pd.Series([True, False, True, True] * 4, dtype="category"),
pd.Series([True, False, True, True] * 4, dtype="boolean"),
)
expected_ww_dtype = Categorical
expected_ww_dtype = BooleanNullable
check_dtype = True
elif test_case == "boolean_without_null":
X = pd.DataFrame(pd.Series([True, False, True, False] * 4))
21 changes: 12 additions & 9 deletions evalml/tests/component_tests/test_per_column_imputer.py
@@ -219,15 +219,18 @@ def test_fit_transform_drop_all_nan_columns():
"another_col": {"impute_strategy": "most_frequent"},
}
transformer = PerColumnImputer(impute_strategies=strategies)
X_expected_arr = pd.DataFrame({"some_nan": [0, 1, 0], "another_col": [0, 1, 2]})
X_expected_arr = pd.DataFrame(
{"some_nan": [0, 1, 0], "another_col": [0, 1, 2]},
).astype({"some_nan": "Int64"})
Contributor Author

The imputer returns the new nullable integer (Int64) dtype here.

X_t = transformer.fit_transform(X)
assert_frame_equal(X_expected_arr, X_t, check_dtype=False)
# Check that original dataframe remains unchanged
assert_frame_equal(
X,
pd.DataFrame(
{
"all_nan": [np.nan, np.nan, np.nan],
"some_nan": [0.0, 1.0, 0.0],
"some_nan": [0, 1, 0],
"another_col": [0, 1, 2],
},
),
@@ -259,7 +262,7 @@ def test_transform_drop_all_nan_columns():
pd.DataFrame(
{
"all_nan": [np.nan, np.nan, np.nan],
"some_nan": [0.0, 1.0, 0.0],
"some_nan": [0, 1, 0],
"another_col": [0, 1, 2],
},
),
@@ -347,8 +350,9 @@ def test_per_column_imputer_column_subset():
)
X_expected.ww.init(
logical_types={
"all_nan_not_included": "double",
"column_with_nan_included": "double",
"all_nan_not_included": "Double",
"column_with_nan_not_included": "IntegerNullable",
"column_with_nan_included": "IntegerNullable",
},
)
X.ww.init(
@@ -362,11 +366,10 @@ def test_per_column_imputer_column_subset():
{
"all_nan_not_included": [np.nan, np.nan, np.nan],
"all_nan_included": [np.nan, np.nan, np.nan],
"column_with_nan_not_included": [np.nan, 1, 0],
# Because of https://github.com/alteryx/evalml/issues/2055
"column_with_nan_included": [0.0, 1.0, 0.0],
"column_with_nan_not_included": [pd.NA, 1, 0],
"column_with_nan_included": [0, 1, 0],
},
),
).astype({"column_with_nan_not_included": "Int64"}),
)

