Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] pca: n_features_ attribute of decomposition.PCA is deprecated in favor of n_features_in_ #6249

Merged
merged 1 commit into from
Apr 21, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 8 additions & 8 deletions Orange/projection/pca.py
Original file line number Diff line number Diff line change
Expand Up @@ -99,12 +99,12 @@ class ImprovedPCA(skl_decomposition.PCA):
# pylint: disable=too-many-branches
def _fit(self, X):
"""Dispatch to the right submethod depending on the chosen solver."""
X = check_array(
X = self._validate_data(
X,
accept_sparse=["csr", "csc"],
dtype=[np.float64, np.float32],
ensure_2d=True,
copy=self.copy,
reset=False,
accept_sparse=["csr", "csc"],
copy=self.copy
)

# Handle n_components==None
Expand Down Expand Up @@ -201,7 +201,7 @@ def _fit_truncated(self, X, n_components, svd_solver):
random_state=random_state,
)

self.n_samples_, self.n_features_ = n_samples, n_features
self.n_samples_ = n_samples
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works because:

Most of the code in ImprovedPCA is copied directly from sklearn.decomposition.PCA. This was done because it wasn't possible to add the randomized algorithm that enables PCA on sparse data without modifying these functions.
Until recently, sklearn's PCA used self.n_features_. However, they recently switched over to using self.n_features_in, which is more generally used in sklearn, and is part of their BaseEstimator class. The self.n_features_in attribute is set in BaseEstimator._check_n_features, which is, in turn, called by BaseEstimator._validate_data. In scikit-learn's PCA, this method is called in PCA._fit, which ensures that the self.n_features_in attribute is set on sklearn's implementation.

They also deprecated self.n_features and replaced it with a readonly property which just returns self.n_featuers_in anyways. This is still the same functionality as before, but you have to trace it through these different methods. So, this is still backwards compatible.

So, we can do the same thing in our ImprovedPCA class. We can get rid of self.n_features_ = features, since this will be set when we call self._validate_data, which is inherited from sklearn.decomposition.PCA <- sklearn.BaseEstimator. So, the other change we need is to replace our previous call to check_array to self._validate_data. And I believe this should ensure the exact same functionality as before.

Copy link
Collaborator

@pavlin-policar pavlin-policar Dec 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The transform method should also probably be updated to reflect how sklearn is doing things now.

So, this would mean changing

        X = check_array(
            X,
            accept_sparse=["csr", "csc"],
            dtype=[np.float64, np.float32],
            ensure_2d=True,
            copy=self.copy,
        )

on lines 224-230 to

X = self._validate_data(
    X,
    dtype=[np.float64, np.float32],
    reset=False,
    accept_sparse=["csr", "csc"],
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pavlin-policar, thanks for the explanation.

self.components_ = V
self.n_components_ = n_components

Expand All @@ -221,12 +221,12 @@ def _fit_truncated(self, X, n_components, svd_solver):
def transform(self, X):
check_is_fitted(self, ["mean_", "components_"], all_or_any=all)

X = check_array(
X = self._validate_data(
X,
accept_sparse=["csr", "csc"],
dtype=[np.float64, np.float32],
ensure_2d=True,
copy=self.copy,
reset=False,
copy=self.copy
)

if self.mean_ is not None:
Expand Down