
[FAKE] GMM IC PR for comment #43

Open · wants to merge 353 commits into main
Conversation

bdpedigo

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

Comment on lines 29 to 31
Different combinations of initialization, GMM,
and cluster numbers are used and the clustering
with the best selection criterion (BIC or AIC) is chosen.
Author:

Suggest making this match LassoLarsIC a bit closer, e.g. "Such criteria are useful to select the value of the regularization parameter by making a trade-off between the goodness of fit and the complexity of the model." You could basically replace "regularization parameter" with "Gaussian mixture parameters".
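Applying that substitution, the sentence would read roughly as follows (hypothetical docstring text, not taken from the PR):

    Such criteria are useful to select the Gaussian mixture parameters by
    making a trade-off between the goodness of fit and the complexity of
    the model.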

n_init : int, optional (default = 1)
If ``n_init`` is larger than 1, additional
``n_init``-1 runs of :class:`sklearn.mixture.GaussianMixture`
initialized with k-means will be performed
Author:

Not necessarily initialized with k-means, right?

initialized with k-means will be performed
for all covariance parameters in ``covariance_type``.

init_params : {'kmeans' (default), 'k-means++', 'random', 'random_from_data'}
Author:

Perhaps worth explaining the options; mainly, I don't know what ``random_from_data`` is from this description.

Author:

Also, is k-means++ not the default? If not, why not? I think it is in sklearn, if I remember correctly.

Reply:

Yeah, not sure; apparently ``kmeans`` is the default in GaussianMixture.
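To make the options concrete, here is a minimal sketch comparing the four initializations (this assumes scikit-learn >= 1.1, where 'k-means++' and 'random_from_data' were added; note GaussianMixture's default is 'kmeans', while KMeans itself defaults to 'k-means++' for its own centroid seeding, which may be the source of the confusion):

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

    # 'random_from_data' picks the initial means directly from data points,
    # 'random' initializes responsibilities randomly, the other two run k-means
    for init in ["kmeans", "k-means++", "random", "random_from_data"]:
        gm = GaussianMixture(n_components=3, init_params=init, random_state=0).fit(X)
        print(init, round(gm.bic(X), 1))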


Attributes
----------
best_criterion_ : float
Author:

LassoLarsIC calls this ``criterion_``.

covariance_type_ : str
Covariance type for the model with the best bic/aic.

best_model_ : :class:`sklearn.mixture.GaussianMixture`
Author:

In LassoLarsIC, there is no "sub-object" with the best model; rather, the whole class just operates as if it is that model. Does that make sense? While I can't speak for them, my guess is this is closer to what they'd be expecting.

Reply:

I added the attributes like ``weights_`` and ``means_`` from GaussianMixture into GaussianMixtureIC, but I found that I still need to save the best model (I call it ``best_estimator_`` in the newest version) in order to call ``predict``. Did I understand you correctly?
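A sketch of what that could look like (the helper ``_store_best`` and the class layout here are hypothetical; only ``weights_``, ``means_``, and ``covariances_`` are actual fitted attributes of GaussianMixture):

    from sklearn.base import BaseEstimator
    from sklearn.utils.validation import check_is_fitted

    class GaussianMixtureIC(BaseEstimator):
        # ... __init__ and the search over parameter combinations omitted ...

        def _store_best(self, best_model, best_criterion):
            # keep the fitted sub-estimator so predict() can delegate to it,
            # but also mirror its fitted attributes LassoLarsIC-style
            self.best_estimator_ = best_model
            self.criterion_ = best_criterion
            self.weights_ = best_model.weights_
            self.means_ = best_model.means_
            self.covariances_ = best_model.covariances_

        def predict(self, X):
            check_is_fitted(self)
            return self.best_estimator_.predict(X)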

best_model_ : :class:`sklearn.mixture.GaussianMixture`
Object with the best bic/aic.

labels_ : array-like, shape (n_samples,)
Author:

Not a property of GaussianMixture; recommend not storing it.

self.criterion = criterion
self.n_jobs = n_jobs

def _check_multi_comp_inputs(self, input, name, default):
Author:

I usually make any methods that don't access ``self`` into module-level functions.
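E.g., something like this (signature copied from the PR; the body is elided here):

    # module-level helper rather than a method, since it never touches `self`
    def _check_multi_comp_inputs(input, name, default):
        ...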

name="min_components",
target_type=int,
)
check_scalar(
Author:

Could the ``min_val`` here be ``min_components``?
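I.e., something like this sketch (``check_scalar`` does accept a ``min_val`` keyword; the stand-in values are for illustration only):

    from sklearn.utils import check_scalar

    min_components, max_components = 2, 10  # stand-in values

    # bound max_components from below by min_components in the scalar check
    check_scalar(
        max_components,
        name="max_components",
        target_type=int,
        min_val=min_components,
    )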

else:
criterion_value = model.aic(X)

# change the precision of "criterion_value" based on sample size
Author:

Could you explain this?

)
best_criter = [result.criterion for result in results]

if sum(best_criter == np.min(best_criter)) == 1:
Author:

This all seems fine, but just a suggestion: https://numpy.org/doc/stable/reference/generated/numpy.argmin.html
The docs imply that for ties, ``argmin`` gives the first occurrence. So in other words, if the results are sorted in order of complexity, just using ``argmin`` would do what you want. (You can even leave a comment to this effect, if you go this route.)

Note that I think having the results sorted by complexity is probably desirable anyway?
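A quick illustration of that tie-breaking behavior:

    import numpy as np

    criteria = np.array([10.0, 7.0, 7.0, 9.0])  # tie between indices 1 and 2
    # np.argmin returns the first occurrence on ties, so if the results are
    # sorted from least to most complex, this picks the simplest tied model
    assert int(np.argmin(criteria)) == 1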




class _CollectResults:
Author:

This is effectively a dictionary; recommend just using one, or a named tuple? I'm just anti classes that only store data and don't have any methods, but that is just my style :)
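For example (the field names here are guesses based on how the results are used in this PR):

    from typing import NamedTuple

    from sklearn.mixture import GaussianMixture

    class GMMResult(NamedTuple):
        model: GaussianMixture
        criterion: float
        n_components: int
        covariance_type: str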

Comment on lines 306 to 323
param_grid = dict(
covariance_type=covariance_type,
n_components=range(self.min_components, self.max_components + 1),
)
param_grid = list(ParameterGrid(param_grid))

seeds = random_state.randint(np.iinfo(np.int32).max, size=len(param_grid))

if parse_version(joblib.__version__) < parse_version("0.12"):
parallel_kwargs = {"backend": "threading"}
else:
parallel_kwargs = {"prefer": "threads"}

results = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, **parallel_kwargs)(
delayed(self._fit_cluster)(X, gm_params, seed)
for gm_params, seed in zip(param_grid, seeds)
)
best_criter = [result.criterion for result in results]
Author:

Why not just use GridSearchCV as in their example? https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html#sphx-glr-auto-examples-mixture-plot-gmm-selection-py

It would abstract away some of the stuff you have to do to make Parallel work, for instance.
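The linked example boils down to this pattern (a sketch; GridSearchCV maximizes its score, so the scorer returns the negative BIC):

    from sklearn.mixture import GaussianMixture
    from sklearn.model_selection import GridSearchCV

    def gmm_bic_score(estimator, X):
        # callable scorer: GridSearchCV maximizes, BIC is minimized
        return -estimator.bic(X)

    param_grid = {
        "n_components": range(1, 7),
        "covariance_type": ["spherical", "tied", "diag", "full"],
    }
    grid_search = GridSearchCV(
        GaussianMixture(), param_grid=param_grid, scoring=gmm_bic_score
    )
    # grid_search.fit(X); grid_search.best_estimator_ is then the winner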

github-actions bot commented Jun 21, 2023

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


ruff

ruff detected issues. Please run ruff --fix --output-format=full . locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.5.1.


examples/linear_model/plot_tweedie_regression_insurance_claims.py:82:35: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
   |
81 |     # unquote string fields
82 |     for column_name in df.columns[df.dtypes.values == object]:
   |                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
83 |         df[column_name] = df[column_name].str.strip("'")
84 |     return df.iloc[:n_samples]
   |

sklearn/cluster/_optics.py:327:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
325 |         """
326 |         dtype = bool if self.metric in PAIRWISE_BOOLEAN_FUNCTIONS else float
327 |         if dtype == bool and X.dtype != bool:
    |            ^^^^^^^^^^^^^ E721
328 |             msg = (
329 |                 "Data will be converted to boolean for"
    |

sklearn/cluster/tests/test_dbscan.py:294:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
292 |     obj = DBSCAN()
293 |     s = pickle.dumps(obj)
294 |     assert type(pickle.loads(s)) == obj.__class__
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
    |

sklearn/linear_model/tests/test_ridge.py:1023:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
     |
1022 |     assert len(ridge_cv.coef_.shape) == 1
1023 |     assert type(ridge_cv.intercept_) == np.float64
     |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
1024 | 
1025 |     cv = KFold(5)
     |

sklearn/linear_model/tests/test_ridge.py:1031:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
     |
1030 |     assert len(ridge_cv.coef_.shape) == 1
1031 |     assert type(ridge_cv.intercept_) == np.float64
     |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
     |

sklearn/metrics/pairwise.py:2391:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
     |
2389 |         dtype = bool if metric in PAIRWISE_BOOLEAN_FUNCTIONS else "infer_float"
2390 | 
2391 |         if dtype == bool and (X.dtype != bool or (Y is not None and Y.dtype != bool)):
     |            ^^^^^^^^^^^^^ E721
2392 |             msg = "Data was converted to boolean for metric %s" % metric
2393 |             warnings.warn(msg, DataConversionWarning)
     |

sklearn/model_selection/_split.py:2938:27: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
     |
2936 |                 if value is None and hasattr(self, "cvargs"):
2937 |                     value = self.cvargs.get(key, None)
2938 |             if len(w) and w[0].category == FutureWarning:
     |                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
2939 |                 # if the parameter is deprecated, don't show it
2940 |                 continue
     |

sklearn/model_selection/tests/test_validation.py:589:20: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
588 |             # Make sure all the arrays are of np.ndarray type
589 |             assert type(cv_results["test_r2"]) == np.ndarray
    |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
590 |             assert type(cv_results["test_neg_mean_squared_error"]) == np.ndarray
591 |             assert type(cv_results["fit_time"]) == np.ndarray
    |

sklearn/model_selection/tests/test_validation.py:590:20: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
588 |             # Make sure all the arrays are of np.ndarray type
589 |             assert type(cv_results["test_r2"]) == np.ndarray
590 |             assert type(cv_results["test_neg_mean_squared_error"]) == np.ndarray
    |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
591 |             assert type(cv_results["fit_time"]) == np.ndarray
592 |             assert type(cv_results["score_time"]) == np.ndarray
    |

sklearn/model_selection/tests/test_validation.py:591:20: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
589 |             assert type(cv_results["test_r2"]) == np.ndarray
590 |             assert type(cv_results["test_neg_mean_squared_error"]) == np.ndarray
591 |             assert type(cv_results["fit_time"]) == np.ndarray
    |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
592 |             assert type(cv_results["score_time"]) == np.ndarray
    |

sklearn/model_selection/tests/test_validation.py:592:20: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
590 |             assert type(cv_results["test_neg_mean_squared_error"]) == np.ndarray
591 |             assert type(cv_results["fit_time"]) == np.ndarray
592 |             assert type(cv_results["score_time"]) == np.ndarray
    |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
593 | 
594 |             # Ensure all the times are within sane limits
    |

sklearn/utils/estimator_checks.py:1504:8: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
     |
1503 |     # func can output tuple (e.g. score_samples)
1504 |     if type(result_full) == tuple:
     |        ^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
1505 |         result_full = result_full[0]
1506 |         result_by_batch = list(map(lambda x: x[0], result_by_batch))
     |

sklearn/utils/tests/test_validation.py:1344:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
     |
1342 |         )
1343 |     assert str(raised_error.value) == str(err_msg)
1344 |     assert type(raised_error.value) == type(err_msg)
     |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
     |

sklearn/utils/validation.py:882:49: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
880 |         if all(isinstance(dtype_iter, np.dtype) for dtype_iter in dtypes_orig):
881 |             dtype_orig = np.result_type(*dtypes_orig)
882 |         elif pandas_requires_conversion and any(d == object for d in dtypes_orig):
    |                                                 ^^^^^^^^^^^ E721
883 |             # Force object if any of the dtypes is an object
884 |             dtype_orig = object
    |

Found 14 errors.

Generated for commit: e2f9a77. Link to the linter CI: here
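For what it's worth, the typical fix for these E721 warnings is an isinstance() check, e.g. for the estimator_checks case above (the stand-in value is for illustration only):

    result_full = (1, 2)  # stand-in value

    # flagged by ruff:
    #     if type(result_full) == tuple:
    if isinstance(result_full, tuple):  # preferred form
        result_full = result_full[0]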

lithomas1 and others added 25 commits May 8, 2024 11:21
Co-authored-by: Jérémie du Boisberranger <[email protected]>
Co-authored-by: Christian Lorentzen <[email protected]>
lesteve and others added 30 commits July 9, 2024 15:13
…d without a y argument (scikit-learn#29402)

Co-authored-by: Loïc Estève <[email protected]>
Co-authored-by: Lucy Liu <[email protected]>
…nd local caching (scikit-learn#29354)

Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Loïc Estève <[email protected]>