[FAKE] GMM IC PR for comment #43
base: main
Conversation
Different combinations of initialization, GMM,
and cluster numbers are used and the clustering
with the best selection criterion (BIC or AIC) is chosen.
Suggest making this match LassoLarsIC a bit closer, e.g. "Such criteria are useful to select the value of the regularization parameter by making a trade-off between the goodness of fit and the complexity of the model." You could basically replace "regularization parameter" with "Gaussian mixture parameters".
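A possible adaptation along those lines (just a sketch of the wording, not final):

```
Such criteria are useful to select the Gaussian mixture parameters by
making a trade-off between the goodness of fit and the complexity of
the model.
```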
n_init : int, optional (default = 1)
    If ``n_init`` is larger than 1, additional
    ``n_init``-1 runs of :class:`sklearn.mixture.GaussianMixture`
    initialized with k-means will be performed
not necessarily initialized with k-means, right?
    initialized with k-means will be performed
    for all covariance parameters in ``covariance_type``.

init_params : {'kmeans' (default), 'k-means++', 'random', 'random_from_data'}
Perhaps worth explaining the options; mainly, I don't know what `random_from_data` is from this description.
Also, is k-means++ not the default? If not, why not? I think it is in sklearn, if I remember correctly.
Yeah, not sure; apparently `kmeans` is the default in `GaussianMixture`.
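For reference, what the four options mean in sklearn's `GaussianMixture` (summarizing the sklearn docs; worth double-checking against the version being targeted):

```python
from sklearn.mixture import GaussianMixture

# init_params controls how the responsibilities are initialized:
#   "kmeans"           - run k-means and use its assignments (sklearn default)
#   "k-means++"        - use only the k-means++ seeding step, not a full k-means
#   "random"           - responsibilities drawn at random
#   "random_from_data" - initial means picked as randomly chosen data points
gm = GaussianMixture(n_components=3, init_params="random_from_data", random_state=0)
```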
Attributes
----------
best_criterion_ : float
LassoLarsIC calls this `criterion_`.
covariance_type_ : str
    Covariance type for the model with the best bic/aic.

best_model_ : :class:`sklearn.mixture.GaussianMixture`
In LassoLarsIC, there is no "sub-object" with the best model; rather, the whole class just operates as if it is that model. Does that make sense? While I can't speak for them, my guess is this is closer to what they'd be expecting.
I added the attributes like `weights_` and `means_` from GaussianMixture into GaussianMixtureIC, but I found that I still need to save the best model (I call it `best_estimator_` in the newest version) in order to call `predict`. Did I understand you correctly?
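One way to reconcile the two (a sketch, not the PR's actual code; it follows the `best_estimator_` naming from the reply above): mirror the fitted attributes so the IC class reads like the chosen model, and keep the sub-estimator only for method delegation.

```python
from sklearn.base import BaseEstimator
from sklearn.mixture import GaussianMixture

class GaussianMixtureIC(BaseEstimator):
    # __init__ and the BIC/AIC search are omitted here.

    def _set_best(self, best: GaussianMixture):
        # Mirror the fitted attributes on self, in the spirit of
        # LassoLarsIC "operating as" the selected model.
        self.best_estimator_ = best
        self.weights_ = best.weights_
        self.means_ = best.means_
        self.covariances_ = best.covariances_

    def predict(self, X):
        # Methods still need the fitted sub-estimator, hence storing it.
        return self.best_estimator_.predict(X)
```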
best_model_ : :class:`sklearn.mixture.GaussianMixture`
    Object with the best bic/aic.

labels_ : array-like, shape (n_samples,)
not a property of GaussianMixture, recommend not storing
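If callers need the labels, they can be computed on demand rather than stored; a minimal sketch (`gmic` is a hypothetical fitted `GaussianMixtureIC`):

```python
# Equivalent to a stored labels_ attribute, computed when needed:
labels = gmic.predict(X)
```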
self.criterion = criterion
self.n_jobs = n_jobs

def _check_multi_comp_inputs(self, input, name, default):
I usually make any methods that don't access `self` into functions.
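For instance, a sketch of the style being suggested; `_check_multi_comp_inputs` is the PR's helper, but the body here is hypothetical:

```python
import numpy as np

# Module-level function instead of a method: it touches no instance state,
# so it is easier to test in isolation. (Body is hypothetical, kept
# minimal for illustration.)
def _check_multi_comp_inputs(input, name, default):
    if input is None:
        return default
    return list(np.unique(input))
```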
    name="min_components",
    target_type=int,
)
check_scalar(
The min value here could be `min_components`?
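i.e., something like this (a sketch; it assumes this second `check_scalar` call is validating `max_components`):

```python
from sklearn.utils import check_scalar

check_scalar(
    self.max_components,
    name="max_components",
    target_type=int,
    min_val=self.min_components,  # max_components must be >= min_components
)
```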
else:
    criterion_value = model.aic(X)

# change the precision of "criterion_value" based on sample size
could you explain this?
)
best_criter = [result.criterion for result in results]

if sum(best_criter == np.min(best_criter)) == 1:
This all seems fine, but just a suggestion: https://numpy.org/doc/stable/reference/generated/numpy.argmin.html
The docs imply that for ties, argmin gives the first. So, in other words, if results are sorted in order of complexity, just using argmin would do what you want (you can even leave a comment to this effect, if you go this route).
Note that I think having the results sorted by complexity is probably desirable anyway?
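A sketch of that simplification (assuming `results` is the list built above, ordered from least to most complex model):

```python
import numpy as np

criteria = [result.criterion for result in results]
# np.argmin returns the first index on ties, so with results ordered by
# increasing model complexity this picks the simplest of the tied models.
best_idx = int(np.argmin(criteria))
best_result = results[best_idx]
```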
class _CollectResults: |
This is effectively a dictionary; recommend just using one, or a named tuple? I am just anti classes that only store data and don't have any methods, but that is just my style :)
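E.g. (a sketch; field names are guessed from how the results are used in this diff):

```python
from collections import namedtuple

# Immutable and lightweight, but still allows result.criterion access.
ClusterResult = namedtuple("ClusterResult", ["criterion", "model", "gm_params"])

example = ClusterResult(criterion=123.4, model=None, gm_params={"n_components": 2})
print(example.criterion)  # 123.4
```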
param_grid = dict(
    covariance_type=covariance_type,
    n_components=range(self.min_components, self.max_components + 1),
)
param_grid = list(ParameterGrid(param_grid))

seeds = random_state.randint(np.iinfo(np.int32).max, size=len(param_grid))

if parse_version(joblib.__version__) < parse_version("0.12"):
    parallel_kwargs = {"backend": "threading"}
else:
    parallel_kwargs = {"prefer": "threads"}

results = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, **parallel_kwargs)(
    delayed(self._fit_cluster)(X, gm_params, seed)
    for gm_params, seed in zip(param_grid, seeds)
)
best_criter = [result.criterion for result in results]
Why not just use GridSearchCV as in their example? https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html#sphx-glr-auto-examples-mixture-plot-gmm-selection-py
It would abstract away some of the stuff you have to do to make parallel work, for instance.
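Roughly what the linked example does (paraphrased from the sklearn docs; the placeholder data here is just to make the snippet self-contained, and parallelism comes for free via `n_jobs`):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

def gmm_bic_score(estimator, X):
    # GridSearchCV maximizes scores, so negate the BIC (lower is better).
    return -estimator.bic(X)

param_grid = {
    "n_components": range(1, 7),
    "covariance_type": ["spherical", "tied", "diag", "full"],
}
grid_search = GridSearchCV(
    GaussianMixture(), param_grid=param_grid, scoring=gmm_bic_score, n_jobs=-1
)
X = np.random.default_rng(0).normal(size=(500, 2))  # placeholder data
grid_search.fit(X)
best_gmm = grid_search.best_estimator_
```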