
[FAKE] GMM IC PR for comment #43

Open · wants to merge 353 commits into main
Conversation

bdpedigo

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

Comment on lines 29 to 31
Different combinations of initialization, GMM,
and cluster numbers are used and the clustering
with the best selection criterion (BIC or AIC) is chosen.
Author:

Suggest making this match LassoLarsIC a bit closer, e.g. "Such criteria are useful to select the value of the regularization parameter by making a trade-off between the goodness of fit and the complexity of the model." You could basically replace "regularization parameter" with "Gaussian mixture parameters".
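Applying that substitution, the sentence would read roughly as follows (hypothetical docstring text, not taken from the PR):

    Such criteria are useful to select the Gaussian mixture parameters by
    making a trade-off between the goodness of fit and the complexity of
    the model.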

n_init : int, optional (default = 1)
If ``n_init`` is larger than 1, additional
``n_init``-1 runs of :class:`sklearn.mixture.GaussianMixture`
initialized with k-means will be performed
Author:

Not necessarily initialized with k-means, right?

initialized with k-means will be performed
for all covariance parameters in ``covariance_type``.

init_params : {'kmeans' (default), 'k-means++', 'random', 'random_from_data'}
Author:

Perhaps worth explaining the options; mainly, I don't know what ``random_from_data`` is from this description.

Author:

Also, is k-means++ not the default? If not, why not? I think it is in sklearn, if I remember correctly.

Reply:

Yeah, not sure; apparently ``kmeans`` is the default in GaussianMixture.
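To make the options concrete, here is a minimal sketch comparing the four initializations (this assumes scikit-learn >= 1.1, where 'k-means++' and 'random_from_data' were added; note GaussianMixture's default is 'kmeans', while KMeans itself defaults to 'k-means++' for its own centroid seeding, which may be the source of the confusion):

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

    # 'random_from_data' picks the initial means directly from data points,
    # 'random' initializes responsibilities randomly, the other two run k-means
    for init in ["kmeans", "k-means++", "random", "random_from_data"]:
        gm = GaussianMixture(n_components=3, init_params=init, random_state=0).fit(X)
        print(init, round(gm.bic(X), 1))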


Attributes
----------
best_criterion_ : float
Author:

LassoLarsIC calls this ``criterion_``.

covariance_type_ : str
Covariance type for the model with the best bic/aic.

best_model_ : :class:`sklearn.mixture.GaussianMixture`
Author:

In LassoLarsIC, there is no "sub-object" with the best model; rather, the whole class just operates as if it is that model. Does that make sense? While I can't speak for them, my guess is this is closer to what they'd be expecting.

Reply:

I added the attributes like ``weights_`` and ``means_`` from GaussianMixture into GaussianMixtureIC, but I found that I still need to save the best model (I call it ``best_estimator_`` in the newest version) in order to call ``predict``. Did I understand you correctly?
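A sketch of what that could look like (the helper ``_store_best`` and the class layout here are hypothetical; only ``weights_``, ``means_``, and ``covariances_`` are actual fitted attributes of GaussianMixture):

    from sklearn.base import BaseEstimator
    from sklearn.utils.validation import check_is_fitted

    class GaussianMixtureIC(BaseEstimator):
        # ... __init__ and the search over parameter combinations omitted ...

        def _store_best(self, best_model, best_criterion):
            # keep the fitted sub-estimator so predict() can delegate to it,
            # but also mirror its fitted attributes LassoLarsIC-style
            self.best_estimator_ = best_model
            self.criterion_ = best_criterion
            self.weights_ = best_model.weights_
            self.means_ = best_model.means_
            self.covariances_ = best_model.covariances_

        def predict(self, X):
            check_is_fitted(self)
            return self.best_estimator_.predict(X)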

best_model_ : :class:`sklearn.mixture.GaussianMixture`
Object with the best bic/aic.

labels_ : array-like, shape (n_samples,)
Author:

Not a property of GaussianMixture; recommend not storing it.

self.criterion = criterion
self.n_jobs = n_jobs

def _check_multi_comp_inputs(self, input, name, default):
Author:

I usually make any methods that don't access ``self`` into module-level functions.
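E.g., something like this (signature copied from the PR; the body is elided here):

    # module-level helper rather than a method, since it never touches `self`
    def _check_multi_comp_inputs(input, name, default):
        ...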

name="min_components",
target_type=int,
)
check_scalar(
Author:

Could the ``min_val`` here be ``min_components``?
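I.e., something like this sketch (``check_scalar`` does accept a ``min_val`` keyword; the stand-in values are for illustration only):

    from sklearn.utils import check_scalar

    min_components, max_components = 2, 10  # stand-in values

    # bound max_components from below by min_components in the scalar check
    check_scalar(
        max_components,
        name="max_components",
        target_type=int,
        min_val=min_components,
    )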

else:
criterion_value = model.aic(X)

# change the precision of "criterion_value" based on sample size
Author:

Could you explain this?

)
best_criter = [result.criterion for result in results]

if sum(best_criter == np.min(best_criter)) == 1:
Author:

This all seems fine, but just a suggestion: https://numpy.org/doc/stable/reference/generated/numpy.argmin.html
The docs imply that for ties, ``argmin`` gives the first occurrence. So in other words, if the results are sorted in order of complexity, just using ``argmin`` would do what you want. (You can even leave a comment to this effect, if you go this route.)

Note that I think having the results sorted by complexity is probably desirable anyway?
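A quick illustration of that tie-breaking behavior:

    import numpy as np

    criteria = np.array([10.0, 7.0, 7.0, 9.0])  # tie between indices 1 and 2
    # np.argmin returns the first occurrence on ties, so if the results are
    # sorted from least to most complex, this picks the simplest tied model
    assert int(np.argmin(criteria)) == 1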




class _CollectResults:
Author:

This is effectively a dictionary; recommend just using one, or a named tuple? I'm just anti classes that only store data and don't have any methods, but that is just my style :)
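For example (the field names here are guesses based on how the results are used in this PR):

    from typing import NamedTuple

    from sklearn.mixture import GaussianMixture

    class GMMResult(NamedTuple):
        model: GaussianMixture
        criterion: float
        n_components: int
        covariance_type: str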

Comment on lines 306 to 323
param_grid = dict(
covariance_type=covariance_type,
n_components=range(self.min_components, self.max_components + 1),
)
param_grid = list(ParameterGrid(param_grid))

seeds = random_state.randint(np.iinfo(np.int32).max, size=len(param_grid))

if parse_version(joblib.__version__) < parse_version("0.12"):
parallel_kwargs = {"backend": "threading"}
else:
parallel_kwargs = {"prefer": "threads"}

results = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, **parallel_kwargs)(
delayed(self._fit_cluster)(X, gm_params, seed)
for gm_params, seed in zip(param_grid, seeds)
)
best_criter = [result.criterion for result in results]
Author:

Why not just use GridSearchCV as in their example? https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html#sphx-glr-auto-examples-mixture-plot-gmm-selection-py

It would abstract away some of the stuff you have to do to make Parallel work, for instance.
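The linked example boils down to this pattern (a sketch; GridSearchCV maximizes its score, so the scorer returns the negative BIC):

    from sklearn.mixture import GaussianMixture
    from sklearn.model_selection import GridSearchCV

    def gmm_bic_score(estimator, X):
        # callable scorer: GridSearchCV maximizes, BIC is minimized
        return -estimator.bic(X)

    param_grid = {
        "n_components": range(1, 7),
        "covariance_type": ["spherical", "tied", "diag", "full"],
    }
    grid_search = GridSearchCV(
        GaussianMixture(), param_grid=param_grid, scoring=gmm_bic_score
    )
    # grid_search.fit(X); grid_search.best_estimator_ is then the winner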

github-actions bot commented Jun 21, 2023

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


ruff

ruff detected issues. Please run ruff --fix --output-format=full . locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.5.1.


examples/linear_model/plot_tweedie_regression_insurance_claims.py:82:35: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
   |
81 |     # unquote string fields
82 |     for column_name in df.columns[df.dtypes.values == object]:
   |                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
83 |         df[column_name] = df[column_name].str.strip("'")
84 |     return df.iloc[:n_samples]
   |

sklearn/cluster/_optics.py:327:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
325 |         """
326 |         dtype = bool if self.metric in PAIRWISE_BOOLEAN_FUNCTIONS else float
327 |         if dtype == bool and X.dtype != bool:
    |            ^^^^^^^^^^^^^ E721
328 |             msg = (
329 |                 "Data will be converted to boolean for"
    |

sklearn/cluster/tests/test_dbscan.py:294:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
292 |     obj = DBSCAN()
293 |     s = pickle.dumps(obj)
294 |     assert type(pickle.loads(s)) == obj.__class__
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
    |

sklearn/linear_model/tests/test_ridge.py:1023:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
     |
1022 |     assert len(ridge_cv.coef_.shape) == 1
1023 |     assert type(ridge_cv.intercept_) == np.float64
     |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
1024 | 
1025 |     cv = KFold(5)
     |

sklearn/linear_model/tests/test_ridge.py:1031:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
     |
1030 |     assert len(ridge_cv.coef_.shape) == 1
1031 |     assert type(ridge_cv.intercept_) == np.float64
     |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
     |

sklearn/metrics/pairwise.py:2391:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
     |
2389 |         dtype = bool if metric in PAIRWISE_BOOLEAN_FUNCTIONS else "infer_float"
2390 | 
2391 |         if dtype == bool and (X.dtype != bool or (Y is not None and Y.dtype != bool)):
     |            ^^^^^^^^^^^^^ E721
2392 |             msg = "Data was converted to boolean for metric %s" % metric
2393 |             warnings.warn(msg, DataConversionWarning)
     |

sklearn/model_selection/_split.py:2938:27: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
     |
2936 |                 if value is None and hasattr(self, "cvargs"):
2937 |                     value = self.cvargs.get(key, None)
2938 |             if len(w) and w[0].category == FutureWarning:
     |                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
2939 |                 # if the parameter is deprecated, don't show it
2940 |                 continue
     |

sklearn/model_selection/tests/test_validation.py:589:20: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
588 |             # Make sure all the arrays are of np.ndarray type
589 |             assert type(cv_results["test_r2"]) == np.ndarray
    |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
590 |             assert type(cv_results["test_neg_mean_squared_error"]) == np.ndarray
591 |             assert type(cv_results["fit_time"]) == np.ndarray
    |

sklearn/model_selection/tests/test_validation.py:590:20: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
588 |             # Make sure all the arrays are of np.ndarray type
589 |             assert type(cv_results["test_r2"]) == np.ndarray
590 |             assert type(cv_results["test_neg_mean_squared_error"]) == np.ndarray
    |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
591 |             assert type(cv_results["fit_time"]) == np.ndarray
592 |             assert type(cv_results["score_time"]) == np.ndarray
    |

sklearn/model_selection/tests/test_validation.py:591:20: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
589 |             assert type(cv_results["test_r2"]) == np.ndarray
590 |             assert type(cv_results["test_neg_mean_squared_error"]) == np.ndarray
591 |             assert type(cv_results["fit_time"]) == np.ndarray
    |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
592 |             assert type(cv_results["score_time"]) == np.ndarray
    |

sklearn/model_selection/tests/test_validation.py:592:20: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
590 |             assert type(cv_results["test_neg_mean_squared_error"]) == np.ndarray
591 |             assert type(cv_results["fit_time"]) == np.ndarray
592 |             assert type(cv_results["score_time"]) == np.ndarray
    |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
593 | 
594 |             # Ensure all the times are within sane limits
    |

sklearn/utils/estimator_checks.py:1504:8: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
     |
1503 |     # func can output tuple (e.g. score_samples)
1504 |     if type(result_full) == tuple:
     |        ^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
1505 |         result_full = result_full[0]
1506 |         result_by_batch = list(map(lambda x: x[0], result_by_batch))
     |

sklearn/utils/tests/test_validation.py:1344:12: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
     |
1342 |         )
1343 |     assert str(raised_error.value) == str(err_msg)
1344 |     assert type(raised_error.value) == type(err_msg)
     |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E721
     |

sklearn/utils/validation.py:882:49: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
880 |         if all(isinstance(dtype_iter, np.dtype) for dtype_iter in dtypes_orig):
881 |             dtype_orig = np.result_type(*dtypes_orig)
882 |         elif pandas_requires_conversion and any(d == object for d in dtypes_orig):
    |                                                 ^^^^^^^^^^^ E721
883 |             # Force object if any of the dtypes is an object
884 |             dtype_orig = object
    |

Found 14 errors.

Generated for commit: e2f9a77. Link to the linter CI: here
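For what it's worth, the typical fix for these E721 warnings is an isinstance() check, e.g. for the estimator_checks case above (the stand-in value is for illustration only):

    result_full = (1, 2)  # stand-in value

    # flagged by ruff:
    #     if type(result_full) == tuple:
    if isinstance(result_full, tuple):  # preferred form
        result_full = result_full[0]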

lithomas1 and others added 25 commits May 8, 2024 11:21
Co-authored-by: Jérémie du Boisberranger <[email protected]>
Co-authored-by: Christian Lorentzen <[email protected]>
lesteve and others added 30 commits July 9, 2024 15:13
…d without a y argument (scikit-learn#29402)

Co-authored-by: Loïc Estève <[email protected]>
Co-authored-by: Lucy Liu <[email protected]>
…nd local caching (scikit-learn#29354)

Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Loïc Estève <[email protected]>