
MESMER-X: refactor and test find first guess module #577

Open · wants to merge 39 commits into main
Conversation

veni-vidi-vici-dormivi (Collaborator) commented Dec 6, 2024

I test the first guess functionality of distrib_cov and, along the way, started refactoring the functions as well.
The refactoring includes:

  • converting some class attributes (self.var = ...) to locals
  • making some functions private
  • using the mean squared error in the loss functions fg_fun_* instead of the summed squared error, for easier testing (see the sketch after this list)
  • some renaming and rearranging for better readability
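A minimal sketch of the loss change (the standalone `loss` helper here is hypothetical; the real fg_fun_* losses operate on the distribution coefficients):

```python
import numpy as np


def loss(residuals):
    # mean squared error: unlike the summed squared error it does not grow
    # with the sample size, so expected values in tests are easier to reason about
    return np.mean(residuals**2)


# doubling the sample leaves the MSE-based loss unchanged
assert loss(np.array([1.0, -1.0])) == loss(np.array([1.0, -1.0, 1.0, -1.0]))
```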

I wrote tests for find_fg() covering several distributions, a few expressions, and a few starting first guesses. Note that the tests generally use the simplest setup possible, like the standard normal distribution or only the mean varying with the predictor. This checks whether we manage to find the right solution for the simplest use cases; as we know, it is impossible to test all expressions and distributions.
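A minimal sketch of that test pattern (the import path is an assumption; the call names follow the snippets quoted in the review below):

```python
import numpy as np

from mesmer.mesmer_x import Expression, distrib_cov  # import path is an assumption

rng = np.random.default_rng(0)
n = 251

pred = np.ones(n)
targ = rng.normal(loc=0, scale=1, size=n)  # standard normal target

expression = Expression("norm(loc=c1, scale=c2)", expr_name="exp1")
dist = distrib_cov(targ, {"tas": pred}, expression)

dist.find_fg()
np.testing.assert_allclose(dist.fg_coeffs, [0.0, 1.0], atol=0.1)
```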

  • Tests added
  • Fully documented, including CHANGELOG.rst

codecov bot commented Dec 6, 2024

Codecov Report

Attention: Patch coverage is 98.38710% with 1 line in your changes missing coverage. Please review.

Project coverage is 80.10%. Comparing base (8f75b2e) to head (3fccd28).
Report is 7 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| mesmer/mesmer_x/train_l_distrib_mesmerx.py | 98.38% | 1 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #577      +/-   ##
==========================================
+ Coverage   78.12%   80.10%   +1.98%
==========================================
  Files          49       49
  Lines        3012     3066      +54
==========================================
+ Hits         2353     2456     +103
+ Misses        659      610      -49
```

| Flag | Coverage Δ |
| --- | --- |
| unittests | 80.10% <98.38%> (+1.98%) ⬆️ |


Comment on lines +1083 to +1087:

```python
if not globalfit_all.success:
    raise ValueError(
        "Global optimization for first guess failed, please check boundaries_coeff "
        "or disable fg_with_global_opti in options_solver."
    )
```
veni-vidi-vici-dormivi (Collaborator, Author):

I'm not sure whether this could also be only a warning. Before, if it failed, self.coeffs would just be None. @yquilcaille, could that become a problem at some point? I will also look into this a little more.
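For reference, the warning variant mentioned here could look something like this (a sketch, not the PR's code):

```python
import warnings

if not globalfit_all.success:
    warnings.warn(
        "Global optimization for first guess failed; check boundaries_coeff "
        "or disable fg_with_global_opti in options_solver."
    )
```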

```python
)
dist2.find_fg()
result2 = dist2.fg_coeffs
np.testing.assert_equal(result2, result) # No
```
veni-vidi-vici-dormivi (Collaborator, Author):

Providing a first guess does not make the result any better in this case, but it might in more difficult cases, or at least speed up the fitting; we could also remove this test. Note that the user cannot provide a first guess in xr_train_distrib. We could think about adding that.
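For context, the setup this test compares against looks roughly like this (reusing the names from the quoted test; the keyword name first_guess is an assumption based on the snippet above):

```python
# construct a second distrib_cov with a user-provided first guess
dist2 = distrib_cov(
    targ, {"tas": pred}, expression, first_guess=np.array([0.0, 1.0])
)
dist2.find_fg()
result2 = dist2.fg_coeffs
```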

mathause (Member):

Suggested change:

```diff
-np.testing.assert_equal(result2, result) # No
+# NOTE: leads to the same result as without first guess
+np.testing.assert_equal(result2, result) # No
```

Comment on lines 109 to 111:

```python
# still finds a fg because we do not enforce the bounds on the fg
# however the fg is significantly worse on the param with the wrong bounds
# in contrast to the above this also runs step 6: fit on CDF or LL^n -> implications?
```
veni-vidi-vici-dormivi (Collaborator, Author) commented Dec 6, 2024:

Not sure what to think about this... might not be a systematic problem....

@veni-vidi-vici-dormivi veni-vidi-vici-dormivi changed the title MESMER-X: test find first guess module MESMER-X: refactor and test find first guess module Dec 20, 2024
mathause (Member) left a comment:

I have not looked at the tests yet.

Comment on lines 956 to 957:

```python
ind_targ_low = np.where(smooth_targ < mean_minus_one_std)[0]
ind_targ_high = np.where(smooth_targ > mean_plus_one_std)[0]
```
mathause (Member):

use _sm and _lg? (you see I like when words have the same length) but also ok to keep

veni-vidi-vici-dormivi (Collaborator, Author):

where? and what would it mean?

mathause (Member):

_sm(all) and _l(ar)g(e) ( 🤔 ), instead of low and high

veni-vidi-vici-dormivi (Collaborator, Author):

The fact that I didn't get that maybe speaks against doing it 😅. I see your point; we could also just go with small and large, but that's quite long, of course. I vote to keep it as is.

Comment on lines +1012 to +1015:

```python
# location might not be used (beta distribution) or set in the expression
if len(self.fg_ind_loc) > 0:
    localfit_loc = self._minimize(
        func=self._fg_fun_loc,
```
veni-vidi-vici-dormivi (Collaborator, Author):

This is to enable finding a first guess also for expr = Expression("loc=0, scale=c1", expr_name="name"). This is a rather major change; it also led me to add two tests to test_mesmer_x_expression, and I found a bug (#525 (comment)), so this might deserve its own PR?

mathause (Member):

Yes, it may be easier to see what is happening if done in a separate PR.

mathause (Member) left a comment:

Looking good, thanks. But it's such a large beast that it's super difficult to keep an overview. It's also difficult to know what is a limitation of the data and distribution and what is one of the code...

```diff
@@ -261,6 +257,16 @@ def np_train_distrib(
     return dfit.coefficients_fit, dfit.quality_fit


+def _smooth_data(data, nn=10):
```
mathause (Member):

Suggested change:

```diff
-def _smooth_data(data, nn=10):
+def _smooth_data(data, length=10):
```

(or size or something similar)
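For orientation, a plausible sketch of such a smoother (an assumption for illustration, not the PR's implementation), reading nn/length as a running-mean window:

```python
import numpy as np


def _smooth_data(data, length=10):
    # running mean over a window of `length` points (assumed behavior)
    kernel = np.ones(length) / length
    return np.convolve(data, kernel, mode="same")
```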


```python
    x0=self.fg_coeffs[self.fg_ind_loc],
    fact_maxfev_iter=len(self.fg_ind_loc) / self.n_coeffs,
    option_NelderMead="best_run",
```

```python
[self.expr_fit.coefficients_list.index(c) for c in loc_coeffs]
```
mathause (Member):

ok to keep, but I think that would better be done in Expression
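A hedged sketch of that idea, as a hypothetical method on Expression (not existing API):

```python
def coefficient_indices(self, names):
    """Return the positions of the given coefficient names in coefficients_list."""
    return [self.coefficients_list.index(name) for name in names]
```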

```python
# compared to all 0, better for ref level but worse for trend
x0 = np.full(len(scale), fill_value=np.std(self.data_targ))
# scale might not be used (beta distribution) or set in the expression
if len(self.fg_ind_sca) > 0:
```
mathause (Member):

Suggested change:

```diff
-if len(self.fg_ind_sca) > 0:
+if len(self.fg_ind_sca) > 0:
```

```python
pred = np.ones(n)
targ = rng.normal(loc=0, scale=1, size=n)

expression = Expression("norm(loc=c1, scale=c3)", expr_name="exp1")
```
mathause (Member):

Suggested change:

```diff
-expression = Expression("norm(loc=c1, scale=c3)", expr_name="exp1")
+expression = Expression("norm(loc=c1, scale=c2)", expr_name="exp1")
```

not that it matters...

```python
expected = [loc, scale]

np.testing.assert_allclose(result, expected, rtol=0.1)
```

mathause (Member):

Also test 1-2 discrete distributions?
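A hedged sketch of what such a test could look like, assuming the Expression syntax used elsewhere in this PR also accepts scipy's discrete distributions (an untested assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 501

# Poisson target with a constant rate; coefficient naming follows the other tests
expression = Expression("poisson(mu=c1)", expr_name="exp_pois")
targ = rng.poisson(lam=2.0, size=n).astype(float)
dist = distrib_cov(targ, {"tas": np.ones(n)}, expression)

dist.find_fg()
np.testing.assert_allclose(dist.fg_coeffs, [2.0], rtol=0.2)
```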

```python
expected = np.array([-0.005093813, 1.015267311])
np.testing.assert_allclose(result, expected, rtol=1e-5)

# test with wrong bounds
```
mathause (Member):

Suggested change:

```diff
-# test with wrong bounds
+# test with bounds outside true value
```

```python
)
dist.find_fg()
result = dist.fg_coeffs
expected = np.array([-0.005093817, 1.015267298])
```
mathause (Member):

So here we get an estimate outside the bounds, correct?

```python
expression = Expression("norm(loc=c1*__tas__, scale=c2)", expr_name="exp1")
dist = distrib_cov(targ, {"tas": pred}, expression)

smooth_targ = dist._smooth_data(targ)
```
mathause (Member):

didn't you turn that into a function?

```python
dist.fg_ind_loc = np.array([0])

# test local minima at true coefficients
loss_at_toolow = dist._fg_fun_loc(c1 - 1, targ)
```
mathause (Member):

Choose a smaller $\Delta$?
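For instance (reusing the quoted test's names; loss_at_true and loss_at_toohigh are assumed counterparts of the quoted line):

```python
# probe the loss closer to the true coefficient
eps = 0.1
loss_at_true = dist._fg_fun_loc(c1, targ)
loss_at_toolow = dist._fg_fun_loc(c1 - eps, targ)
loss_at_toohigh = dist._fg_fun_loc(c1 + eps, targ)

assert loss_at_true < loss_at_toolow
assert loss_at_true < loss_at_toohigh
```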
