-
Notifications
You must be signed in to change notification settings - Fork 296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduced precision in output for a number of regtests as they were fai… #950
base: master
Are you sure you want to change the base?
Conversation
Hello @gtribello I will take a look at the VES stuff. To clarify, is this only happening on your Mac? What setup are you using (OS version, compilers and version, etc)? Is it an Arm M2 or something like that? I also have a recent Arm M2 laptop, so I can try to reproduce the fails and fix the VES stuff. Omar |
Hello @valsson It is an Apple M2 Pro chip and I am running macOS Ventura. I am using the clang compilers that come with the developer tools. Good luck |
@gtribello some comments: basic/rt-make-1:I think this is a numerical error, it could happen with different architectures. Maybe you can remove these lines and reset. basic/rt-make-exceptions:This is unexpected. Can you paste the full python/rt-protein:It looks like dlopen would like to find a x86 library. I think that the loader is compiled with the same compiler used to compile python. Can you check if python is really compiled for arm and not running in emulation mode? |
I have the same setup (Mac M2 Pro / Ventura / clang+mpi), and I get the same errors with the master branch. However, this is all very strange as I also tried version 2.8.1 on the same Mac M2 Pro machine (same compilers), and there I got no errors at all in the ves regtests. The code in the ves module is nearly the same between the v2.8.1 and the master. Thus, there are no changes in the ves code that can explain the issues. I also checked the v2.9 branch and there are I observe the errors in the regtests. I also tested v2.8.2 and master branch on a Linux cluster and I do not observe any errors there. Thus, there is some change between 2.8.1 and 2.9 that leads to these regtests errors that are only observed on Mac OS. And these are changes outside the I don't think it is an issue with an uninitialised variable in the ves code, rather the regtest errors in the I have a hunch that this is somehow related to the parsing and conversion of input parameters, but I need to look into that better later. |
Okay, I must take back what I said about no errors with v2.8.1. I used a binary compiled on Feb 20, 2023 when I did that test. I have now re-compiled the v2.8.1 version and see the same errors. Thus, there was some change in the clang compilers between Feb 20 and now that results in these errors. Unfortunately, I don't have the old config.log file for the Feb 20 compilation, so I don't know the clang version used then. Thus, could these issues be due to some bugs in the clang compilers? |
The clang compilers are updated from time to time, so it likely they were updated on my computer between Feb 20 and now. At least, I can't see any reason for these errors in the ves regtests that can be explained by PLUMED. My feeling is rather that this must be some compiler bug/issue. At the moment I have the following clang version:
|
@valsson I think there are two possibilities:
I would assume the second more likely, but of course it's not 100% Then, the bug is either in ves code or the core. For instance, there might be some specific grid features that are only used in ves. Given that the error appears in several ves tests, it would be very useful to leverage on your knowledge of the code to either find the bug in ves or point us to the possibile unusual use case for core feature (e.g. grid) that triggers the problem so we can fix it. Meanwhile I will test on my intel Mac if I can use the same clang version. And I would suggest you or @gtribello to recompile with -O2 on arm, to see if the problem only appears when heavily optimizing. Another useful thing would be to find some flag to initialize memory to non null and see if this triggers the problem with gcc/Linux as well. And finally if we can have a minimal failing example, on Linux we can run it through Valgrind, which could detect access to initialized memory also in optimized code (not sure this can be done on Mac) Thanks!!! |
Ciao @maxbonomi @GiovanniBussi and @carlocamilloni Separately from the issues with ves you have probably noticed that decreasing the precision of the files containing the derivatives has caused all sorts of problems with the regression tests on Github. Some other strategy is thus required. I was wondering: What if I deleted the output of the numerical derivatives in these tests? The mismatches that I see in master are all because of differences in derivatives that are calculated numerically. Do we really need to test the numerical derivatives here. I mean, if someone is using EMMI as a collective variable they are going to be evaluating the derivatives analytically so why don't we just do: EMMI ... DUMPDERIVATIVES ARG=gmm.scoreb STRIDE=1 FILE=deriva FMT=%8.4f With no numerical derivatives at all? Would anyone object to making that change? |
I totally agree, I think we should keep a few tests with numerical derivatives to test the numerical derivatives implementations, but not for general CVs |
I agree, I think this is just there because during CVs implementation it is good to test numerical derivatives, but once they are correct it is enough to store the analytical value to check for future deviations |
Hello @gtribello and @GiovanniBussi I looked further into the issue and it seems that it is related to the conversion from string to a double in the setup of the grid. Let's take the Here I define the basis function to be on the range from -4.0 to 4.0 using inputs parameters that are converted from strings to double using Basis function: Grid Thus the grid has a max range that is slightly above 4.0 and when the calculation is done for the last grid point, the code detects it outside the range of basis functions and gives zero in Thus, for some reason the conversion in There is a similar issue with the target distributions used in this test, defined on -4.0 to 4.0, and the grid should also be defined on the same range but the last grid point is above 4.0, this is the reason for the fails in Interestingly, the Thus, in a nutshell, I think that the issue is somehow in the Tools::convert(str, double) used in Grid.cpp. Could this be some issue with lepton in Tools::convert(str, double)? One way to fix this issue within the ves code would be to change how the check for if we are inside the defined range is done for the basis functions and target distributions. At the moment, this is done using just a simple if larger than, see for example here. Instead I could do this with a numerical threshold so that a number like "4.000000000000000888178420" would be considered as 4.0. Still, I think it would be good to fix the issue with the Tools::convert(str, double) used in Grid.cpp. |
@valsson interesting. I am not sure I understood how the two conversions are done.
Thanks! |
Let me try to explain the difference between The two regtests had a small difference that one was using "-4.0" and "4.0", while the other was using "-4.0" and "+4.0", so an explicit "+" in one case, but I tested that and it does not change the results. In both cases, we define target distributions that are outputted to files. The Both of them use the same call to
The difference between the two calls Thus, in For some reason, the greatest using This difference with how the Thus, this does not make much sense to me. |
I will continue to do further checks when I have time |
…cision in others There are a number of regtests that fail on my new mac laptop. I can remove many of these failures by not printing the numerical derivatives. I have thus done this where I can. For the test in ves/rt-bf-custom-transform I have decreased the precision of the output so that the test passes on my machine
00d51cd
to
ce4cebf
Compare
…not passing on new mac and github
I cannot get these tests to pass simultaneously on github and my mac
Codecov ReportAll modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## master #950 +/- ##
=======================================
Coverage 84.07% 84.07%
=======================================
Files 602 602
Lines 56230 56230
=======================================
Hits 47276 47276
Misses 8954 8954 ☔ View full report in Codecov by Sentry. |
@valsson I am back trying to fix this in order to enable tests on GitHub Actions (they provide arm64 runners now!) I am trying to reproduce the problem in conversions with this input file:
On my mac mini it correctly prints
So I think the problem is not in Tools::convert. Maybe there's some processing done on the double precision numbers (e.g., subtracting and adding the same quantity again) that creates the problem? Another puzzling thing is that the number you reported (4.000000000000000888178420) seems identical to 4.0 when represented as a double (52 binary digits in mantissa ~ 15 significant decimal digits). It would be easy to remove the extra digits in convert, but I think that this is not the place where they were added. |
Hello @GiovanniBussi, thank you for reminding me about the issue! I have been looking at this again, and I believe that I have found the issue. Indeed, it is not related to Tools::convert, but rather seem to be related to the calculation of the dx_ in the grid, Line 230 in 315b86f
If I add the following print out at this place
I get the following output on my Mac laptop
This then leads to the issue with the VES basis functions when the code uses the Grid::getPoint function to get the values on the grid (this uses only min_[i] and dx_[i]) so that upper range of the grid has a value 4.000000000000000888178420 (not 4.0 like it should) But I am still trying to figure out what to make of this. I tried to do the same on a Linux cluster, and the output from the dx_ calculation is the same as above for the Mac Laptop. However, I do not have this issue with grid, the upper range is there 4.000000000000000000000000, and the regtest passes fine. Thus, could it somehow be used to the getPoint function in the grid? Line 106 in 315b86f
Could it be that on Mac/Arm/clang, the numerical issues with dx_ are acculmated when one loops over the points on the grid? |
…ling on my new mac
Description
This is a first attempt to fix some of the issues with the regtests that fail on my new Mac. 37 tests in the current master fail when they are run on my machine at the moment. I am creating this PR so that the issues are recorded somewhere so other folks can become aware of them. I hope we can work together to fix these problems. There are a variety of problems that I will describe below.
So first of this commit fixes issues with the following tests by reducing the precision of the output files:
basic/rt28,
basic/rt34,
basic/rt65,
emmi/rt-emmi-gauss-mpi/,
isdb/rt-emmi-gauss/,
isdb/rt-emmi-marginal-mpi/,
isdb/rt-emmi-marginal/,
isdb/rt-emmi-out-mpi/,
isdb/rt-emmi-out/,
secondarystructure/rt32/,
ves/rt-bf-custom-transform/
There are then some problems related to zeros and spaces. For example you get the following differences in basic/rt-multi-1:
Diff for ff.3:
14,16c14,16
< 12 553.589838 0.000000
< 13 553.589838 0.000000
< 14 553.589838 0.000000
This problem also affects isdb/rt-caliber
Ves uninitialised variable?
There also appears to be an initialised variable in ves somewhere. I think this is causing problems with the following tests:
ves/rt-bf-chebyshev/
ves/rt-bf-combined/
ves/rt-bf-combined/
ves/rt-bf-custom/
ves/rt-bf-gaussians/
ves/rt-bf-legendre-scaled/
ves/rt-bf-legendre/
ves/rt-bf-powers/
ves/rt-bf-splines/
ves/rt-bf-wavelets-db/
ves/rt-bf-wavelets-sym/
You basically see differences like this:
< 1 0.000000 1 T1(s)
< 2 -0.333319 2 T2(s)
< 3 0.000000 3 T3(s)
< 4 -0.066607 4 T4(s)
< 5 0.000000 5 T5(s)
The fact that the zeros are shifted to some consistently non-zero value is what makes me suspect that some variable has not been initialised to zero.
Miscellaneous ves errors
There are then the following differences in other tests that appear in the ves repository:
ves/rt-bf-fourier/
Diff for bf.derivs-outside.data:
367c367
< 4.000000 0.000000 0.000000 -0.785398 0.000000 1.570796 0.000000 -2.356194 0.000000 3.141593 0.000000 -3.926991 0.000000 4.712389 0.000000 -5.497787 0.000000 6.283185 0.000000 -7.068583 0.000000 7.853982
ves/rt-bf-sine/
Diff for bf.derivs-outside.data:
367c367
< 4.000000 0.000000 -0.785398 1.570796 -2.356194 3.141593 -3.926991 4.712389 -5.497787 6.283185 -7.068583 7.853982
ves/rt-le-2d-legendre-scaled-welltempered/
Diff for fes.ves1.iter-10.data:
2622c2622
< 0.753982237 -1.570796327 39.115531863
ves/rt-le-2d-uniform-2/
Diff for bias.ves1.iter-10.data:
8310c8310
< -2.010619298 1.151917306 7.125989083 -101.862535156 -15.695990206
ves/rt-le-2d-uniform/
Diff for bias.ves1.iter-10.data:
8310c8310
< -2.010619298 1.151917306 7.125989083 -101.862535156 -15.695990206
ves/rt-output-fes-2d-targetdist/
Diff for fes.ves1.iter-10.data:
6967c6967
< 2.450442270 1.130973355 84.571328011
ves/rt-td-generalized-extreme-value/
Diff for targetdist-5.data:
506c506
< 40.000000 0.000000
Other Miscellaneous problems
There are then these other miscellaneous problems:
asic/rt-make-1:
Diff for logfile:
19c19
< Shifts 1.0
basic/rt-make-exceptions:
ERROR: I cannot find file nonexist.cpp
python/rt-protein:
+++ Loading the PLUMED kernel runtime +++
+++ PLUMED_KERNEL="/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2//src/lib/libplumedKernel.dylib" +++
+++ An error occurred. Message from dlopen(): dlopen(/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2//src/lib/libplumedKernel.dylib, 0x000A): tried: '/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2//src/lib/libplumedKernel.dylib' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2//src/lib/libplumedKernel.dylib' (no such file), '/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2//src/lib/libplumedKernel.dylib' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2/src/lib/libplumedKernel.dylib' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2/src/lib/libplumedKernel.dylib' (no such file), '/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2/src/lib/libplumedKernel.dylib' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')) +++
+++ This error is expected if you are trying to load a kernel <=2.4
+++ Trying /Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2//src/lib/libplumed.dylib +++
+++ An error occurred. Message from dlopen(): dlopen(/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2//src/lib/libplumed.dylib, 0x000A): tried: '/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2//src/lib/libplumed.dylib' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2//src/lib/libplumed.dylib' (no such file), '/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2//src/lib/libplumed.dylib' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2/src/lib/libplumed.dylib' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2/src/lib/libplumed.dylib' (no such file), '/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2/src/lib/libplumed.dylib' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')) +++
Traceback (most recent call last):
File "/Users/garethtribello/Projects/CVception/Clean-version/New-new-version/plumed2/regtest/python/rt-protein/tmp/./python-script.py", line 86, in
p = plumed.Plumed()
File "plumed.pyx", line 87, in plumed.Plumed.init
raise RuntimeError("PLUMED not available, check your PLUMED_KERNEL environment variable")
RuntimeError: PLUMED not available, check your PLUMED_KERNEL environment variable
FAILURE
FILE logfile does not exist
Unexamined problems
I still need to look into what is going on with:
funnel/rt-sphere
crystallization/rt-averaged-q6-spAspB/
crystallization/rt-wcsurface2/
Target release
I would like my code to appear in release v2.10
Type of contribution
Copyright
COPYRIGHT
file with the correct license information. Code should be released under an open source license. I also used the commandcd src && ./header.sh mymodulename
in order to make sure the headers of the module are correct.Tests