Difference in intersection points between CPU and CUDA navigator #192
BTW I can't see @cgleggett in the Assignees list, let me ping you directly.
In principle, IEEE 754 math is not guaranteed to be associative, so if the order of operations differs in any way between the CUDA implementation and the CPU implementation, this behaviour is to be expected. For example, if the CPU version looks like this:

```cpp
float dot(vector3 a, vector3 b) {
    return (a.x * b.x) + ((a.y * b.y) + (a.z * b.z));
}
```

...and the CUDA version looks like this:

```cpp
float dot(vector3 a, vector3 b) {
    return ((a.x * b.x) + (a.y * b.y)) + (a.z * b.z);
}
```

...then by the rules of floating-point arithmetic it is expected that they can return different values. In addition, GCC and CUDA perform certain unsafe floating-point optimisations that are not compliant with the IEEE 754 standard, which can also introduce problems like this. Check out this page for more information. These should be safe, but I suppose if we're talking about extremely small errors, this could be a factor. If you really want bit-by-bit identical results, then you should:
That said, I would strongly recommend dropping the bit-wise equality requirement and going for an epsilon-based equality instead. Lesson one of floating-point arithmetic is that exact equality is meaningless for floating-point numbers. Remember that, as long as you're propagating error in the same way, the CPU and the CUDA versions are both biased, probably by similar amounts; the issue is just that they are not biased in the same direction. Both of them are fundamentally incorrect in the same way.
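For example, a comparison along those lines could be as simple as the sketch below (the tolerance values are placeholders to be tuned to the precision you actually expect, not values taken from detray):

```cpp
#include <algorithm>
#include <cmath>

// Approximate floating-point equality with an absolute floor for values
// near zero and a relative tolerance everywhere else.
inline bool approx_equal(float a, float b,
                         float rel_tol = 1e-5f, float abs_tol = 1e-7f) {
    const float scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= std::max(abs_tol, rel_tol * scale);
}
```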
Also, just to point out another problem with this approach of requiring bit-wise equality: it reduces your numerical precision in many cases. If you turn off fast math you're also turning off fused multiply-add, and fused multiply-add gives you higher numerical precision than separate operations, because you incur a rounding error only once instead of twice (once for the multiplication and once for the addition).
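To illustrate the precision point, here is a standalone sketch (not detray code; compile with `-ffp-contract=off` on GCC/clang so the compiler does not fuse the first expression by itself):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // e is chosen so that a * b = 1 - e*e cannot be represented exactly
    // in single precision: the separate multiply rounds it to 1.0f.
    const float e = 1.0f / 8192.0f;           // 2^-13
    const float a = 1.0f + e, b = 1.0f - e, c = -1.0f;

    const float separate = a * b + c;         // product rounded, then added: 0.0
    const float fused    = std::fma(a, b, c); // rounded only once: about -1.49e-08

    std::printf("separate multiply/add: %.10g\n", separate);
    std::printf("fused multiply-add:    %.10g\n", fused);
    return 0;
}
```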
Isn't it case-by-case? Can you generalize it to all cases?
The order of operations should be the same. I would investigate the second and third points you mentioned.
You are right. But what I am worried about is that the same track from CPU and CUDA can pass through different volumes due to different floating-point arithmetic.
FWIW, I discovered some time ago that unlike clang, GCC does generate FMAs by default even with fast-math off. The argument is that FMA increases precision, and as far as I know, there is no way to turn this off. This can make it hard to achieve bitwise reproducibility when comparing a GCC-like compiler with a clang-like compiler. In the clang case, what you can do is turn on FMA contraction without the rest of the fast-math zoo with `-ffp-contract=fast`. However, most of the time you're shielded from this mess because GCC, like all x86 compilers, generates code for old CPUs that don't support FMA by default, unless you turned it on with something like `-march=native`.

Another possibility, which I think is more likely to be the source of the issue here, is differences in libm output. Unlike the basic floating-point operators and sqrt, higher-level math operations like exp, log, pow, sin and cos are not guaranteed to be bitwise reproducible across implementations, due to the table maker's dilemma. So if you use any of these operations, you can pretty much give up on perfect bitwise CPU/GPU FP reproducibility.

As for this point...

> the same track from CPU and CUDA can pass through different volumes due to different floating-point arithmetic
...you "just" need to incorporate it into your approximate reproducibility criterion. For example, in a propagator test, you could do this by generating lots of tracks and comparing histograms of final track parameters using something like a Kolmogorov-Smirnov test. Or, less aggressively (but likely more complex from a code point of view), you could explicitly account for the possibility that a track could end up in a neighbouring volume in your test. |
You could also try adding the `--fmad=false` flag on the nvcc side to disable FMA contraction in the device code.
@HadrienG2 @stephenswat Thanks for the suggestions. I will keep this issue open and come back to it later.
I will ask about it in tomorrow's meeting, but let me write it down here as well.
In this unit test, I compared the intersection points inside a volume between the CPU and CUDA navigators.
Because I disabled the fast-math mode in CUDA, I thought I would get identical intersection points from both.
But they are not exactly the same (only close to within floating-point error).
Is it supposed to be like this, or could it come from incorrect usage of mathematical functions in either detray or algebra-plugins?
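For reference, a tolerance-based variant of such a check could look roughly like the sketch below; GoogleTest is assumed, and `point3` as well as the tolerance are stand-ins rather than the actual types and values used in the test.

```cpp
#include <gtest/gtest.h>

// Stand-in for the intersection point type produced by the navigators.
struct point3 { float x, y, z; };

// Compare a CPU and a CUDA intersection point component-wise within an
// absolute tolerance, instead of requiring bitwise equality.
inline void expect_close(const point3& cpu, const point3& cuda,
                         float tol = 1e-4f) {
    EXPECT_NEAR(cpu.x, cuda.x, tol);
    EXPECT_NEAR(cpu.y, cuda.y, tol);
    EXPECT_NEAR(cpu.z, cuda.z, tol);
}
```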