Difference in intersection points between CPU and CUDA navigator #192

Open
beomki-yeo opened this issue Feb 3, 2022 · 8 comments
Labels: help wanted (Extra attention is needed), priority: low (Low priority), question (Further information is requested)

@beomki-yeo (Collaborator)

I will ask about this in tomorrow's meeting, but let me write it down here as well.
In this unit test, I compared the intersection points in each volume between the CPU and the CUDA navigator.

Because I disabled the fast-math mode in CUDA, I expected to get identical intersection points from the two.
However, they are not exactly the same (they only agree within floating-point error).

Is this expected behaviour, or could it come from incorrect usage of mathematical functions in either detray or algebra-plugins?

@beomki-yeo changed the title from "Difference in intersection points between CPU and CUDA" to "Difference in intersection points between CPU and CUDA navigator" on Feb 3, 2022
@beomki-yeo (Collaborator, Author) commented Feb 3, 2022

BTW, I can't see @cgleggett in the Assignees list, so let me ping you directly.

@stephenswat (Member)

In principle, IEEE 754 floating-point arithmetic is not associative, so if the order of operations differs in any way between the CUDA implementation and the CPU implementation, this behaviour is to be expected. For example, if the CPU version looks like this:

float dot(vector3 a, vector3 b) {
    // Sum grouped as x + (y + z)
    return (a.x * b.x) + ((a.y * b.y) + (a.z * b.z));
}

...and the CUDA version looks like this:

float dot(vector3 a, vector3 b) {
    // Sum grouped as (x + y) + z
    return ((a.x * b.x) + (a.y * b.y)) + (a.z * b.z);
}

...then, by the rules of floating-point arithmetic, it is expected that they return different values. In addition, there are certain unsafe floating-point optimisations that GCC and CUDA perform which are not compliant with the IEEE 754 standard, and which can also introduce discrepancies like this. Check out this page for more information. Those optimisations should be safe, but if we're talking about extremely small errors, I suppose they could be a factor.
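
To make this concrete, here is a minimal, self-contained sketch (the values are deliberately contrived and not taken from detray) showing that the two groupings above can produce very different results:

#include <cstdio>

struct vector3 { float x, y, z; };

// Sum grouped as x + (y + z), like the hypothetical CPU version above.
float dot_right(vector3 a, vector3 b) {
    return (a.x * b.x) + ((a.y * b.y) + (a.z * b.z));
}

// Sum grouped as (x + y) + z, like the hypothetical CUDA version above.
float dot_left(vector3 a, vector3 b) {
    return ((a.x * b.x) + (a.y * b.y)) + (a.z * b.z);
}

int main() {
    // The two 1e20f terms cancel exactly, but whether the 1.0f survives the
    // summation depends on the grouping: -1e20f + 1.0f rounds back to -1e20f.
    vector3 a{1.0e20f, -1.0e20f, 1.0f};
    vector3 b{1.0f, 1.0f, 1.0f};
    std::printf("%g vs %g\n", dot_right(a, b), dot_left(a, b));  // prints 0 vs 1
    return 0;
}

In the navigator the differences are of course only at the level of floating-point error, as you observed, but the mechanism is the same.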

If you really want bit-by-bit identical results, then you should:

  1. Make sure that the order of operations, as well as the order of operands, is identical between all implementations of the algebra code.
  2. Make sure that the CUDA code does not use its own built-in math intrinsics.
  3. Compile both programs with full IEEE 754 compliance, by using -frounding-math -fsignaling-nans on GCC, and whatever the corresponding flags are for both NVCC's host compiler and its device compiler.

That said, I would strongly recommend dropping the bit-wise equality requirement and going for an epsilon-based comparison instead. Lesson one of floating-point arithmetic is that exact equality is meaningless for floating-point numbers. Remember that, as long as you're propagating error in the same way, the CPU version and the CUDA version are both biased, probably by roughly equal amounts; the issue is just that they are not biased in the same direction. Fundamentally, both of them are equally inexact with respect to the true result.
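
For reference, a minimal sketch of what such an epsilon-based comparison could look like (the tolerance values are placeholders and would need to be tuned to the scale of the intersection coordinates):

#include <algorithm>
#include <cmath>

// Combined absolute/relative tolerance: the absolute term handles values
// near zero, the relative term handles large magnitudes. The default
// tolerances below are placeholders, not values taken from detray.
bool approx_equal(float a, float b,
                  float abs_tol = 1e-6f, float rel_tol = 1e-5f) {
    return std::fabs(a - b) <=
           std::max(abs_tol, rel_tol * std::max(std::fabs(a), std::fabs(b)));
}

If the unit test uses Google Test, this roughly amounts to replacing EXPECT_EQ with EXPECT_NEAR (or EXPECT_FLOAT_EQ, which tolerates a difference of up to 4 ULPs).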

@stephenswat (Member)

Also, just to point out another problem with requiring bit-wise equality: it reduces your numerical precision in many cases. If you turn off fast math, you're turning off fused multiply-add, and fused multiply-add gives you higher numerical precision than a separate multiply and add, because you only incur a rounding error once instead of twice (once for the multiplication and once for the addition).
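
As a small illustration (a standalone sketch, not detray code): std::fma rounds only once, while a separate multiply and add rounds twice, so the two can differ in the last bits. Compile with FP contraction disabled (e.g. -ffp-contract=off) to see the non-fused behaviour:

#include <cmath>
#include <cstdio>

int main() {
    const double eps = 1.0 / (1 << 30);  // 2^-30, exactly representable
    double a = 1.0 + eps;
    double b = 1.0 - eps;
    double c = -1.0;

    // a * b = 1 - 2^-60, which rounds to exactly 1.0 when the product is
    // rounded on its own, so the separate version loses the small residual.
    double separate = a * b + c;        // two roundings (unless contracted)
    double fused = std::fma(a, b, c);   // a * b + c with a single rounding

    // Typically prints: separate = 0, fused = -8.67362e-19 (i.e. -2^-60).
    std::printf("separate = %g, fused = %g\n", separate, fused);
    return 0;
}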

@beomki-yeo (Collaborator, Author) commented Feb 3, 2022

> If you turn off fast math, you're turning off fused multiply-add, and fused multiply-add gives you higher numerical precision than a separate multiply and add

Isn't that case-by-case? Can it be generalized to all cases?
Either way, I will drop this option for the benchmark tests. The unit test fails without this option, so I didn't have a choice...

@beomki-yeo (Collaborator, Author) commented Feb 3, 2022

> Make sure that the order of operations, as well as the order of operands, is identical between all implementations of the algebra code.

The order of operations should be the same. I will investigate the second and third points you mentioned.

> Lesson one of floating-point arithmetic is that exact equality is meaningless for floating-point numbers

You are right. But what I am worried about is that the same track can pass through different volumes on the CPU and on CUDA due to the different floating-point arithmetic.

@HadrienG2 commented Feb 4, 2022

FWIW, I discovered some time ago that unlike clang, GCC does generate FMAs by default even with fast-math off. The argument is that FMA increases precision, and as far as I know, there is no way to turn this off. This can make it hard to achieve bitwise reproducibility when comparing a GCC-like compiler with a clang-like compiler.

In the clang case, what you can do is turn on FMA contraction without the rest of the fast-math zoo with -ffp-contract=on or #pragma STDC FP_CONTRACT ON. But nvcc may have no equivalent option. And in any case, there is still a (small) chance that clang and GCC will take different FMA contraction decisions during the optimization process.

However, most of the time you're shielded from this mess because, like all x86 compilers, GCC by default generates code for older CPUs that don't support FMA, unless you turned it on with something like -march=native or -mfma. So I don't think that's the issue here.

Another possibility, which I think is more likely to be the source of the issue here, is differences in libm output. Unlike basic floating-point operators and sqrt, higher-level math operations like exp, log, pow, sin and cos are not guaranteed to be bitwise reproducible across implementations due to the table maker's dilemma. So if you use any of these operations, you can pretty much give up on CPU/GPU perfect bitwise FP reproducibility.

As for this point...

> what I am worried about is that the same track can pass through different volumes on the CPU and on CUDA due to the different floating-point arithmetic

...you "just" need to incorporate it into your approximate reproducibility criterion. For example, in a propagator test, you could do this by generating lots of tracks and comparing histograms of final track parameters using something like a Kolmogorov-Smirnov test.
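
For what it's worth, a minimal sketch of such a statistical comparison (plain C++, no detray API; the acceptance threshold is the standard asymptotic 5% critical value, not something tuned for this use case): collect one final track parameter from both backends over many tracks, then compare the two samples with a two-sample Kolmogorov-Smirnov statistic:

#include <algorithm>
#include <cmath>
#include <vector>

// Two-sample Kolmogorov-Smirnov statistic: the maximum distance between the
// empirical CDFs of the two samples. Assumes continuous data (exact ties
// between the samples are not handled specially), which is fine for
// floating-point track parameters.
double ks_statistic(std::vector<double> a, std::vector<double> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    double d = 0.0;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] <= b[j]) { ++i; } else { ++j; }
        const double cdf_a = static_cast<double>(i) / a.size();
        const double cdf_b = static_cast<double>(j) / b.size();
        d = std::max(d, std::fabs(cdf_a - cdf_b));
    }
    return d;
}

// Asymptotic acceptance criterion at roughly 5% significance:
// D < 1.36 * sqrt((n + m) / (n * m)).
bool ks_compatible(const std::vector<double>& cpu, const std::vector<double>& cuda) {
    const double n = static_cast<double>(cpu.size());
    const double m = static_cast<double>(cuda.size());
    return ks_statistic(cpu, cuda) < 1.36 * std::sqrt((n + m) / (n * m));
}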

Or, less aggressively (but likely more complex from a code point of view), you could explicitly account for the possibility that a track could end up in a neighbouring volume in your test.
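
A hedged sketch of what that acceptance check could look like; the volume indices and the neighbour list are hypothetical placeholders, not the actual detray interface:

#include <algorithm>
#include <vector>

// Hypothetical helper: accept the CUDA result if it ended up in the same
// volume as the CPU result, or in one of that volume's neighbours. How the
// neighbour list is obtained from the detector geometry is left open here.
bool volumes_compatible(unsigned int cpu_volume, unsigned int cuda_volume,
                        const std::vector<unsigned int>& cpu_volume_neighbours) {
    if (cpu_volume == cuda_volume) {
        return true;
    }
    return std::find(cpu_volume_neighbours.begin(), cpu_volume_neighbours.end(),
                     cuda_volume) != cpu_volume_neighbours.end();
}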

@stephenswat (Member)

You could also try adding the -mno-fma option (and -mno-fma4) to see if that helps.

@beomki-yeo (Collaborator, Author)

@HadrienG2 @stephenswat Thanks for the suggestions - I will keep this issue open and come back to it later.

@beomki-yeo added the help wanted and question labels on Feb 9, 2022
@beomki-yeo added the priority: low label on Apr 26, 2023