Difference in intersection points between CPU and CUDA navigator #192

Open
beomki-yeo opened this issue Feb 3, 2022 · 8 comments
Labels: help wanted (Extra attention is needed), priority: low (Low priority), question (Further information is requested)

@beomki-yeo (Collaborator)

I will ask about this in tomorrow's meeting, but let me write it down here as well.
In this unit test, I compared the intersection points in each volume between the CPU and the CUDA navigator.

Because I disabled the fast-math mode in CUDA, I expected to get identical intersection points from the two.
However, they are not exactly the same (they only agree within floating-point error).

Is this expected behaviour, or could it come from incorrect usage of mathematical functions in either detray or algebra-plugins?

@beomki-yeo changed the title from "Difference in intersection points between CPU and CUDA" to "Difference in intersection points between CPU and CUDA navigator" on Feb 3, 2022
@beomki-yeo (Collaborator, Author) commented Feb 3, 2022

BTW, I can't see @cgleggett in the Assignees list, so let me ping you directly.

@stephenswat (Member)

In principle, IEEE 754 floating-point arithmetic is not associative, so if the order of operations differs in any way between the CUDA implementation and the CPU implementation, this behaviour is to be expected. For example, if the CPU version looks like this:

float dot(vector3 a, vector3 b) {
    // Sum grouped as x + (y + z)
    return (a.x * b.x) + ((a.y * b.y) + (a.z * b.z));
}

...and the CUDA version looks like this:

float dot(vector3 a, vector3 b) {
    // Sum grouped as (x + y) + z
    return ((a.x * b.x) + (a.y * b.y)) + (a.z * b.z);
}

...then, by the rules of floating-point arithmetic, it is expected that they return different values. In addition, there are certain unsafe floating-point optimisations that GCC and CUDA perform which are not compliant with the IEEE 754 standard, and which can also introduce discrepancies like this. Check out this page for more information. Those optimisations should be safe, but if we're talking about extremely small errors, I suppose they could be a factor.
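
To make this concrete, here is a minimal, self-contained sketch (the values are deliberately contrived and not taken from detray) showing that the two groupings above can produce very different results:

#include <cstdio>

struct vector3 { float x, y, z; };

// Sum grouped as x + (y + z), like the hypothetical CPU version above.
float dot_right(vector3 a, vector3 b) {
    return (a.x * b.x) + ((a.y * b.y) + (a.z * b.z));
}

// Sum grouped as (x + y) + z, like the hypothetical CUDA version above.
float dot_left(vector3 a, vector3 b) {
    return ((a.x * b.x) + (a.y * b.y)) + (a.z * b.z);
}

int main() {
    // The two 1e20f terms cancel exactly, but whether the 1.0f survives the
    // summation depends on the grouping: -1e20f + 1.0f rounds back to -1e20f.
    vector3 a{1.0e20f, -1.0e20f, 1.0f};
    vector3 b{1.0f, 1.0f, 1.0f};
    std::printf("%g vs %g\n", dot_right(a, b), dot_left(a, b));  // prints 0 vs 1
    return 0;
}

In the navigator the differences are of course only at the level of floating-point error, as you observed, but the mechanism is the same.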

If you really want bit-by-bit identical results, then you should:

  1. Make sure that the order of operations, as well as the order of operands, is identical between all implementations of the algebra code.
  2. Make sure that the CUDA code does not use its own built-in math intrinsics.
  3. Compile both programs with full IEEE 754 compliance, by using -frounding-math -fsignaling-nans on GCC, and whatever the corresponding flags are for both NVCC's host compiler and its device compiler.

That said, I would strongly recommend dropping the bit-wise equality requirement and going for an epsilon-based comparison instead. Lesson one of floating-point arithmetic is that exact equality is meaningless for floating-point numbers. Remember that, as long as you're propagating error in the same way, the CPU version and the CUDA version are both biased, probably by roughly equal amounts; the issue is just that they are not biased in the same direction. Fundamentally, both of them are equally inexact with respect to the true result.
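
For reference, a minimal sketch of what such an epsilon-based comparison could look like (the tolerance values are placeholders and would need to be tuned to the scale of the intersection coordinates):

#include <algorithm>
#include <cmath>

// Combined absolute/relative tolerance: the absolute term handles values
// near zero, the relative term handles large magnitudes. The default
// tolerances below are placeholders, not values taken from detray.
bool approx_equal(float a, float b,
                  float abs_tol = 1e-6f, float rel_tol = 1e-5f) {
    return std::fabs(a - b) <=
           std::max(abs_tol, rel_tol * std::max(std::fabs(a), std::fabs(b)));
}

If the unit test uses Google Test, this roughly amounts to replacing EXPECT_EQ with EXPECT_NEAR (or EXPECT_FLOAT_EQ, which tolerates a difference of up to 4 ULPs).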

@stephenswat (Member)

Also, just to point out another problem with requiring bit-wise equality: it reduces your numerical precision in many cases. If you turn off fast math, you're turning off fused multiply-add, and fused multiply-add gives you higher numerical precision than a separate multiply and add, because you only incur a rounding error once instead of twice (once for the multiplication and once for the addition).
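
As a small illustration (a standalone sketch, not detray code): std::fma rounds only once, while a separate multiply and add rounds twice, so the two can differ in the last bits. Compile with FP contraction disabled (e.g. -ffp-contract=off) to see the non-fused behaviour:

#include <cmath>
#include <cstdio>

int main() {
    const double eps = 1.0 / (1 << 30);  // 2^-30, exactly representable
    double a = 1.0 + eps;
    double b = 1.0 - eps;
    double c = -1.0;

    // a * b = 1 - 2^-60, which rounds to exactly 1.0 when the product is
    // rounded on its own, so the separate version loses the small residual.
    double separate = a * b + c;        // two roundings (unless contracted)
    double fused = std::fma(a, b, c);   // a * b + c with a single rounding

    // Typically prints: separate = 0, fused = -8.67362e-19 (i.e. -2^-60).
    std::printf("separate = %g, fused = %g\n", separate, fused);
    return 0;
}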

@beomki-yeo (Collaborator, Author) commented Feb 3, 2022

> If you turn off fast math, you're turning off fused multiply-add, and fused multiply-add gives you higher numerical precision than a separate multiply and add

Isn't that case-by-case? Can it be generalized to all cases?
Either way, I will drop this option for the benchmark tests. The unit test fails without this option, so I didn't have a choice...

@beomki-yeo (Collaborator, Author) commented Feb 3, 2022

> Make sure that the order of operations, as well as the order of operands, is identical between all implementations of the algebra code.

The order of operations should be the same. I will investigate the second and third points you mentioned.

> Lesson one of floating-point arithmetic is that exact equality is meaningless for floating-point numbers

You are right. But what I am worried about is that the same track can pass through different volumes on the CPU and on CUDA due to the different floating-point arithmetic.

@HadrienG2 commented Feb 4, 2022

FWIW, I discovered some time ago that unlike clang, GCC does generate FMAs by default even with fast-math off. The argument is that FMA increases precision, and as far as I know, there is no way to turn this off. This can make it hard to achieve bitwise reproducibility when comparing a GCC-like compiler with a clang-like compiler.

In the clang case, what you can do is turn on FMA contraction without the rest of the fast-math zoo with -ffp-contract=on or #pragma STDC FP_CONTRACT ON. But nvcc may have no equivalent option. And in any case, there is still a (small) chance that clang and GCC will take different FMA contraction decisions during the optimization process.

However, most of the time you're shielded from this mess because, like all x86 compilers, GCC by default generates code for older CPUs that don't support FMA, unless you turned it on with something like -march=native or -mfma. So I don't think that's the issue here.

Another possibility, which I think is more likely to be the source of the issue here, is differences in libm output. Unlike basic floating-point operators and sqrt, higher-level math operations like exp, log, pow, sin and cos are not guaranteed to be bitwise reproducible across implementations due to the table maker's dilemma. So if you use any of these operations, you can pretty much give up on CPU/GPU perfect bitwise FP reproducibility.

As for this point...

> what I am worried about is that the same track can pass through different volumes on the CPU and on CUDA due to the different floating-point arithmetic

...you "just" need to incorporate it into your approximate reproducibility criterion. For example, in a propagator test, you could do this by generating lots of tracks and comparing histograms of final track parameters using something like a Kolmogorov-Smirnov test.
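
For what it's worth, a minimal sketch of such a statistical comparison (plain C++, no detray API; the acceptance threshold is the standard asymptotic 5% critical value, not something tuned for this use case): collect one final track parameter from both backends over many tracks, then compare the two samples with a two-sample Kolmogorov-Smirnov statistic:

#include <algorithm>
#include <cmath>
#include <vector>

// Two-sample Kolmogorov-Smirnov statistic: the maximum distance between the
// empirical CDFs of the two samples. Assumes continuous data (exact ties
// between the samples are not handled specially), which is fine for
// floating-point track parameters.
double ks_statistic(std::vector<double> a, std::vector<double> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    double d = 0.0;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] <= b[j]) { ++i; } else { ++j; }
        const double cdf_a = static_cast<double>(i) / a.size();
        const double cdf_b = static_cast<double>(j) / b.size();
        d = std::max(d, std::fabs(cdf_a - cdf_b));
    }
    return d;
}

// Asymptotic acceptance criterion at roughly 5% significance:
// D < 1.36 * sqrt((n + m) / (n * m)).
bool ks_compatible(const std::vector<double>& cpu, const std::vector<double>& cuda) {
    const double n = static_cast<double>(cpu.size());
    const double m = static_cast<double>(cuda.size());
    return ks_statistic(cpu, cuda) < 1.36 * std::sqrt((n + m) / (n * m));
}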

Or, less aggressively (but likely more complex from a code point of view), you could explicitly account for the possibility that a track could end up in a neighbouring volume in your test.
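
A hedged sketch of what that acceptance check could look like; the volume indices and the neighbour list are hypothetical placeholders, not the actual detray interface:

#include <algorithm>
#include <vector>

// Hypothetical helper: accept the CUDA result if it ended up in the same
// volume as the CPU result, or in one of that volume's neighbours. How the
// neighbour list is obtained from the detector geometry is left open here.
bool volumes_compatible(unsigned int cpu_volume, unsigned int cuda_volume,
                        const std::vector<unsigned int>& cpu_volume_neighbours) {
    if (cpu_volume == cuda_volume) {
        return true;
    }
    return std::find(cpu_volume_neighbours.begin(), cpu_volume_neighbours.end(),
                     cuda_volume) != cpu_volume_neighbours.end();
}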

@stephenswat (Member)

You could also try adding the -mno-fma option (and -mno-fma4) to see if that helps.

@beomki-yeo (Collaborator, Author)

@HadrienG2 @stephenswat Thanks for the suggestions - I will keep this issue open and come back to it later.

@beomki-yeo added the help wanted and question labels on Feb 9, 2022
@beomki-yeo added the priority: low label on Apr 26, 2023