Incorrect results on Intel Xeon Gold 6448H due to OpenBLAS <0.3.21 #836

Open
brisk022 opened this issue Jul 25, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@brisk022

brisk022 commented Jul 25, 2024

Container image name

All images with OpenBLAS 0.3.20, e.g. rocker/r-ver:4.4.0

Container image digest

No response

What operating system are you seeing the problem on?

Linux

System information

The problem appears only on Intel Xeon Gold 6448H. Older Intel CPUs as well as AMD CPUs seem to be unaffected.

Bug description

The upcoming switch to Ubuntu 24.04 should fix this bug, since OpenBLAS >=0.3.21 works fine. However, I decided to report it because the problem is rather egregious: all calculations complete without any error messages, but the results are wrong. The root cause is also very hard to find. We have a heterogeneous cluster, and a user reported occasional problems with PCA results in Seurat.

How to reproduce this bug?

The following code is taken from the R test suite.

library(splines)
d1 <- c(616.1, 570.1, 523.7, 477.3, 431.3, 386.2, 342.4, 300.4, 260.4,
        222.7, 187.8, 155.7, 126.7, 100.8,  78.1,  58.6,  42.2,  28.7,
         18.1,  10.2)
r1 <- c(104.4, 110  , 115.5, 121,   126.6, 132.1, 137.7, 143.2, 148.8,
        154.3, 159.9, 165.4, 170.9, 176.5, 182,   187.6, 193.1, 198.7,
        204.2, 209.8)
sp1 <- interpSpline(r1,d1)# 'x' as function of 'y' (!)
psp1 <- predict(sp1)
bsp1 <- backSpline(sp1)
dy <- diff(predict(bsp1, .5 + 18:30)$y)
dy

It should produce

[1] -0.5877246 -0.5627481 -0.5377715 -0.5127950 -0.4878185 -0.4628420
[7] -0.4378654 -0.4128889 -0.3879124 -0.3629359 -0.8722885 -0.4569234

On Intel Xeon Gold 6448H, it produces

[1] -0.6804002 -0.6745320 -0.6686639 -0.6627957 -0.6569276 -0.6510594
[7] -0.6451912 -0.6393231 -0.6334549 -0.6275868  0.9412817 -0.4788129
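
For anyone who wants to check a node without comparing the numbers by eye, here is a minimal sketch that reruns the reproducer and compares `dy` with the expected values listed above. The names `expected_dy` and `matches_reference` and the tolerance are my own choices for illustration, not part of the R test suite.

```r
library(splines)

# Reference values copied from the expected output in this report.
expected_dy <- c(-0.5877246, -0.5627481, -0.5377715, -0.5127950,
                 -0.4878185, -0.4628420, -0.4378654, -0.4128889,
                 -0.3879124, -0.3629359, -0.8722885, -0.4569234)

# Hypothetical helper: TRUE if dy agrees with the reference values.
# The printed reference has 7 significant digits, hence the loose tolerance.
matches_reference <- function(dy, tolerance = 1e-6) {
  isTRUE(all.equal(expected_dy, dy, tolerance = tolerance))
}

# Same data and calls as in the reproducer.
d1 <- c(616.1, 570.1, 523.7, 477.3, 431.3, 386.2, 342.4, 300.4, 260.4,
        222.7, 187.8, 155.7, 126.7, 100.8,  78.1,  58.6,  42.2,  28.7,
         18.1,  10.2)
r1 <- c(104.4, 110  , 115.5, 121,   126.6, 132.1, 137.7, 143.2, 148.8,
        154.3, 159.9, 165.4, 170.9, 176.5, 182,   187.6, 193.1, 198.7,
        204.2, 209.8)
sp1 <- interpSpline(r1, d1)
bsp1 <- backSpline(sp1)
dy <- diff(predict(bsp1, .5 + 18:30)$y)

if (matches_reference(dy)) {
  message("dy matches the reference values")
} else {
  warning("dy differs from the reference values; this host may be affected")
}
```

Something like this could run as a quick node health check before scheduling jobs on a mixed cluster.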
@brisk022 brisk022 added the bug Something isn't working label Jul 25, 2024
@benz0li
Contributor

benz0li commented Jul 25, 2024

@brisk022 Is there a related issue at https://github.com/OpenMathLib/OpenBLAS?

@brisk022
Author

Unfortunately, I did not find anything that would look related. However, I do not know hardware specifications very well. So, I might have missed the obvious.

@nathanweeks

I can reproduce on a different Sapphire Rapids model.

Setting the environment variable OPENBLAS_CORETYPE=SKYLAKEX (see the OpenBLAS Usage guide and TargetList.txt) results in the first set of (I assume more-correct) results.
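
A rough sketch of trying this from within R, in case it helps others test the workaround. As far as I understand, OpenBLAS picks its kernels when the library is loaded, so calling `Sys.setenv()` in an already running session is probably too late; the sketch therefore starts a fresh R process with the variable set. `run_child` is a hypothetical helper name, and this assumes a Unix-like system where `Rscript` sits next to the running R.

```r
# Path to the Rscript binary that belongs to the running R installation.
rscript <- file.path(R.home("bin"), "Rscript")

# Hypothetical helper: evaluate an expression in a fresh R process,
# optionally with extra NAME=value environment variables, and return
# the lines the child printed to stdout.
run_child <- function(expr, env = character()) {
  system2(rscript, c("-e", shQuote(expr)), env = env, stdout = TRUE)
}

# Confirm that the child process actually sees the variable.
run_child('writeLines(Sys.getenv("OPENBLAS_CORETYPE"))',
          env = "OPENBLAS_CORETYPE=SKYLAKEX")
```

The same helper can be pointed at the reproducer, e.g. `run_child('source("repro.R")', env = "OPENBLAS_CORETYPE=SKYLAKEX")`, where `repro.R` is a placeholder for a file holding the code from the report. For containers, passing the variable at start-up (for example with `docker run -e OPENBLAS_CORETYPE=SKYLAKEX ...`) should have the same effect.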

@benz0li
Contributor

benz0li commented Aug 30, 2024

@nathanweeks Great find!

@nathanweeks

FWIW this issue is reproducible on rocker/r-ver:4.2.2 (Ubuntu 22.04 / OpenBLAS 0.3.20 / gcc 11.4.0), but rocker/r-ver:4.2.1 (Ubuntu 20.04 / OpenBLAS 0.3.8 / gcc 9.4.0) is not affected.
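
To make reports from other images and hosts easier to compare, a small sketch for printing which BLAS/LAPACK the running R uses together with the CPU model. `cpu_model` and `blas_info` are hypothetical helper names; reading `/proc/cpuinfo` assumes Linux. On the Ubuntu-based images the BLAS path usually contains the OpenBLAS version in the file name, but I have not checked that everywhere.

```r
# Hypothetical helper: first "model name" entry from /proc/cpuinfo (Linux only).
cpu_model <- function(path = "/proc/cpuinfo") {
  if (!file.exists(path)) return(NA_character_)
  lines <- readLines(path, warn = FALSE)
  hit <- grep("^model name", lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  # Drop everything up to and including the first colon.
  trimws(sub("^[^:]*:", "", hit[1]))
}

# Hypothetical helper: collect the details relevant to this issue.
blas_info <- function() {
  list(
    blas           = extSoftVersion()[["BLAS"]],  # path of the BLAS library in use
    lapack         = La_library(),                # path of the LAPACK library in use
    lapack_version = La_version(),
    cpu            = cpu_model()
  )
}

str(blas_info())
```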

@nathanweeks

rocker/r-ver:4.4.1 is affected by this issue, but rocker/r-ver:4.4.2 (Ubuntu 24.04 / OpenBLAS 0.3.26) is not.


3 participants