Multigrid on GPU yields different results to CPU #52
I asked the LLM Qwen2 and got some hints for tracking down the problem, which might make sense: The differences in the convergence history between the CPU and GPU runs could be attributed to several factors, particularly those related to the linear solvers and preconditioners used in your PETSc configuration. Here are some potential causes and suggestions to troubleshoot and potentially resolve the issue:
To address these issues, consider the following steps:
By systematically addressing these potential causes, you should be able to identify and mitigate the differences in the convergence history between the CPU and GPU runs.
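Before changing anything, it may help to confirm that both runs really construct the same solver hierarchy. A minimal sketch of standard PETSc diagnostic options (generic suggestions, not the options used in this issue):

```
# Generic PETSc diagnostics (not the exact options from this issue)
-ksp_view                     # print the KSP/PC configuration actually constructed
-ksp_monitor_true_residual    # track the true (unpreconditioned) residual norm
-options_left                 # warn about options that were set but never used
-log_view                     # timing summary; confirms the GPU back end is active
```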
These steps seem reasonable to me. Did you have a chance to try any of them?
Actually, some of these aspects seem to have been discussed in previous tandem meetings. I am not sure an LLM is a better place to bounce ideas off than a discussion with the team.
@hpc4geo following discussion on Slack:
Starting from KSP iteration 1, CPU and GPU differ.
The GPU version doesn't use I-node optimizations. Tried:
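The option list after "Tried" was not captured in this transcript. For background only, and not necessarily the exact option that was tried: the CPU-side I-node optimization in PETSc's AIJ matrices can be switched off at runtime, which brings the CPU path closer to what the GPU back end does. A minimal sketch:

```
# Background sketch: disable the I-node optimization for CPU AIJ matrices
# so the CPU kernels are closer to the GPU code path (not taken from the thread).
-mat_no_inode
```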
@Thomas-Ulrich
Test 1: Run with the above options (CPU)
Okay, I am starting to see the big picture of what is going wrong. The problem is a bad interaction between the smoother and the coarse solver. Here are two supporting observations:
The last test to try for the moment (however I suspect the outcome will be the same) is using the following options
Test 3: Run with the above options (CPU)
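The concrete option sets referred to in these tests were lost from this transcript. Purely as an illustration of the kind of PCMG smoother / coarse-solver options being exercised (my assumption, not the options actually used), such a set looks roughly like:

```
# Illustrative PCMG layout only; the actual test options were not captured here.
-pc_type mg
-mg_levels_ksp_type chebyshev    # smoother KSP on each multigrid level
-mg_levels_pc_type jacobi        # smoother PC on each multigrid level
-mg_coarse_ksp_type preonly      # coarse-level solve
-mg_coarse_pc_type lu            # e.g. a direct factorization on the coarse level
```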
Here you go:
Here are the new logs:
Results from a HIP build on LUMI-G indicate there is no problem at all.
CPU run:
GPU run:
Comparing the convergence behavior, they are identical. Output files are attached (along with the options file).
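As an aside, assuming the logs were generated with -ksp_monitor, one quick way to confirm that the two histories really are identical is to diff the residual lines:

```
# Compare residual histories from the attached logs (assumes -ksp_monitor output)
grep "KSP Residual norm" tandem_CPU.log > cpu_res.txt
grep "KSP Residual norm" tandem_GPU.log > gpu_res.txt
diff cpu_res.txt gpu_res.txt && echo "convergence histories match"
```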
One conclusion might be that the Spack build @Thomas-Ulrich is running on the test machine is broken.
Hi Dave,
So I changed tandem as follows for HIP: I compiled on LUMI with Spack, then used an interactive node, as you detailed on Slack:
Then when I run, I get:
I need to test with ROCm 6 (instead of 5.6.1).
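For reference, a heavily hedged sketch of what such a Spack-based HIP build might look like; the spec, GPU target, and versions below are my assumptions, not taken from the thread:

```
# Assumed sketch of a PETSc+HIP build on LUMI-G via Spack (gfx90a = MI250X);
# the real spec, compiler, and ROCm version are not recorded in this issue.
spack install petsc@3.21.5 +rocm amdgpu_target=gfx90a
spack load petsc@3.21.5
```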
@Thomas-Ulrich You should try with the latest PETSc version (3.21.5). The HIP Mat implementations are currently tagged as "under development". The team is constantly adding support for HIP AIJ matrices (even as of v3.21.5). Hence I suspect what you see is due to a less complete implementation compared with v3.21.5 (which I used).
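As a related note (mine, not Dave's): with a recent PETSc, the HIP-backed matrix and vector implementations are requested at runtime with options along these lines, which is where the still-maturing HIP AIJ support comes into play:

```
# Sketch only: select HIP-backed types at runtime (type names as in recent PETSc releases)
-vec_type hip
-mat_type aijhipsparse   # HIP sparse AIJ implementation, still under active development
```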
Thanks, I tried with v3.21.5 and got the same problem. If you could share how you build PETSc and tandem, that would be useful.
Here is what I tried (based on your log):
and I get:
@hpc4geo: I tested the exact same setup you are running, but on heisenbug, and got the same results as you (CPU == GPU, to some extent). You can (probably) check on LUMI that the convergence issue is not solved.
edit: added the logs
@Thomas-Ulrich: The software stack is pretty complicated. It mixes many different packages, we are mixing how we build/assemble the stack, and we are also mixing and matching devices (AMD vs. NVIDIA). To try to make sense of all this, I've put everything we have found into a table.
Note that
I've added a description here #76
Right. The choices are
As for tandem, I had to hack a few things. I didn't do the job properly, hence I didn't commit these changes.
Several tandem source files containing pure PETSc code need updating for PETSc 3.21.5. We rarely use this functionality, hence I just stopped compiling them out of laziness.
I am unclear why the 1D instance was suddenly required.
@Thomas-Ulrich I re-ran your problematic setup on LUMI. Please note that the arg
CPU
GPU
The output files are attached - the results are nearly identical. Now, if you use the option
OK, I see: on heisenbug, GPU == CPU for the problematic setup if I remove the matrix-free option.
Looking backwards may not be super helpful, as so many things have changed in both the software stack and the solver options being used. The test from 1 year ago used these smoother/coarse solver options. Let's look forward and work towards resolving any issues that we have today with: the latest PETSc; current, up-to-date software stacks; and, importantly, the machines LMU promised to make tandem run on in exchange for EU funding.
@Thomas-Ulrich I've put all the LUMI-G mods (including yours) into dmay/petsc_dev_hip. I've also resolved the error encountered using [...]. Collectively, these changes enable me to run your example with a bjacobi/ilu smoother and GAMG as the coarse-grid PC and get identical convergence on the CPU and GPU.
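For readers following along, the configuration described (a bjacobi/ilu smoother with GAMG as the coarse-grid PC) corresponds roughly to an option set of this shape; this is my reconstruction for illustration, not the actual options committed to dmay/petsc_dev_hip:

```
# Reconstructed for illustration; not the options actually used in dmay/petsc_dev_hip.
-pc_type mg
-mg_levels_pc_type bjacobi
-mg_levels_sub_pc_type ilu    # ILU inside each block of the block-Jacobi smoother
-mg_coarse_pc_type gamg       # algebraic multigrid as the coarse-grid preconditioner
```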
Hi @hpc4geo,
Issue #50 identified some unexpected behavior when comparing CPU results with GPU results. The convergence history is different when the same PETSc option set is provided for an MG configuration.
Attached are the logs Thomas generated.
tandem_GPU.log
tandem_CPU.log
The differences in the residual history are most likely associated with the ILU and LU solvers. I suggest confirming this by re-running the CPU and GPU variants with the following additional options (placed at the end of any existing options).
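The additional options attached to the original report are not reproduced in this transcript. For illustration only (my assumption about the kind of options meant), replacing the ILU/LU components with simpler, backend-independent solvers isolates them as the source of the differing histories:

```
# Illustration only: swap ILU/LU for solvers whose behavior should not depend on CPU vs GPU
-mg_levels_pc_type jacobi   # pointwise Jacobi smoother instead of ILU
-mg_coarse_pc_type svd      # dense SVD-based coarse solve instead of LU
```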