rzadams multimat_test failure #1464

bmhan12 opened this issue Nov 4, 2024 · 2 comments
Labels
bug Something isn't working Reviewed

Comments

bmhan12 (Contributor) commented Nov 4, 2024

The failing line looks to be this one:

test_multimat_conversion<double, 3>(layout_from, layout_to, unifiedAllocID);

It fails only for the double data type; int and float pass.
The test passes on rzvernal and tioga.

Error message:

$ ctest -VV -R "multimat_test"

…
…
…

347: [INFO] Constructing Multimat object with layout Cell-Centric/Sparse
347:
347:    *MultiMat data was valid
347:
347: ***********************************
347: [WARNING in line 1346 of file /usr/WS1/han12/axom/src/axom/multimat/multimat.cpp]
347: MESSAGE=Multimat: cannot convert unowned field "UnownedField" to dense layout. Skipping.
347: ***********************************
347: [INFO] Converted multimat instance to layout Material-Centric/Dense
347: [INFO]
347: --------------------------------------------------
347:  Testing Multimat construction and conversion
347:  Cells: 20 Mats: 10 Data type: f Stride: 3
347: --------------------------------------------------
347:
347: [INFO] Constructing Multimat object with layout Cell-Centric/Sparse
347:
347:    *MultiMat data was valid
347: Memory access fault by GPU node-4 (Agent handle: 0x7ad150) on address (nil). Reason: Unknown.
1/1 Test #347: multimat_test ....................Subprocess aborted***Exception:   1.37 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   1.44 sec

The following tests FAILED:
        347 - multimat_test (Subprocess aborted)

bmhan12 added the bug label Nov 4, 2024

publixsubfan (Contributor) commented:

I did a cursory investigation and was able to reproduce the issue with ROCm 6.1.2 but not with ROCm 6.2.1. Based on experience with another project, we may be able to mitigate this by adding -mllvm -amdgpu-legacy-sgpr-spill-lowering=true. I still need to figure out how to pass this in for Axom, though: the option I was using to forward that flag (-Xoffload-linker) is being ignored by amdclang.

rhornung67 (Member) commented:

Keep an eye on this and check back when it makes sense.
