
sdm.dot_product_mkl ends with segmentation fault for sparse-sparse multiplication #30

maclin726 opened this issue Nov 21, 2024 · 6 comments

@maclin726 commented Nov 21, 2024

Hi, thanks for your package for the MKL python interface. It's quite useful.
However, we encountered some unexpected results when performing sparse-sparse matrix multiplication: it sometimes leads to a segmentation fault.

A minimal code snippet to reproduce the bug:
Please first download the following two matrices (the bug only appears for certain matrices):
A.npz: https://drive.google.com/file/d/1NRT8SchOS3XefZokbFOpqJw6CIygTEQ-
B.npz: https://drive.google.com/file/d/1aFDa2BbNQRGmmlAceIjK4JoogVQfKJY_/

import scipy.sparse as sparse
import sparse_dot_mkl as sdm

sdm.mkl_set_num_threads_local(1)

A = sparse.load_npz("./A.npz") # shape: (1934, 2381304)
B = sparse.load_npz("./B.npz") # shape: (100, 1934)

print("A.shape:", A.shape, "B.shape:", B.shape)

C = sdm.dot_product_mkl(A.T, B.T) # segmentation fault
# C = sdm.dot_product_mkl(B, A).T # works normally

print(C.shape)

If the first line is called, it causes a segmentation fault.
If the second line is called instead, the segmentation fault does not happen (the two are mathematically equivalent, since (B·A)^T = A^T·B^T).
We also tried converting A to COO format first, or calling print(A) before the multiplication; with either of those, the segmentation fault does not happen (a sketch of that round trip is below).
This is quite uncomfortable because we could not find the exact reason for it, so we are turning to you for help. Thank you in advance.
(We tried this on multiple machines, and for this example it always happens.)
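For reference, a minimal sketch of the COO round trip mentioned above (paths and shapes as in the snippet; per the report, rebuilding A this way avoids the crash):

import scipy.sparse as sparse
import sparse_dot_mkl as sdm

sdm.mkl_set_num_threads_local(1)

A = sparse.load_npz("./A.npz")  # shape: (1934, 2381304)
B = sparse.load_npz("./B.npz")  # shape: (100, 1934)

# Round-trip A through COO and back to CSR; this rebuilds the
# underlying index arrays, and the segfault no longer occurs.
A = A.tocoo().tocsr()

C = sdm.dot_product_mkl(A.T, B.T)
print(C.shape)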

@asistradition (Collaborator)

I can replicate this 100% with the code and files provided, and the segfault happens inside mkl_sparse_spmm.

I can fix it 100% by copying the loaded objects once before passing them into the multiplication. Does this problem always happen after deserializing data from files?

import scipy.sparse as sparse
import sparse_dot_mkl as sdm

sdm.mkl_set_num_threads_local(1)
sdm.set_debug_mode(True)

A = sparse.load_npz("./A.npz") # shape: (1934, 2381304)
B = sparse.load_npz("./B.npz") # shape: (100, 1934)

A = sparse.csr_matrix(A, copy=True)
B = sparse.csr_matrix(B, copy=True)

C = sdm.dot_product_mkl(A.T, B.T)

@maclin726 (Author)

Thanks for your quick response.
No, I don't think the problem originates from reading the files.

import scipy.sparse as sparse
import sparse_dot_mkl as sdm

sdm.mkl_set_num_threads_local(1)

A = sparse.load_npz("./A.npz") # shape: (1934, 2381304)
B = sparse.load_npz("./B.npz") # shape: (100, 1934)

print("A.shape:", A.shape, "B.shape:", B.shape)

A[0, 0] = 0.001
sparse.save_npz("A_new.npz", A)
A = sparse.load_npz("./A_new.npz")

C = sdm.dot_product_mkl(A.T, B.T) # no segfault after the modify-and-save round trip
# C = sdm.dot_product_mkl(B, A).T # works normally

print(C.shape)

If I modify the matrix and save it again, then there is no segfault for A_new.T * B.T.

In our use case we don't save the matrices to files at all, and the segfault still happens. (We only dumped the matrices to files to see what happens.)

@asistradition (Collaborator)

I'll see what I can do. Unfortunately, even though I can replicate it 100% of the time, it's not occurring when I run it with valgrind.

Copying the indices (A.indices = A.indices.copy()) is enough to suppress the problem, which does not really help figure out the root cause at all.
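Applied to the repro, that suppression looks like this (a sketch; only the indices array of A is replaced):

import scipy.sparse as sparse
import sparse_dot_mkl as sdm

A = sparse.load_npz("./A.npz")
B = sparse.load_npz("./B.npz")

# Swapping in a fresh copy of just the indices array is enough to
# suppress the segfault here, without explaining the root cause.
A.indices = A.indices.copy()

C = sdm.dot_product_mkl(A.T, B.T)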

@maclin726 (Author)

Thanks a lot. We will tentatively use the copy trick to bypass this issue, and hope you can find the root cause one day. Thanks.

@maclin726 (Author)

Sadly, in our use case the segfault still happens for other matrices, even when A.indices = A.indices.copy() is used.
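A heavier-handed sketch, in case it helps: rebuild each matrix from fresh copies of all three underlying arrays, so nothing is shared with the original object (deep_copy_csr is a hypothetical helper name, and this is untested against the matrices that still fail):

import scipy.sparse as sparse

def deep_copy_csr(m):
    # Hypothetical helper: rebuild a CSR matrix from fresh copies of
    # data, indices and indptr so no buffer is shared with m.
    return sparse.csr_matrix(
        (m.data.copy(), m.indices.copy(), m.indptr.copy()),
        shape=m.shape,
    )

A = deep_copy_csr(sparse.load_npz("./A.npz"))
B = deep_copy_csr(sparse.load_npz("./B.npz"))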

@asistradition (Collaborator) commented Nov 27, 2024

With gdb I can see that it's segfaulting in the same place in the mkl_sparse_d_do_sp2m_i4_avx2 and mkl_sparse_d_do_sp2m_i8_avx2 routines. It's not a copy/own-array issue with the indices, because they're copied when they're cast for the i8 routines.

Helgrind suggests there's a race condition between mkl_sparse_s_convert_csr_i4_avx2 in the MKL worker thread and a numpy memmove from PyArray_NewCopy in the python thread, which might be what's segfaulting, but I can't actually get it to happen while the debugger is running. I don't know why numpy would be instantiating a copy and deallocating an array while MKL is working.

I suspect that converting the CSC matrices to CSR in python ahead of time would fix this:

C = sdm.dot_product_mkl(A.T.tocsr(), B.T.tocsr())
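Spelled out against the original repro, that pre-conversion would look roughly like this (a sketch, assuming the .npz files hold CSR matrices as the snippets imply; A.T and B.T are then CSC views, and .tocsr() materializes fresh CSR copies before MKL is involved):

import scipy.sparse as sparse
import sparse_dot_mkl as sdm

sdm.mkl_set_num_threads_local(1)

A = sparse.load_npz("./A.npz")  # shape: (1934, 2381304)
B = sparse.load_npz("./B.npz")  # shape: (100, 1934)

# Convert the transposed (CSC) views to fresh CSR matrices in python,
# so MKL receives freshly built CSR inputs rather than shared views.
C = sdm.dot_product_mkl(A.T.tocsr(), B.T.tocsr())
print(C.shape)  # expected: (2381304, 100)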
