
sdm.dot_product_mkl ends with segmentation fault for sparse-sparse multiplication #30

maclin726 opened this issue Nov 21, 2024 · 6 comments

@maclin726 commented Nov 21, 2024

Hi, thanks for your package for the MKL python interface. It's quite useful.
However, we encountered some unexpected results when performing sparse-sparse matrix multiplication: it sometimes leads to a segmentation fault.

A minimal code snippet to reproduce the bug:
Please first download the following two matrices (the bug only appears for certain matrices):
A.npz: https://drive.google.com/file/d/1NRT8SchOS3XefZokbFOpqJw6CIygTEQ-
B.npz: https://drive.google.com/file/d/1aFDa2BbNQRGmmlAceIjK4JoogVQfKJY_/

import scipy.sparse as sparse
import sparse_dot_mkl as sdm

sdm.mkl_set_num_threads_local(1)

A = sparse.load_npz("./A.npz") # shape: (1934, 2381304)
B = sparse.load_npz("./B.npz") # shape: (100, 1934)

print("A.shape:", A.shape, "B.shape:", B.shape)

C = sdm.dot_product_mkl(A.T, B.T) # segmentation fault
# C = sdm.dot_product_mkl(B, A).T # works normally

print(C.shape)

If the first line is called, it causes a segmentation fault.
If the second line is called instead, the segmentation fault does not happen (the two are mathematically equivalent, since (B·A)^T = A^T·B^T).
We also tried converting A to COO format first, or calling print(A) before the multiplication; with either of those, the segmentation fault does not happen (a sketch of that round trip is below).
This is quite uncomfortable because we could not find the exact reason for it, so we are turning to you for help. Thank you in advance.
(We tried this on multiple machines, and for this example it always happens.)
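For reference, a minimal sketch of the COO round trip mentioned above (paths and shapes as in the snippet; per the report, rebuilding A this way avoids the crash):

import scipy.sparse as sparse
import sparse_dot_mkl as sdm

sdm.mkl_set_num_threads_local(1)

A = sparse.load_npz("./A.npz")  # shape: (1934, 2381304)
B = sparse.load_npz("./B.npz")  # shape: (100, 1934)

# Round-trip A through COO and back to CSR; this rebuilds the
# underlying index arrays, and the segfault no longer occurs.
A = A.tocoo().tocsr()

C = sdm.dot_product_mkl(A.T, B.T)
print(C.shape)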

@asistradition (Collaborator)

I can replicate this 100% with the code and files provided, and the segfault happens inside mkl_sparse_spmm.

I can fix it 100% by copying the loaded objects once before passing them into the multiplication. Does this problem always happen after deserializing data from files?

import scipy.sparse as sparse
import sparse_dot_mkl as sdm

sdm.mkl_set_num_threads_local(1)
sdm.set_debug_mode(True)

A = sparse.load_npz("./A.npz") # shape: (1934, 2381304)
B = sparse.load_npz("./B.npz") # shape: (100, 1934)

A = sparse.csr_matrix(A, copy=True)
B = sparse.csr_matrix(B, copy=True)

C = sdm.dot_product_mkl(A.T, B.T)

@maclin726 (Author)

Thanks for your quick response.
No, I don't think the problem originates from reading the files.

import scipy.sparse as sparse
import sparse_dot_mkl as sdm

sdm.mkl_set_num_threads_local(1)

A = sparse.load_npz("./A.npz") # shape: (1934, 2381304)
B = sparse.load_npz("./B.npz") # shape: (100, 1934)

print("A.shape:", A.shape, "B.shape:", B.shape)

A[0, 0] = 0.001
sparse.save_npz("A_new.npz", A)
A = sparse.load_npz("./A_new.npz")

C = sdm.dot_product_mkl(A.T, B.T) # no segfault after the modify-and-save round trip
# C = sdm.dot_product_mkl(B, A).T # works normally

print(C.shape)

If I modify the matrix and save it again, then there is no segfault for A_new.T * B.T.

In our use case we don't save the matrices to files at all, and the segfault still happens. (We only dumped the matrices to files to see what happens.)

@asistradition (Collaborator)

I'll see what I can do. Unfortunately, even though I can replicate it 100% of the time, it's not occurring when I run it with valgrind.

Copying the indices (A.indices = A.indices.copy()) is enough to suppress the problem, which does not really help figure out the root cause at all.
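Applied to the repro, that suppression looks like this (a sketch; only the indices array of A is replaced):

import scipy.sparse as sparse
import sparse_dot_mkl as sdm

A = sparse.load_npz("./A.npz")
B = sparse.load_npz("./B.npz")

# Swapping in a fresh copy of just the indices array is enough to
# suppress the segfault here, without explaining the root cause.
A.indices = A.indices.copy()

C = sdm.dot_product_mkl(A.T, B.T)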

@maclin726 (Author)

Thanks a lot. We will tentatively use the copy trick to bypass this issue, and hope you can find the root cause one day. Thanks.

@maclin726 (Author)

Sadly, in our use case the segfault still happens for other matrices, even when A.indices = A.indices.copy() is used.
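A heavier-handed sketch, in case it helps: rebuild each matrix from fresh copies of all three underlying arrays, so nothing is shared with the original object (deep_copy_csr is a hypothetical helper name, and this is untested against the matrices that still fail):

import scipy.sparse as sparse

def deep_copy_csr(m):
    # Hypothetical helper: rebuild a CSR matrix from fresh copies of
    # data, indices and indptr so no buffer is shared with m.
    return sparse.csr_matrix(
        (m.data.copy(), m.indices.copy(), m.indptr.copy()),
        shape=m.shape,
    )

A = deep_copy_csr(sparse.load_npz("./A.npz"))
B = deep_copy_csr(sparse.load_npz("./B.npz"))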

@asistradition (Collaborator) commented Nov 27, 2024

With gdb I can see that it's segfaulting in the same place in the mkl_sparse_d_do_sp2m_i4_avx2 and mkl_sparse_d_do_sp2m_i8_avx2 routines. It's not a copy/own-array issue with the indices, because they're copied when they're cast for the i8 routines.

Helgrind suggests there's a race condition between mkl_sparse_s_convert_csr_i4_avx2 in the MKL worker thread and a numpy memmove from PyArray_NewCopy in the python thread, which might be what's segfaulting, but I can't actually get it to happen while the debugger is running. I don't know why numpy would be instantiating a copy and deallocating an array while MKL is working.

I suspect that converting the CSC matrices to CSR in python ahead of time would fix this:

C = sdm.dot_product_mkl(A.T.tocsr(), B.T.tocsr())
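Spelled out against the original repro, that pre-conversion would look roughly like this (a sketch, assuming the .npz files hold CSR matrices as the snippets imply; A.T and B.T are then CSC views, and .tocsr() materializes fresh CSR copies before MKL is involved):

import scipy.sparse as sparse
import sparse_dot_mkl as sdm

sdm.mkl_set_num_threads_local(1)

A = sparse.load_npz("./A.npz")  # shape: (1934, 2381304)
B = sparse.load_npz("./B.npz")  # shape: (100, 1934)

# Convert the transposed (CSC) views to fresh CSR matrices in python,
# so MKL receives freshly built CSR inputs rather than shared views.
C = sdm.dot_product_mkl(A.T.tocsr(), B.T.tocsr())
print(C.shape)  # expected: (2381304, 100)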
