The sparse support in pytorch is still in beta, but as this package also supports CPU<->GPU conversions for dense data, it makes sense to add pytorch support here. Using pytorch to efficiently transfer the building blocks of sparse arrays (indptr, indices, values) could also be interesting.
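As a rough illustration of that idea (not this package's actual API): the three CSR components could each be staged through pinned torch tensors and then wrapped as cupy arrays on the device. The helper name `csr_to_gpu_via_torch` and the matrix size below are made up.

```python
import cupy as cp
import cupyx.scipy.sparse as cpsp
import scipy.sparse as sp
import torch

def csr_to_gpu_via_torch(csr):
    """Sketch: rebuild a scipy CSR matrix on the GPU from its three components."""
    gpu_parts = []
    for comp in (csr.data, csr.indices, csr.indptr):
        pinned = torch.from_numpy(comp).pin_memory()             # page-locked staging copy
        gpu_parts.append(pinned.to("cuda", non_blocking=True))   # async host -> device copy
    torch.cuda.synchronize()
    # wrap the CUDA tensors as cupy arrays (zero-copy via __cuda_array_interface__);
    # in a real implementation the torch tensors may need to be kept alive explicitly
    data, indices, indptr = (cp.asarray(t) for t in gpu_parts)
    return cpsp.csr_matrix((data, indices, indptr), shape=csr.shape)

csr = sp.random(10_000, 10_000, density=0.01, format="csr", dtype="float32")
gpu_csr = csr_to_gpu_via_torch(csr)
```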
One argument is that pytorch supports efficient CPU->GPU transfers via pinned memory, which cupy currently lacks. Here is the transfer in cupy:
(cell 88 is the interesting one - repeatedly transferring data and reducing it on the GPU is our use case, often in a single pass. Cell 89 then shows that running multiple kernels over the data is very efficient, too)
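Roughly, the cupy side of the comparison looks like the following sketch (not the original notebook cells; array size and repeat count are illustrative):

```python
import numpy as np
import cupy as cp

chunk = np.random.random((64, 1024, 1024)).astype(np.float32)  # ~256 MiB of host data

def transfer_and_reduce_cupy(chunk, repeats=16):
    total = cp.zeros((), dtype=cp.float32)
    for _ in range(repeats):
        d_chunk = cp.asarray(chunk)   # pageable host -> device copy, no pinned staging buffer
        total += d_chunk.sum()        # single-pass reduction on the GPU
    cp.cuda.Stream.null.synchronize()
    return float(total)
```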
Same with torch and pinned tensors:
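A comparable sketch with torch, using a pinned staging tensor (again illustrative, not the original cells):

```python
import numpy as np
import torch

chunk = np.random.random((64, 1024, 1024)).astype(np.float32)

def transfer_and_reduce_torch(chunk, repeats=16):
    pinned = torch.from_numpy(chunk).pin_memory()         # page-locked host buffer
    total = torch.zeros((), device="cuda")
    for _ in range(repeats):
        # in the real pipeline, fresh data would be copied into `pinned` here
        d_chunk = pinned.to("cuda", non_blocking=True)    # async DMA from pinned memory
        total += d_chunk.sum()
    torch.cuda.synchronize()
    return float(total)
```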
That's a factor of ~2.6 difference in time to completion, most of the difference being useless copying on the CPU.
Until this is improved in cupy, going via pytorch may be a valuable addition. We can also add torch<->cupy conversions via the `__cuda_array_interface__` protocol.
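A minimal sketch of such a conversion, assuming a torch tensor that already lives on the GPU (cupy consumes `__cuda_array_interface__` directly; whether the reverse direction via `torch.as_tensor` is zero-copy depends on the torch version):

```python
import cupy as cp
import torch

t = torch.arange(10, dtype=torch.float32, device="cuda")

# torch -> cupy: cp.asarray() reads t.__cuda_array_interface__ and wraps the
# existing device memory instead of copying it
c = cp.asarray(t)
c *= 2          # the change is visible through the torch tensor as well

# cupy -> torch: recent torch versions can consume the interface, too
t2 = torch.as_tensor(c, device="cuda")
```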
Using managed memory / unified memory / mapped arrays (or whatever you want to call it) is also an option, but it transfers the array on each use, so it would only be a win for single-pass operations. Here it is implemented using `numba.cuda.mapped_array`:
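For illustration, a minimal mapped-array version could look like this (kernel and sizes are made up; the point is that the kernel reads the page-locked host buffer directly, so every pass over the data crosses the bus again):

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(arr, factor):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] *= factor

# page-locked host memory that is mapped into the GPU address space
mapped = cuda.mapped_array((64 * 1024 * 1024,), dtype=np.float32)
mapped[:] = 1.0

threads = 256
blocks = (mapped.size + threads - 1) // threads
scale[blocks, threads](mapped, np.float32(2.0))  # the kernel touches host memory directly
cuda.synchronize()
```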
Relevant cupy issues and comments:
- `cudaHostRegister()` may reduce time for CPU to GPU data transfer cupy/cupy#3450
- `out` to `cupy.asnumpy()` cupy/cupy#5155 (comment)