Using @view leads to 100x performance loss #475

Closed · RomeoV opened this issue Nov 22, 2023 · 7 comments · Fixed by #476

Comments

@RomeoV commented Nov 22, 2023

I have a dense-times-sparse matrix multiplication, but I only need some columns of the sparse matrix.
E.g., consider the following code:

using LinearAlgebra, SparseArrays, BenchmarkTools
D = rand(1000, 2000); X = sprand(2000, 10_000, 0.1);
w = findall(!iszero, X[1, :])
buf = zeros(size(D, 1), length(w));

@benchmark buf .= D * X[:, w]
@benchmark buf .= D * @view X[:, w]

leads to 9 ms and 700 ms, respectively.

It took me a while to track this down, because the code block this originally appeared in was wrapped in a @views call. Any ideas where the slowdown comes from?
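(An aside, my illustration rather than part of the original report: @views rewrites every slicing expression inside the code it wraps into a view, so the two lines below are equivalent, and the second form is exactly the slow case benchmarked above.)

@views buf .= D * X[:, w]    # the macro turns the slice X[:, w] into a view
buf .= D * view(X, :, w)     # what the line above expands to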

@ViralBShah (Member)

Might it be related to #469?

@RomeoV (Author) commented Nov 22, 2023

Chasing down the call stack, we eventually arrive either at the good case (no view):

function LinearAlgebra.generic_matmatmul!(C::StridedMatrix, tA, tB, A::DenseMatrixUnion, B::AbstractSparseMatrixCSC, _add::MulAddMul)
    transA = tA == 'N' ? identity : tA == 'T' ? transpose : adjoint
    if tB == 'N'
        _spmul!(C, transA(A), B, _add.alpha, _add.beta)
    elseif tB == 'T'
        _A_mul_Bt_or_Bc!(transpose, C, transA(A), B, _add.alpha, _add.beta)
    # ...

or the bad case (with view)
https://github.com/JuliaLang/julia/blob/5aaa94854367ca875375e38ae14f369f124e7315/stdlib/LinearAlgebra/src/matmul.jl#L767-L768

which is just the regular, non-sparse matmul.
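A small-scale way to reproduce the split directly (my addition, not from the thread) is to time the two in-place products; the $-interpolation is the usual BenchmarkTools convention for global variables:

using LinearAlgebra, SparseArrays, BenchmarkTools

D = rand(100, 200); X = sprand(200, 1_000, 0.1);
w = findall(!iszero, X[1, :]);
buf = zeros(size(D, 1), length(w));

@btime mul!($buf, $D, $X[:, $w]);        # hits the sparse-aware method above: fast
@btime mul!($buf, $D, view($X, :, $w));  # falls back to the generic dense matmul: slow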

versioninfo:

Julia Version 1.10.0-rc1
Commit 5aaa9485436 (2023-11-03 07:44 UTC)
Build Info: Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 1 on 16 virtual cores

Package info (shortened):

(@v1.10) pkg> st
Status `~/.julia/environments/v1.10/Project.toml`
  [2f01184e] SparseArrays v1.10.0

EDIT: Actually, the performance regression happens both on 1.10.0-rc1 with SparseArrays v1.10.0 and on Julia 1.9 with the stdlib SparseArrays.

@RomeoV (Author) commented Nov 22, 2023

It makes some sense when checking the type of X[:, w], in particular

julia> typeof(@view(X[:, w]))
SubArray{Float64, 2, SparseMatrixCSC{Float64, Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Vector{Int64}}, false}

and

julia> typeof(@view(X[:, w])) |> supertypes
(SubArray{Float64, 2, SparseMatrixCSC{Float64, Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Vector{Int64}}, false}, AbstractMatrix{Float64}, Any)

i.e. it seems @view(X[:, w]) is not a subtype of any sparse array type, which is why we end up with the generic (dense) dispatch. Is this the correct behaviour?
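To make that concrete (my addition; SparseMatrixCSCView is the internal, unexported alias quoted in the next comment, so the qualified name may differ across versions):

julia> V = @view X[:, w];

julia> V isa AbstractSparseMatrix   # a plain SubArray, not any sparse type
false

julia> V isa SparseArrays.SparseMatrixCSCView   # also false: w is a Vector, not a unit range
false

julia> (@view X[:, 1:10]) isa SparseArrays.SparseMatrixCSCView   # unit-range columns do qualify
true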

@dkarrasch (Member)

We can (and should) extend the "good case" to

# Define an alias for views of a SparseMatrixCSC which include all rows and a unit range of the columns.
# Also define a union of SparseMatrixCSC and this view, since many methods can be defined efficiently for
# this union by extracting the fields via the getter functions getcolptr, getrowval, and getnzval. The key
# insight is that getcolptr on a SparseMatrixCSCView returns an offset view of the colptr of the
# underlying SparseMatrixCSC.
const SparseMatrixCSCView{Tv,Ti} =
    SubArray{Tv,2,<:AbstractSparseMatrixCSC{Tv,Ti},
        Tuple{Base.Slice{Base.OneTo{Int}},I}} where {I<:AbstractUnitRange}
const SparseMatrixCSCUnion{Tv,Ti} = Union{AbstractSparseMatrixCSC{Tv,Ti}, SparseMatrixCSCView{Tv,Ti}}

but that won't help your case, because your column selection is not a unit range. I'm not sure there's an easy fix for that case, and you may be better off using slicing instead of views.
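A workaround consistent with that advice, my sketch rather than something from the thread: materialize the column selection once as an ordinary SparseMatrixCSC, then write the product into the preallocated buffer in place.

Xw = X[:, w]       # one sparse copy of the selected columns; takes the fast path
mul!(buf, D, Xw)   # in-place product; also avoids the dense temporary from buf .= D * Xw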

@RomeoV (Author) commented Nov 23, 2023

I totally see where you're coming from. It's just a devious performance trap that you wouldn't notice without benchmarking, especially when you just put a @views macro in front of a line of matrix code (which seemed perfectly reasonable to me until two days ago).

@dkarrasch (Member)

Oh, wait. The multiplication code may not need the getcolptr function, so those methods look like they can be extended to non-unit-range indices in the second dimension. Let me see...
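To see why getcolptr isn't essential for this kernel, here is an illustrative column-by-column multiply that walks only the parent's CSC structure for the selected columns. This is a sketch of the idea under my own naming (densesparse_selectcols! is hypothetical), not the actual implementation in #476:

using SparseArrays

# Computes C[:, j] = A * B[:, cols[j]] for each selected column, using only
# rowvals/nonzeros/nzrange on the parent matrix; no colptr offset view is needed.
function densesparse_selectcols!(C, A, B::SparseMatrixCSC, cols)
    fill!(C, 0)
    rv, nz = rowvals(B), nonzeros(B)
    for (j, col) in enumerate(cols)
        for k in nzrange(B, col)               # stored entries of column `col`
            i, v = rv[k], nz[k]
            @views C[:, j] .+= v .* A[:, i]    # accumulate v times column i of A
        end
    end
    return C
end

With the variables from the top of the thread, densesparse_selectcols!(buf, D, X, w) matches buf .= D * X[:, w].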

@dkarrasch (Member)

Ha!

julia> using LinearAlgebra, SparseArrays, BenchmarkTools
┌ Info: Precompiling SparseArrays [3f01184e-e22b-5df5-ae63-d93ebab69eaf]
└ @ Base loading.jl:2486

julia> D = rand(1000, 2000); X = sprand(2000, 10_000, 0.1);

julia> w = findall(!iszero, X[1,:])
986-element Vector{Int64}:
    6
   20
   43
   50
   52
   54
   62
   64
    
 9923
 9931
 9938
 9975
 9976
 9979
 9983
 9995

julia> buf = zeros(size(D,1), length(w));

julia> @benchmark buf.=D*X[:,w]
BenchmarkTools.Trial: 54 samples with 1 evaluation.
 Range (min … max):  88.266 ms … 120.164 ms  ┊ GC (min … max): 0.00% … 23.28%
 Time  (median):     91.749 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.900 ms ±   6.204 ms  ┊ GC (mean ± σ):  2.11% ±  4.72%

  ▂▅▂ ▂  █ █    ▅▂                                              
  █████▁██▅█▅██▅██▁▁█▁▁▁▁▅▅▁▁▁▁█▁█▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▅▁▁▁▁▁▅ ▁
  88.3 ms         Histogram: frequency by time          111 ms <

 Memory estimate: 10.56 MiB, allocs estimate: 14.

julia> @benchmark buf.=D*@view X[:,w]
BenchmarkTools.Trial: 55 samples with 1 evaluation.
 Range (min … max):  87.575 ms … 116.487 ms  ┊ GC (min … max): 0.00% … 18.59%
 Time  (median):     90.948 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   92.414 ms ±   4.903 ms  ┊ GC (mean ± σ):  1.22% ±  3.44%

      █▅ ▅   ▂     ▂▅      ▂                                    
  █▁▅▅████████▅█▅▅▁██▅▅▁▅▁▅█▁▅▁▅▁▅▁▁▁▁▁▁▁▁▁▁▁▅▁▅▁▁▁▁▁▁▁▁▅▁▁▁▁▅ ▁
  87.6 ms         Histogram: frequency by time          105 ms <

 Memory estimate: 7.52 MiB, allocs estimate: 6.

See #476.
