Are there any plans for GPU support? #17
Hi @hakkelt, firstly to answer your question: are there any plans for GPU support? Well, let's say there is no official one 😄 My experience with GPU programming is rather limited; however, as far as I know, one of the bottlenecks of GPU programming is moving memory, so I believe efficient GPU processing can be reached only if all the data stays on the GPU.
I'm not sure you can always avoid allocating/sharing buffers: in the approach we've taken here, these buffers need to be stored in the forward pass in order to compute a successful backward pass and obtain gradients. This is strictly necessary when nonlinear operators are involved. Probably you can share buffers within linear operators as long as dimensions are consistent. Of course contributions are very welcome! 😉 |
Hi @nantonel,
Currently, Compose is implemented like this:
struct Compose{N, M, L<:NTuple{N,Any}, T<:NTuple{M,Any}} <: AbstractOperator
A::L
buf::T
end
function Compose(L1::AbstractOperator, L2::AbstractOperator)
if size(L1,2) != size(L2,1)
throw(DimensionMismatch("cannot compose operators"))
end
if domainType(L1) != codomainType(L2)
throw(DomainError())
end
Compose( L1, L2, Array{domainType(L1)}(undef,size(L2,1)) )
end
[...]
@generated function mul!(y::C, L::Compose{N,M,T1,T2},b::D) where {N,M,T1,T2,C,D}
ex = :(mul!(L.buf[1],L.A[1],b))
for i = 2:M
ex = quote
$ex
mul!(L.buf[$i],L.A[$i], L.buf[$i-1])
end
end
ex = quote
$ex
mul!(y,L.A[N], L.buf[M])
return y
end
end
We can rewrite this code to the following to defer the allocation:
mutable struct Compose{N, M, L<:NTuple{N,Any}, T<:NTuple{M,Any}} <: AbstractOperator
A::L
buf::Union{Nothing,T} # I modified the code HERE
end
function Compose(L1::AbstractOperator, L2::AbstractOperator)
if size(L1,2) != size(L2,1)
throw(DimensionMismatch("cannot compose operators"))
end
if domainType(L1) != codomainType(L2)
throw(DomainError())
end
Compose( L1, L2, nothing ) # I modified the code HERE
end
[...]
@generated function mul!(y::C, L::Compose{N,M,T1,T2},b::D) where {N,M,T1,T2,C,D}
ex = quote
L.buf[1] isa Nothing && ( L.buf[1] = Array{codomainType(L.A[1])}(undef, size(L.A[1],1)) ) # I modified the code HERE
mul!(L.buf[1],L.A[1],b)
end
for i = 2:M
ex = quote
$ex
L.buf[$i] isa Nothing && ( L.buf[$i] = Array{codomainType(L.A[$i])}(undef, size(L.A[$i],1)) ) # I modified the code HERE
mul!(L.buf[$i],L.A[$i], L.buf[$i-1])
end
end
ex = quote
$ex
mul!(y,L.A[N], L.buf[M])
return y
end
end
I didn't run this code, but I think this approach should work (of course, after defining the backward counterpart similarly). Then GPU support would only require a specialization for CuArray inputs:
@generated function mul!(y::CuArray, L::Compose{N,M,T1,T2},b::CuArray) where {N,M,T1,T2}
ex = quote
L.buf[1] isa Nothing && ( L.buf[1] = CuArray{codomainType(L.A[1])}(undef, size(L.A[1],1)) )
mul!(L.buf[1],L.A[1],b)
end
for i = 2:M
ex = quote
$ex
L.buf[$i] isa Nothing && ( L.buf[$i] = CuArray{codomainType(L.A[$i])}(undef, size(L.A[$i],1)) )
mul!(L.buf[$i],L.A[$i], L.buf[$i-1])
end
end
ex = quote
$ex
mul!(y,L.A[N], L.buf[M])
return y
end
end
I'm not sure if it satisfies the criterion you mentioned ("these buffers need to be stored in the forward pass in order to compute a successful backward pass and obtain gradients"), but if it does and you don't see any obvious flaws that would cause trouble in other components of the package (or in other packages built upon it), then I would create a PR to see whether it works in practice. |
Btw it would mean that once an operator is applied on GPU arrays, it cannot be used on CPU arrays any more, and vice versa. But it is possible to tackle this problem by modifying the definition of the struct to have two separate lists of buffers: one for CPU buffers and the other for GPU buffers, as in the sketch below.
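A minimal sketch of what that could look like (all names here are hypothetical, not the package's actual API):
abstract type AbstractOperatorSketch end  # stand-in for the package's AbstractOperator

mutable struct DualCompose{N, L<:NTuple{N,Any}} <: AbstractOperatorSketch
    A::L                            # the composed operators
    cpu_buf::Union{Nothing,Tuple}   # lazily filled on the first Array input
    gpu_buf::Union{Nothing,Tuple}   # lazily filled on the first CuArray input
end

# mul! would then pick (and lazily allocate) the slot matching the input type:
# select_buf(L::DualCompose, ::Array)   -> L.cpu_buf
# select_buf(L::DualCompose, ::CuArray) -> L.gpu_buf
|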
If it's only about the possibility of changing the array type of the buffer, one could simply add a keyword argument to the constructor:
function Compose(L1::AbstractOperator, L2::AbstractOperator; T=Array)
if size(L1,2) != size(L2,1)
throw(DimensionMismatch("cannot compose operators"))
end
if domainType(L1) != codomainType(L2)
throw(DomainError())
end
Compose( L1, L2, T{domainType(L1)}(undef,size(L2,1)) )
end
then you can set T=CuArray. This also preserves the immutability of the struct. All that is needed is to define this keyword argument in every constructor that allocates buffers. Regarding using both CPU and GPU, perhaps another linear operator could be introduced that transfers data between the two architectures. |
Yes, that is also a solution, but I prefer the other way because it allows end-users to switch from CPU to GPU without changing their code, and it also works for shorthand constructors. In contrast, it might confuse users that only the calculus rules need the input-output type declared explicitly in their constructor (I think the linear and non-linear operators only need their mul! methods overloaded).
I don't think the deferred allocation is much more difficult to implement. On the other hand, your concern about making the structs mutable might be a more important question, but why shouldn't we do that? In my opinion, the performance hit would be insignificant. And if the possibility of changing other fields is your main concern, then it is possible to make an auxiliary mutable wrapper type to hold the buffer (or nothing), so we can leave the Compose struct itself immutable:
mutable struct Buffer
b::Union{Nothing,Any}
end
struct Compose{N, M, L<:NTuple{N,Any}, T<:NTuple{M,Buffer}} <: AbstractOperator
A::L
buf::T
end
Or are there any other issues with mutability?
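To illustrate, a hedged sketch of how mul! could use this wrapper while Compose itself stays immutable (get_or_alloc! is a hypothetical helper, not existing code):
function get_or_alloc!(buf::Buffer, T::Type, dims)
    # allocate on first use, reuse afterwards; only the Buffer cell mutates
    buf.b isa Nothing && (buf.b = Array{T}(undef, dims))
    return buf.b
end

# inside the generated mul!, instead of assigning to L.buf[i] directly:
# b1 = get_or_alloc!(L.buf[1], codomainType(L.A[1]), size(L.A[1], 1))
# mul!(b1, L.A[1], b)
|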
And another argument for the deferred allocation: it is much easier to decide inside mul! whether two stages can share a buffer:
@generated function mul!(y::C, L::Compose{N,M,T1,T2},b::D) where {N,M,T1,T2,C,D}
ex = quote
L.buf[1] isa Nothing && ( L.buf[1] = Array{codomainType(L.A[1])}(undef, size(L.A[1],1)) )
mul!(L.buf[1],L.A[1],b)
end
for i = 2:M
ex = quote
$ex
if L.buf[$i] isa Nothing
L.buf[$i] = ( size(L.A[$i],1) == size(L.A[$i-1],1) ? L.buf[$i-1] : Array{codomainType(L.A[$i])}(undef, size(L.A[$i],1)) )
end
mul!(L.buf[$i],L.A[$i], L.buf[$i-1])
end
end
ex = quote
$ex
mul!(y,L.A[N], L.buf[M])
return y
end
end
Ok, this solution is not the best because some operators don't work in-place (or at least they need some modification to perform the operation in place, like DFT), but I know a better (though slightly more complex) solution that worked quite well for FunctionOperators.jl. |
(A bit off-topic question, by the way: what is the purpose of ...?) |
I actually don't remember! 😅 But this is off-topic; feel free to open another issue. |
Well, this package is not really intended for end users... the end-user package is StructuredOptimization.jl.
Some operators need buffers for themselves, so I don't see how this is avoidable...
Not sure, but mutable objects are more "dangerous"... Overall, I don't understand why you want to switch between CPU and GPU... isn't it more efficient to stick to one of the two? As far as I know, software like PyTorch does not provide this possibility (possibly because transferring data between GPU and CPU introduces substantial delays). But as I said already, I'm not an expert on GPU programming. |
Ok, that's fair enough, but then, unfortunately, this package might not meet my needs. I'm working on optimization methods, too, but currently I'm rather just prototyping and trying out new algorithms, so the other packages in the KUL forBES group are too high-level for me.
Oh, I see now what you mean! No, I don't want to switch back and forth between CPU and GPU inside one algorithm. What I want is to implement the algorithm on CPU first, and when it runs without error, I can switch to GPU arrays, following the recommended workflow of CuArrays. Ideally, when you have a working CPU implementation, all you need is to replace Array with CuArray.
By the way, the interface of PyTorch is intentionally designed to be almost identical to that of NumPy for the same reason (it is recommended to write your code using NumPy first, and then you can easily switch to PyTorch).
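To make that workflow concrete, a toy illustration (the GPU lines are commented out so the snippet also runs on machines without a GPU):
algorithm_step(A, x) = A * x .+ 1   # device-agnostic algorithm code

A, x = randn(4, 4), randn(4)        # 1. develop and debug with plain Arrays
y = algorithm_step(A, x)

# 2. switch to the GPU by changing only the array constructors:
# using CuArrays
# Ag, xg = CuArray(A), CuArray(x)
# yg = algorithm_step(Ag, xg)       # same algorithm code, now on the GPU
|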
Never mind, I see why now. |
Ok, so I think we agree on the fact that all of an operator's buffers should live on a single architecture.
Then I'm really sorry, but I don't see the advantages of the delayed buffer allocation you proposed. If it's about sharing buffers, it would be preferable to do this during construction. Something like:
function Compose(L::Vector{AbstractOperator}; T=Array)
# check properties of `L` and see if any buffer can be shared
# allocate only the buffers that are needed (either in GPU or CPU using `T`)
# Construct `Compose` using `Compose((L[1],L[2],...),(buf1,buf2,buf1,...))`
end
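A hedged sketch of that construction-time sharing (assuming the package's codomainType/size API; following the (buf1, buf2, buf1, ...) pattern above, stage i can safely reuse the buffer of stage i-2, whose contents are no longer needed by then):
function make_buffers(ops; T=Array)
    bufs = Vector{Any}(undef, length(ops) - 1)
    for i in 1:length(ops)-1
        et, d = codomainType(ops[i]), size(ops[i], 1)
        # reuse the buffer from two stages back when eltype and size match;
        # otherwise allocate a fresh one of the requested array type
        if i > 2 && codomainType(ops[i-2]) == et && size(ops[i-2], 1) == d
            bufs[i] = bufs[i-2]
        else
            bufs[i] = T{et}(undef, d)
        end
    end
    return Tuple(bufs)
end
|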
I mean, in my opinion, the minimal amount of code would be produced by having all calculus rules array-agnostic. This is definitely not the case at the moment, but it shouldn't be too hard to achieve, for example by adding that keyword argument (and perhaps by giving the structs an extra field, e.g. the array type). |
Well, neither of us seems to be convincing the other. 😄 Unfortunately, I don't really have more arguments, and I don't want to waste your time (I really appreciate that you share your concerns about my proposal), so all I can do is try to summarize my pros and cons.
Recommended workflow of GPU computing
1. Implement and debug the algorithm on the CPU using plain Arrays.
2. Replace Array with CuArray and run the same code on the GPU.
What is important here is that if you want to make it easy to use this package in GPU code, then the best way to help the programmer is to make the transition from 1. to 2. as easy as possible. My proposal makes the transition easier because it doesn't require any explicit declaration, as it can figure out on its own what type of buffer needs to be allocated. In the case of simple programs, it is not a big deal, but for larger programs with many operators, looking up all the constructors can be quite cumbersome.
One code for CPU and GPU
Even when I'm done with the GPU code, I quite frequently need to return to the CPU code. For example, I might want to extend or improve the algorithm, and debugging CPU code is much easier than debugging GPU code. Or I would like to work on my laptop (which has no GPU). Or someone else just wants to check out my code, but he or she doesn't have a GPU or doesn't want the hassle of installing the proper CUDA driver. If I can use the same code (or almost the same code) for both CPU and GPU, then I save a tremendous amount of energy by not having to synchronize two implementations after each modification.
Shorthand constructors
For me, a very important advantage of using AbstractOperators.jl/FunctionOperators.jl/LinearMaps.jl/LinearOperators.jl is that I can write more elegant code that resembles the mathematical notation used in the literature. Shorthand constructors are one of the most important tools for reaching that. So my problem with the approach you proposed is that it cannot be adapted to shorthand constructors (or at least I cannot see an elegant way to do so).
Main point
From the previous arguments, I hope it is clear why I dislike your proposal with the extra argument in the constructor. I was thinking a bit, and I finally came to the conclusion that there might be quite a few other solutions besides mine that would satisfy me, so I'm totally open to alternatives. Just to mention a very simple one: we could have a global boolean flag that decides whether Arrays or CuArrays are used as buffers (although I find this solution particularly ugly).
However, I found one argument why my approach would be better than the others, and it is thread-safety. Let's assume, for example, that we want to use the same operator in multiple threads. I can solve this easily by extending the previously proposed structs:
struct Buffer
b::Dict{Int, Union{Nothing,Any}}
end
struct Compose{N, M, L<:NTuple{N,Any}, T<:NTuple{M,Buffer}} <: AbstractOperator
A::L
buf::T
end
@generated function mul!(y::C, L::Compose{N,M,T1,T2},b::D) where {N,M,T1,T2,C,D}
ex = quote
L.buf[1].b[Threads.threadid()] isa Nothing && ( L.buf[1].b[Threads.threadid()] = Array{codomainType(L.A[1])}(undef, size(L.A[1],1)) )
mul!(L.buf[1].b[Threads.threadid()],L.A[1],b)
end
for i = 2:M
ex = quote
$ex
if L.buf[$i].b[Threads.threadid()] isa Nothing
L.buf[$i].b[Threads.threadid()] = Array{codomainType(L.A[$i])}(undef, size(L.A[$i],1))
end
mul!(L.buf[$i].b[Threads.threadid()],L.A[$i], L.buf[$i-1].b[Threads.threadid()])
end
end
ex = quote
$ex
mul!(y,L.A[N], L.buf[M].b[Threads.threadid()])
return y
end
end
Thread-safety might be achieved in other ways as well, but even besides that, I have the intuition that deferring the allocation gives much more flexibility due to the increased amount of information available inside mul!. |
Yes, definitely! Although previously I have written the following:
What I meant here is that sometimes such a thing can be handy, for instance, when I do the transition from the CPU code to GPU in an interactive session, but actually it would just complicate things, so let's just forget about it! 😄
You are right, sharing buffers can also be achieved with your method. So I tried to explain above why my proposal is absolutely not only about sharing buffers.
I disagree. I don't see any problems with mutable objects.
Can you elaborate on this? Can this approach avoid having different constructors in CPU and GPU code? |
Don't worry, it's good that we discuss; it's not a waste of time at all. This is just brainstorming, and quite an important one 😉
Perhaps an alternative would be to change the abstract type to AbstractOperator{T<:AbstractArray}. This would allow knowing where each operator lives, either on the GPU or the CPU. You would then be allowed to combine only operators that live on the same architecture. If you want to switch an operator that lives on the CPU to the GPU, all you have to do is convert that object. A function togpu (with a tocpu counterpart) could take care of that. In the case of composition:
togpu(C::Compose{Array,...}) = Compose{CuArray,...}(togpu.(C.A),togpu.(C.buf) )
togpu(C::Compose{CuArray,...}) = C
From the user's perspective this would involve writing code like this:
T = Array # here you can set it to CuArray
A = AnOperator(T,...)
B = AnotherOperator(T,...)
C = SomeCalc(A,B)
For operators that can have the same code for GPU and CPU, I believe this should work; you just write:
mul!(y::AbstractArray,L::Operator,::AbstractArray)
and for those that need different implementations:
mul!(y::T,A::Operator{T,...},::T) where {T <: CuArray} = ...
mul!(y::T,A::Operator{T,...},::T) where {T <: Array} = ...
Compose would become:
function Compose(L1::AbstractOperator{T}, L2::AbstractOperator{T}) where {T<:AbstractArray}
if size(L1,2) != size(L2,1)
throw(DimensionMismatch("cannot compose operators"))
end
if domainType(L1) != codomainType(L2)
throw(DomainError())
end
Compose( L1, L2, T{domainType(L1)}(undef,size(L2,1)) )
end |
Well, I still don't see what's wrong with the deferred allocation, and why we need to complicate our lives and unnecessarily increase the verbosity of function signatures 😄 (generally, I think more about the inconvenience of using the package later than about the current implementation difficulties). In my opinion, the computing device is just a technical detail (both CPU and GPU give exactly the same result), so it shouldn't have to show up in user code. Anyway, if I put aside my dislike of the explicitness and the extra parameter, your last proposal looks quite OK to me. 🙂 |
I may have missed some details from the discussion, but here's my two cents: the issue is whether the input/intermediate/output array types T should be (option 1) fixed when an operator is constructed, or (option 2) determined dynamically when the operator is applied.
Option 1 is more static, but closer to the current package structure and therefore probably easier to implement: for example, the input/output dimensions are to be specified somehow at construction time (see e.g. the constructors here: https://github.com/kul-forbes/AbstractOperators.jl/blob/master/src/linearoperators/DFT.jl). Option 2 is more dynamic, and I can see how it would be easier to use, but it would probably require more work. It seems to me that in the end, computationally speaking, both options should be equivalent. Even if you have a "slicker" interface as in option 2, computation requires buffer types and sizes to be known in order to be carried out. So one question to ask is whether both options can be obtained. In particular, going for option 1 first would be easier as I understand it, but one could then allow for "lazier" construction syntax as in option 2: this would just store the operator parameters (not array sizes, not array types), and then "concretize" the operator once called. I'm sorry I don't have code snippets to showcase this, but in my mind this can be implemented in a fairly modular way, so that one can use both syntaxes as preferred and the code overhead is minimal: the problem I see here is with combination rules, which would need to be implemented for this "lazy layer" too. The only thing I'm not a big fan of here is hard-coding specific array types.
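For what it's worth, one hedged sketch of such a lazy layer (all names hypothetical): the operator parameters are captured in a closure, and the concrete operator is built on the first application, once the input's type and size are known.
using LinearAlgebra: mul!

mutable struct Lazy{F}
    make::F       # input array -> concrete operator (captures the parameters)
    op            # nothing until the first application
end
Lazy(make) = Lazy(make, nothing)

function apply!(y, L::Lazy, x)
    L.op === nothing && (L.op = L.make(x))  # concretize on first use
    return mul!(y, L.op, x)
end

# usage sketch (hypothetical constructor): Lazy(x -> SomeOperator(eltype(x), size(x)))
|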
I agree with @lostella: option 1 will avoid changing the calculus rules at the expense of changing the constructors. To me, option 1 seems more consistent with the current philosophy of this package: static automatic differentiation (AD). Fully dynamic AD software is already available in Julia, for example Zygote, which takes this dynamism to the compiler level. Those packages definitely fully exploit what you can do with Julia; in my opinion, going in that direction with AbstractOperators makes little sense (unfortunately)... |
I see this possibility as well, but perhaps it would be tidier to have a package on top of AbstractOperators to achieve this: LazyAbstractOperators? 😆 Probably StructuredOptimization could be a suitable candidate as well. |
That's quite an important point! But I wonder whether you would ever want to deal with combinations of operators that have different array types... for example, if you use static arrays, I'd imagine you want to stick to those throughout? Regarding the verbosity of function signatures, well, you can still keep it clean; for example, you can already build operators directly from an input array:
julia> x = randn(10); DCT(x)
ℱc ℝ^10 -> ℝ^10
julia> x = randn(Complex{Float64},10);DCT(x)
ℱc ℂ^10 -> ℂ^10
Here, dimension and domain type are deduced from x. |
By the way, I'm not against the deferred allocation; my only argument was about avoiding mutability. This is probably fine:
@generated function mul!(y::CuArray, L::Compose{N,M,T1,T2},b::CuArray) where {N,M,T1,T2}
ex = quote
L.buf[1] isa Nothing && ( L.buf[1] = CuArray{codomainType(L.A[1])}(undef, size(L.A[1],1)) )
mul!(L.buf[1],L.A[1],b)
end
...
end
The problem with this is that some operators inside the composition may still have internal buffers that live on the CPU. In my opinion there must be a strategy to avoid this problem and ensure everything lives on the right architecture. Option 2 fixes these problems. With option 1 I don't see how to avoid this problem without modifying the constructors anyway. Plus, it would be up to the user to carefully build their operators on the right architecture. |
Hi @lostella,
Hmm... ok, that's a good point. You guys probably see the "bigger picture" better, and you know best what the main purpose of creating this package was. 🙂
I agree with that.
Yes, all operators that allocate a buffer (both "atomic" ones like DFT or Filt and combination rules) would need to be modified; the same trick can be used in their case as well.
I don't think that a dynamic/lazy implementation can easily be built on top of the static one, because of the combination rules.
Yes, this option would be a comfortable alternative for lazy programmers like me, if we opt for option 1. 😄 |
What I see now is that both sides have quite strong arguments, which can be summarized as follows: option 1 (array types fixed at construction) is easier to implement and fits the package's static philosophy, while option 2 (deferred allocation) is more convenient when moving code between CPU and GPU.
But now I'm more and more uncertain which one is better... |
For me it's clear that option 1 is better suited. But certainly, coding will clear things up even more (other problems may arise). Let me know if you're planning a PR; otherwise I'll take care of it as soon as I have some time. |
Ok, then let's stick with option 1. I plan to make a PR, and I will start working on it soon. |
I've realised that we should avoid modifying the sub-typing of AbstractOperator. Instead, we can introduce a predicate, isgpu. For example:
A = AnOperator(CuArray....)
isgpu(A::AnOperator) = typeof(A.buffer) <: CuArray # for operators with buffers
isgpu(A::AnOperator{T}) = T <: CuArray # for other operators with no buffers
isgpu(A::AnOperator) = ... # for complete freedom of defining this
This solves the problem that @lostella pointed out, that is, we will have more freedom in the types of arrays we can use. An example of composition is:
function Compose(L1::AbstractOperator, L2::AbstractOperator)
if size(L1,2) != size(L2,1)
throw(DimensionMismatch("cannot compose operators"))
end
if domainType(L1) != codomainType(L2)
throw(DomainError())
end
if isgpu(L1) != isgpu(L2)
throw(ErrorException("Operators are not in the same architecture, use `togpu` or `tocpu` to convert them to the proper one."))
end
T = isgpu(L1) ? CuArray : Array
Compose( L1, L2, T{domainType(L1)}(undef,size(L2,1)) )
end
This is also more consistent with what we've already done in the package, for example in the predicate functions. (It should also involve less coding.) |
Well, I'm not really a fan of this solution because it looks like patchwork: sometimes the type of the buffer carries the information, and in other cases the computing device is inferred from the type of the operator. I would prefer a more consistent solution. The other issue I see is that isgpu assumes CuArray is the only possible GPU array type, which would rule out other architectures.
For the second problem, there is an easy solution:
A = AnOperator(CuArray....)
deviceType(A::AnOperator) = typeof(A.buffer) # for operators with buffers
deviceType(A::AnOperator{T}) = T # for other operators with no buffers
deviceType(A::AnOperator) = ... # for complete freedom of defining this
But I don't know what to do with the first. |
It's not really a patch. As I told you, we already do something really similar with the other predicate functions. It would still be possible to use other architectures: you just create different packages (which makes sense, given that those GPU packages are incompatible with each other). What these packages would contain are simply extensions of methods like:
T = iscpu(L1) ? Array : GPUArrayType()
Compose( L1, L2, T{domainType(L1)}(undef,size(L2,1)) )
In AbstractOperators we could simply have iscpu return true by default, leaving GPU extension packages to override it and to define GPUArrayType().
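A hedged sketch of that default/extension split (names hypothetical, mirroring the snippet above):
abstract type AbstractOperator end  # stand-in so this snippet is self-contained

# In AbstractOperators itself, everything is assumed to live on the CPU...
iscpu(A::AbstractOperator) = true
GPUArrayType() = error("load a GPU extension package to use GPU buffers")

# ...and a GPU extension package (e.g. one built on CuArrays) overrides:
# GPUArrayType() = CuArray
# iscpu(A::SomeGpuBackedOperator) = false
|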
I must admit I also found the predicate solution somewhat inelegant at first. On the other hand, I think you cannot avoid adding isgpu definitions for at least some of the operators. Also, I don't see why it is better to return a boolean instead of a type. Returning a type would make the code shorter and more readable (we could avoid the conditional operator in your example code, for instance), and it is still easy to check the returned type against the one of interest (e.g. Array) to get a boolean value. |
Perhaps it's not elegant, but it simplifies a lot of things. In my experience with Julia, avoiding sub-typing when you can gives you more freedom.
Yes, that was my idea. Only the ones that are necessary, i.e. those that don't work with the default definition.
I don't see any drawback compared to having everything in a single package, only advantages (version tagging can be used to make sure things work when one package advances with respect to another). Anyway, that was just to reply to your concern about being able to use different GPU architectures/packages.
If you think that's the best thing, go for it. 😄
This was to address what @lostella said:
With the boolean this hard-coding is avoided. |
@hakkelt Thanks for that PR! Do you think we can close this? |
Yes, I think we are done with that issue. |
Just to put this question into context: recently I developed a similar package (https://github.com/hakkelt/FunctionOperators.jl), unfortunately before I realized that such a package already existed (which is even more annoying given that I spent some time searching for such packages before I started developing mine...). As AbstractOperators seems to be quite a good package and is definitely more mature and feature-rich than mine, I would be willing to contribute to this package to add GPU support (or, more specifically, support for CuArray input and output) rather than continue developing FunctionOperators.
I looked at the source code, and it appears to me that the main design barrier to GPU support is that the calculus rules allocate buffers when the expression is parsed, while it only turns out later, when the operator is applied to an input array, whether that input is a CPU or a GPU array. To avoid this problem in FunctionOperators, I allocate the buffers only when the operator is first applied to its inputs; afterwards, the buffer is of course re-used. I think this method would be viable here as well, and it would need only a reasonable amount of refactoring.
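The deferred-allocation idea in isolation, as a hedged sketch (this is not FunctionOperators' actual code; BufferedOp and its fields are made up for illustration):
mutable struct BufferedOp{F}
    f!::F            # in-place kernel: f!(buf, x) writes the result into buf
    outsize::Dims
    buf              # nothing until the operator is first applied
end
BufferedOp(f!, outsize) = BufferedOp(f!, outsize, nothing)

function (op::BufferedOp)(x::AbstractArray)
    if op.buf === nothing
        # similar() follows x's array type: an Array input yields an Array
        # buffer, while a CuArray input would yield a CuArray buffer
        op.buf = similar(x, eltype(x), op.outsize)
    end
    op.f!(op.buf, x)
    return op.buf
end

# usage: sq = BufferedOp((b, x) -> (b .= x .^ 2), (4,)); sq(randn(4))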