-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce AsyncNumber
to lazily copy numeric mapreduce
results to the host
#550
Conversation
@maleadt, the remaining CUDA.jl issues are because of method signatures, which can easily be updated to handle the change. I can open respective PR in CUDA.jl if you think this change is fine |
Surprisingly small change! Still not a fan though :-P How much additional pressure does this put on the GC?
Only without losing some "type safety", right? IIUC |
To avoid changing method signatures we can just change this line to: - m=maximum(I), n=maximum(J);
+ m=maximum(I)[], n=maximum(J)[];
You mean that the compiler wouldn't be able to infer the type? julia> x = AMDGPU.rand(Float32, 16);
julia> f(x::Number) = x + 2
f (generic function with 1 method)
julia> @code_warntype sum(x)
MethodInstance for sum(::ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer})
from sum(a::AbstractArray; dims, kw...) @ Base reducedim.jl:1010
Arguments
#self#::Core.Const(sum)
a::ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}
Body::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
1 ─ nothing
│ %2 = Base.:(:)::Core.Const(Colon())
│ %3 = Core.NamedTuple()::Core.Const(NamedTuple())
│ %4 = Base.pairs(%3)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}())
│ %5 = Base.:(var"#sum#828")(%2, %4, #self#, a)::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
└── return %5
julia> @code_warntype(f(sum(x)))
MethodInstance for f(::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}})
from f(x::Number) @ Main REPL[3]:1
Arguments
#self#::Core.Const(f)
x::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
Body::Float32
1 ─ %1 = (x + 2)::Float32
└── return %1 |
For the Flux model that I have and use for testing, machine consistently hangs (machine with a single AMD GPU) after several hundred training steps with or without this change... But generally, I really do think that something needs to be done to GC and GPU, since this is the biggest blocker for Julia + DL right now, unless you have 2+ GPUs or use headless server... |
I'm not thinking of type stability or inference, but the ability to restrict method definitions (e.g. for dispatch purposes). If we now have The fact that we're now indicating that GPUArrays.jl/src/host/mapreduce.jl Lines 111 to 134 in 80b8226
getproperty is forwarded to the inner value? That doesn't seem scalable though, as you can't forward everything to the inner value (and with arbitrary values being possible here, it's impossible to tell what to forward).
Not that I wouldn't want this to work, I just want to avoid breaking code. The fact that tests seem relatively OK is a good argument in favor though, and worst case we can always revert what's ultimately just an optimization. |
No,
Maybe we can have
Yeah, that's a consequence of making host transfers more explicit I guess... maybe_number(g::GPUNumber) = AN.number(g)
maybe_number(g) = g So that if the user writes code both for CPU and GPU he can use |
src/host/gpunumber.jl
Outdated
struct GPUNumber{T <: AbstractGPUArray} <: AN.AbstractNumber{T} | ||
val::T | ||
|
||
function GPUNumber(val::T) where T <: AbstractGPUArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pxl-th it might be a good idea to do a dropdims
or vec
in the constructor so this guy truly behaves like a number. The case I am thinking of is the following.
a = CUDA.rand(1,1,1,1) |> GPUNumber
x = CUDA.rand(4,4)
size(x .+ a) # want (4,4) but will likely get (4,4,1,1)
I haven't run your code but it seems like it would do something similar to
julia> rand(1,1,1,1) .+ rand(4,4) |> size
(4, 4, 1, 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, thanks!
But that's a breaking (and non-generic) change? So yeah, making this happen only for Numbers where we can be "sure" we implement the number interface transparently seems like the safer choice. But even then I'm still wary that this will break user code, so this will need thorough testing... |
So I've made it behave as usual when |
Also, some bike shedding: |
GPUNumber
to lazily copy mapreduce
result to hostAsyncNumber
to lazily copy numeric mapreduce
results to the host
Here's also a timeline for the Flux.jl model for CUDA.jl.
When running outside of profiler, one epoch takes ~35-40 minutes (down from 50-55 min), while PyTorch takes ~12 minutes per epoch. So there's still room for improvement (timeline profiling also looks much denser for PyTorch, there are no gaps between GPU kernels).
|
Remaining gaps could be either GC pauses. GC: pause 366.60ms. collected 5.253326MB. incr
GC: pause 112.91ms. collected 34.399342MB. full recollect
GC: pause 355.50ms. collected 0.455795MB. incr
GC: pause 114.99ms. collected 30.454090MB. full recollect
GC: pause 371.72ms. collected 0.405960MB. incr
GC: pause 118.76ms. collected 30.621845MB. full recollect
GC: pause 359.00ms. collected 5.808136MB. incr
GC: pause 124.97ms. collected 10.381500MB. full recollect
GC: pause 348.96ms. collected 5.253326MB. incr
GC: pause 112.76ms. collected 37.985279MB. full recollect
GC: pause 348.96ms. collected 0.455795MB. incr
GC: pause 111.61ms. collected 33.037785MB. full recollect
GC: pause 349.60ms. collected 0.405960MB. incr Or synchronizations during host -> device transfer. Could be that PyTorch's data loader prefetches data on a separate worker/stream to overlap with compute (don't think it is using pinned memory by default). |
You could use NVTX.jl to visualize GC pauses, https://juliagpu.github.io/NVTX.jl/dev/#Instrumenting-Julia-internals, and maybe add a couple of annotations to key CUDA.jl functions. |
Is the GC time spent marking/sweeping in Julia, or are the |
Can you set |
Selected green region is where
@vchuravy I've updated the above post with |
@pxl-th what model are you training? I'm curious how much of the performance gain is due to AsyncNumber vs JuliaDiff/ChainRules.jl#801 |
It was HiFi-GAN and JuliaDiff/ChainRules.jl#801 probably brings bigger performance gain. |
Why the close? Did this turn out to hurt GC too much? |
Ah... I accidentally removed my fork of |
I've invited you to the AMDGPU team and give you access to GPUArrays.jl, so you can just host your branch here if you want. |
Thanks! |
Introduce
GPUNumber
to store the resul ofmapreduce
across alldims
(i.e.dims = :
) instead of immediately transferring it to host.GPUNumber
copies its value to host when it is used as a regular number.But if it is used for broadcasting, it passes its underlying 1-element array without device-to-host copy.
Seems to be minimally breaking change, except for the return type of
sum
and the likes.Using AbstractNumbers.jl since it implements the whole
Number
interface conveniently.More context: FluxML/Zygote.jl#1513
Here's also a timeline profile for the Flux.jl model for the same region.
We can see that now there are no device-to-host copies.