gpu(::DataLoader), take III (#2245)
* simpler MLUtils gpu(::DataLoader)

* docs

* also move cpu/gpu docstrings to a reference section

* doc fixes

* less verbose code in docs

* tweak words

* Apply 3 suggestions
mcabbott authored May 1, 2023
1 parent 650699c commit 5790b73
Showing 6 changed files with 138 additions and 44 deletions.
2 changes: 2 additions & 0 deletions NEWS.md
@@ -5,6 +5,8 @@ See also [github's page](https://github.com/FluxML/Flux.jl/releases) for a compl
## v0.13.16
* Most greek-letter keyword arguments are deprecated in favour of ascii.
Thus `LayerNorm(3; ϵ=1e-4)` (not `ε`!) should become `LayerNorm(3; eps=1e-4)`.
* `DataLoader(...) |> gpu` will now produce a special iterator, moving each batch as needed,
instead of giving an error.
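For example, a minimal sketch of the new behaviour (assuming `X`, `Y`, a `model`, a `loss`, and an `opt_state` are already defined):
```julia
gpu_loader = Flux.DataLoader((X, Y), batchsize=64) |> gpu
for (x, y) in gpu_loader  # each batch is moved to the GPU only when it is needed
    grads = gradient(m -> loss(m, x, y), model)
    Flux.update!(opt_state, model, grads[1])
end
```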

## v0.13.15
* Added [MultiHeadAttention](https://github.com/FluxML/Flux.jl/pull/2146) layer.
75 changes: 31 additions & 44 deletions docs/src/gpu.md
@@ -49,7 +49,7 @@ julia> Flux.GPU_BACKEND
"CUDA"
```

## GPU Usage
## Basic GPU Usage

Support for array operations on other hardware backends, like GPUs, is provided by external packages like [CUDA](https://github.com/JuliaGPU/CUDA.jl). Flux is agnostic to array types, so we simply need to move model weights and data to the GPU and Flux will handle it.
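For example, a minimal sketch (assuming CUDA.jl is installed and a CUDA GPU is available):
```julia
using Flux, CUDA

W = rand(Float32, 2, 5)
Wgpu = W |> gpu                # a CuArray if a GPU is found, otherwise W is returned unchanged

model = Dense(5 => 2) |> gpu   # gpu recurses through the layer, moving all parameters
x = rand(Float32, 5) |> gpu
y = model(x)                   # the forward pass now runs on the GPU
```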

@@ -122,61 +122,48 @@ julia> x |> cpu
0.7766742
```

```@docs
cpu
gpu
```

## Common GPU Workflows

Some of the common workflows involving the use of GPUs are presented below.

### Transferring Training Data
## Transferring Training Data

In order to train the model using the GPU both model and the training data have to be transferred to GPU memory. This process can be done with the `gpu` function in two different ways:
In order to train the model using the GPU, both the model and the training data have to be transferred to GPU memory. Moving the data can be done in two different ways:

1. Iterating over the batches in a [DataLoader](@ref) object transferring each one of the training batches at a time to the GPU.
1. Iterating over the batches in a [`DataLoader`](@ref) object, transferring each training batch to the GPU as it is used. This is recommended for large datasets. Done by hand, it might look like this:
```julia
train_loader = Flux.DataLoader((xtrain, ytrain), batchsize = 64, shuffle = true)
# ... model, optimiser and loss definitions
for epoch in 1:nepochs
for (xtrain_batch, ytrain_batch) in train_loader
x, y = gpu(xtrain_batch), gpu(ytrain_batch)
gradients = gradient(() -> loss(x, y), parameters)
Flux.Optimise.update!(optimiser, parameters, gradients)
train_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true)
# ... model definition, optimiser setup
for epoch in 1:epochs
for (x_cpu, y_cpu) in train_loader
x = gpu(x_cpu)
y = gpu(y_cpu)
grads = gradient(m -> loss(m, x, y), model)
Flux.update!(opt_state, model, grads[1])
end
end
```

2. Transferring all training data to the GPU at once before creating the [DataLoader](@ref) object. This is usually performed for smaller datasets which are sure to fit in the available GPU memory. Some possibilities are:
```julia
gpu_train_loader = Flux.DataLoader((xtrain |> gpu, ytrain |> gpu), batchsize = 32)
```
```julia
gpu_train_loader = Flux.DataLoader((xtrain, ytrain) |> gpu, batchsize = 32)
```
Note that both `gpu` and `cpu` are smart enough to recurse through tuples and namedtuples. Another possibility is to use [`MLUtils.mapobs`](https://juliaml.github.io/MLUtils.jl/dev/api/#MLUtils.mapobs) to push the data movement invocation into the background thread:
```julia
using MLUtils: mapobs
# ...
gpu_train_loader = Flux.DataLoader(mapobs(gpu, (xtrain, ytrain)), batchsize = 16)
```

3. Wrapping the `DataLoader` in [`CUDA.CuIterator`](https://cuda.juliagpu.org/stable/usage/memory/#Batching-iterator) to efficiently move data to GPU on demand:
Rather than write this out every time, you can just call `gpu(::DataLoader)`:
```julia
using CUDA: CuIterator
train_loader = Flux.DataLoader((xtrain, ytrain), batchsize = 64, shuffle = true)
# ... model, optimiser and loss definitions
for epoch in 1:nepochs
for (xtrain_batch, ytrain_batch) in CuIterator(train_loader)
# ...
gpu_train_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true) |> gpu
# ... model definition, optimiser setup
for epoch in 1:epochs
for (x, y) in gpu_train_loader
grads = gradient(m -> loss(m, x, y), model)
Flux.update!(opt_state, model, grads[1])
end
end
```
This is equivalent to `DataLoader(MLUtils.mapobs(gpu, (X, Y)); keywords...)`.
Something similar can also be done with [`CUDA.CuIterator`](https://cuda.juliagpu.org/stable/usage/memory/#Batching-iterator), `gpu_train_loader = CUDA.CuIterator(train_loader)`. However, this only works with a limited number of data types: `first(train_loader)` should be a tuple (or `NamedTuple`) of arrays; see the sketch after this list.

Note that this works with a limited number of data types. If `iterate(train_loader)` returns anything other than arrays, approach 1 or 2 is preferred.
2. Transferring all training data to the GPU at once before creating the `DataLoader`. This is usually performed for smaller datasets which are sure to fit in the available GPU memory.
```julia
gpu_train_loader = Flux.DataLoader((X, Y) |> gpu, batchsize = 32)
# ...
for epoch in 1:epochs
for (x, y) in gpu_train_loader
# ...
```
Here `(X, Y) |> gpu` applies [`gpu`](@ref) to both arrays, as it recurses into structures.
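To make the two lazy options above concrete, here is a brief sketch (assuming `X` and `Y` are the training arrays, and that CUDA.jl is loaded for the second part):
```julia
using MLUtils: mapobs
using CUDA: CuIterator

train_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true)

# Equivalent to `Flux.DataLoader((X, Y), batchsize=64, shuffle=true) |> gpu`:
lazy_loader = Flux.DataLoader(mapobs(gpu, (X, Y)), batchsize=64, shuffle=true)

# CuIterator also moves one batch at a time (and frees it eagerly afterwards),
# but each batch must be a tuple (or NamedTuple) of arrays:
for (x, y) in CuIterator(train_loader)
    # ... training step on the GPU batch ...
end
```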

### Saving GPU-Trained Models
## Saving GPU-Trained Models

After the training process is done, one must always transfer the trained model back to the `cpu` memory scope before serializing or saving to disk. This can be done, as described in the previous section, with:
```julia
10 changes: 10 additions & 0 deletions docs/src/models/functors.md
@@ -15,3 +15,13 @@ Functors.fcollect
Functors.functor
Functors.fmapstructure
```

## Moving models, or data, to the GPU

Flux provides some convenience functions based on `fmap`. Some ([`f16`](@ref Flux.f16), [`f32`](@ref Flux.f32), [`f64`](@ref Flux.f64)) change the precision of all arrays in a model. Others are used for moving a model to or from GPU memory:

```@docs
cpu
gpu(::Any)
gpu(::Flux.DataLoader)
```
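For example, a minimal sketch of how these helpers compose (here `gpu` is a no-op unless a GPU back-end such as CUDA.jl is available):
```julia
using Flux

m = Chain(Dense(3 => 2, relu), Dense(2 => 1))

m32 = f32(m)        # ensure every array in the model is Float32
m_gpu = gpu(m32)    # move the model to the GPU, if one is available
m_cpu = cpu(m_gpu)  # bring it back to CPU memory, e.g. before saving to disk
```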
56 changes: 56 additions & 0 deletions src/functor.jl
@@ -391,3 +391,59 @@ function gpu(::FluxAMDAdaptor, x)
end

function _amd end


"""
gpu(data::DataLoader)
Transforms a given `DataLoader` to apply `gpu` to each batch of data,
when iterated over. (If no GPU is available, this does nothing.)
# Example
```julia-repl
julia> dl = Flux.DataLoader((x = ones(2,10), y='a':'j'), batchsize=3)
4-element DataLoader(::NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}, batchsize=3)
with first element:
(; x = 2×3 Matrix{Float64}, y = 3-element StepRange{Char, Int64})
julia> first(dl)
(x = [1.0 1.0 1.0; 1.0 1.0 1.0], y = 'a':1:'c')
julia> c_dl = gpu(dl)
4-element DataLoader(::MLUtils.MappedData{:auto, typeof(gpu), NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}}, batchsize=3)
with first element:
(; x = 2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element StepRange{Char, Int64})
julia> first(c_dl).x
2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
1.0 1.0 1.0
1.0 1.0 1.0
```
For large datasets, this is preferred over moving all the data to
the GPU before creating the `DataLoader`, like this:
```julia-repl
julia> Flux.DataLoader((x = ones(2,10), y=2:11) |> gpu, batchsize=3)
4-element DataLoader(::NamedTuple{(:x, :y), Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, UnitRange{Int64}}}, batchsize=3)
with first element:
(; x = 2×3 CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element UnitRange{Int64})
```
!!! warning
This only works if `gpu` is applied directly to the `DataLoader`.
While `gpu` acts recursively on Flux models and many basic Julia structs,
it will not work on (say) a tuple of `DataLoader`s.
"""
function gpu(d::MLUtils.DataLoader)
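    # Rebuild the DataLoader with the same settings, but wrap the underlying data in
    # mapobs(gpu, ...), so that `gpu` is applied to each batch only when it is materialised.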
MLUtils.DataLoader(MLUtils.mapobs(gpu, d.data),
d.batchsize,
d.buffer,
d.partial,
d.shuffle,
d.parallel,
d.collate,
d.rng,
)
end
13 changes: 13 additions & 0 deletions test/amd/basic.jl
@@ -101,3 +101,16 @@ end
gpu_autodiff_test(bn, x; atol=1f-3, allow_nothing=true)
end
end

@testset "gpu(::DataLoader)" begin
X = randn(Float64, 3, 33)
pre1 = Flux.DataLoader(X |> Flux.gpu; batchsize=13, shuffle=false)
post1 = Flux.DataLoader(X; batchsize=13, shuffle=false) |> Flux.gpu
for epoch in 1:2
for (p, q) in zip(pre1, post1)
@test p isa ROCArray{Float32}
@test q isa ROCArray{Float32}
@test p ≈ q
end
end
end
26 changes: 26 additions & 0 deletions test/cuda/cuda.jl
@@ -178,3 +178,29 @@ end
@test cpu(xgpu) isa Vector{A2116}
@test cpu(gpu([CartesianIndex(1)])) isa Vector{CartesianIndex{1}}
end

@testset "gpu(::DataLoader)" begin
X = randn(Float64, 3, 33)
pre1 = Flux.DataLoader(X |> gpu; batchsize=13, shuffle=false)
post1 = Flux.DataLoader(X; batchsize=13, shuffle=false) |> gpu
for epoch in 1:2
for (p, q) in zip(pre1, post1)
@test p isa CuArray{Float32}
@test q isa CuArray{Float32}
@test p ≈ q
end
end

Y = Flux.onehotbatch(rand(0:2, 33), 0:2)
pre2 = Flux.DataLoader((x=X, y=Y) |> gpu; batchsize=7, shuffle=false)
post2 = Flux.DataLoader((x=X, y=Y); batchsize=7, shuffle=false) |> gpu
for (p, q) in zip(pre2, post2)
@test p.x == q.x
@test_skip p.y == q.y # https://github.com/FluxML/OneHotArrays.jl/issues/28 -- MethodError: getindex(::OneHotArrays.OneHotMatrix{UInt32, CuArray{UInt32, 1, CUDA.Mem.DeviceBuffer}}, ::Int64, ::Int64) is ambiguous
end

@test collect(pre2) isa Vector{<:NamedTuple{(:x, :y)}}
@test collect(post2) isa Vector{<:NamedTuple{(:x, :y)}} # collect makes no sense, but check eltype?

@test_throws Exception gpu(((x = Flux.DataLoader(X), y = Y),))
end
