gpu(::DataLoader), take III #2245

Merged: 7 commits, May 1, 2023
2 changes: 2 additions & 0 deletions NEWS.md
@@ -5,6 +5,8 @@ See also [github's page](https://github.com/FluxML/Flux.jl/releases) for a compl
## v0.13.16
* Most greek-letter keyword arguments are deprecated in favour of ascii.
Thus `LayerNorm(3; ϵ=1e-4)` (not `ε`!) should become `LayerNorm(3; eps=1e-4)`.
* `DataLoader(...) |> gpu` will now produce a special iterator, moving each batch as needed,
instead of giving an error.
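For illustration only, a minimal sketch of the new behaviour (assuming CUDA.jl is installed and a GPU is available; the data and sizes here are made up, not part of the changelog):
```julia
using Flux, CUDA

loader = Flux.DataLoader((rand(Float32, 2, 100), rand(Float32, 1, 100)); batchsize=10) |> gpu
for (x, y) in loader       # each batch is moved to the GPU only when it is produced
    @assert x isa CuArray  # holds when a functional GPU is present
end
```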

## v0.13.15
* Added [MultiHeadAttention](https://github.com/FluxML/Flux.jl/pull/2146) layer.
75 changes: 31 additions & 44 deletions docs/src/gpu.md
@@ -49,7 +49,7 @@ julia> Flux.GPU_BACKEND
"CUDA"
```

## GPU Usage
## Basic GPU Usage

Support for array operations on other hardware backends, like GPUs, is provided by external packages like [CUDA](https://github.com/JuliaGPU/CUDA.jl). Flux is agnostic to array types, so we simply need to move model weights and data to the GPU and Flux will handle it.
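For example, here is a minimal sketch of that workflow (assuming CUDA.jl is installed and functional; the layer and sizes are illustrative):

```julia
using Flux, CUDA

model = Dense(3 => 2) |> gpu     # parameters become CuArrays
x = rand(Float32, 3, 16) |> gpu  # move a batch of inputs as well
y = model(x)                     # the forward pass now runs on the GPU, y is a CuArray
```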

@@ -122,61 +122,48 @@ julia> x |> cpu
0.7766742
```

```@docs
cpu
gpu
```

**Member Author:** These docstrings moved from a "guide" section to a "reference" section.

## Common GPU Workflows

Some of the common workflows involving the use of GPUs are presented below.

### Transferring Training Data
## Transferring Training Data

In order to train the model using the GPU both model and the training data have to be transferred to GPU memory. This process can be done with the `gpu` function in two different ways:
In order to train the model using the GPU, both the model and the training data have to be transferred to GPU memory. Moving the data can be done in two different ways:

1. Iterating over the batches in a [DataLoader](@ref) object transferring each one of the training batches at a time to the GPU.
1. Iterating over the batches in a [`DataLoader`](@ref) object, transferring each training batch to the GPU as it is needed. This is recommended for large datasets. Done by hand, it might look like this:
```julia
train_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true)
# ... model definition, optimiser setup
for epoch in 1:epochs
    for (x_cpu, y_cpu) in train_loader
        x = gpu(x_cpu)
        y = gpu(y_cpu)
        grads = gradient(m -> loss(m, x, y), model)
        Flux.update!(opt_state, model, grads[1])
    end
end
```
The removed version of this example read:
```julia
train_loader = Flux.DataLoader((xtrain, ytrain), batchsize = 64, shuffle = true)
# ... model, optimiser and loss definitions
for epoch in 1:nepochs
    for (xtrain_batch, ytrain_batch) in train_loader
        x, y = gpu(xtrain_batch), gpu(ytrain_batch)
        gradients = gradient(() -> loss(x, y), parameters)
        Flux.Optimise.update!(optimiser, parameters, gradients)
    end
end
```

**Member Author:** I've changed this example to use explicit gradient, and to be less verbose.

2. Transferring all training data to the GPU at once before creating the [DataLoader](@ref) object. This is usually performed for smaller datasets which are sure to fit in the available GPU memory. Some possibilities are:
```julia
gpu_train_loader = Flux.DataLoader((xtrain |> gpu, ytrain |> gpu), batchsize = 32)
```
```julia
gpu_train_loader = Flux.DataLoader((xtrain, ytrain) |> gpu, batchsize = 32)
```
Note that both `gpu` and `cpu` are smart enough to recurse through tuples and namedtuples. Another possibility is to use [`MLUtils.mapobs`](https://juliaml.github.io/MLUtils.jl/dev/api/#MLUtils.mapobs) to push the data movement invocation into the background thread:
```julia
using MLUtils: mapobs
# ...
gpu_train_loader = Flux.DataLoader(mapobs(gpu, (xtrain, ytrain)), batchsize = 16)
```

The removed item 3 read: "Wrapping the `DataLoader` in [`CUDA.CuIterator`](https://cuda.juliagpu.org/stable/usage/memory/#Batching-iterator) to efficiently move data to GPU on demand:"
```julia
using CUDA: CuIterator
train_loader = Flux.DataLoader((xtrain, ytrain), batchsize = 64, shuffle = true)
# ... model, optimiser and loss definitions
for epoch in 1:nepochs
    for (xtrain_batch, ytrain_batch) in CuIterator(train_loader)
        # ...
    end
end
```
Rather than write this out every time, you can just call `gpu(::DataLoader)`:
```julia
gpu_train_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true) |> gpu
# ... model definition, optimiser setup
for epoch in 1:epochs
    for (x, y) in gpu_train_loader
        grads = gradient(m -> loss(m, x, y), model)
        Flux.update!(opt_state, model, grads[1])
    end
end
```
This is equivalent to `DataLoader(MLUtils.mapobs(gpu, (X, Y)); keywords...)`.
Something similar can also be done with [`CUDA.CuIterator`](https://cuda.juliagpu.org/stable/usage/memory/#Batching-iterator), `gpu_train_loader = CUDA.CuIterator(train_loader)`. However, this only works with a limited number of data types: `first(train_loader)` should be a tuple (or `NamedTuple`) of arrays.
**Member:** Here we could hint at using mapobs to transform the dataset into something CuIterator-compatible. Could be a short example like

    train_loader = mapobs(preprocess_transform, train_loader)
    gpu_train_loader = CUDA.CuIterator(train_loader)

Also, mention when CuIterator should be preferred over gpu?

**Member Author:** Is it preferred? This PR takes the line that it's not... it's tricky about what types, and finalize doesn't matter. So it's mentioned here in case people are already using it.

If finalize does matter, then we should do #2240 instead.

Note that this works with a limited number of data types. If `iterate(train_loader)` returns anything other than arrays, approach 1 or 2 is preferred.
2. Transferring all training data to the GPU at once before creating the `DataLoader`. This is usually performed for smaller datasets which are sure to fit in the available GPU memory.
```julia
gpu_train_loader = Flux.DataLoader((X, Y) |> gpu, batchsize = 32)
# ...
for epoch in 1:epochs
    for (x, y) in gpu_train_loader
        # ...
    end
end
```
Here `(X, Y) |> gpu` applies [`gpu`](@ref) to both arrays, as it recurses into structures.
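As a small illustration of that recursion (a sketch, assuming a working CUDA setup; the arrays are made up):

```julia
using Flux, CUDA

data = (x = rand(Float32, 3, 8), y = rand(Float32, 1, 8)) |> gpu
data.x isa CuArray  # true: `gpu` descended into the NamedTuple
data.y isa CuArray  # true
```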

### Saving GPU-Trained Models
## Saving GPU-Trained Models
**Member Author:** This change is just me reducing the number of levels of heading from 3 to 2. The file is a bit of a mess but no need for deep hierarchy.

After the training process is done, one must always transfer the trained model back to the `cpu` memory scope before serializing or saving to disk. This can be done, as described in the previous section, with:
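The rest of this code block is collapsed in the diff view. As a hedged sketch (not the collapsed original), one way to do this uses [BSON.jl](https://github.com/JuliaIO/BSON.jl); the model and filename here are illustrative:

```julia
using Flux, CUDA, BSON

model = Dense(3 => 2) |> gpu     # stand-in for a model trained on the GPU
model = model |> cpu             # bring the parameters back to CPU memory
BSON.@save "mymodel.bson" model  # then serialize as usual
```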
10 changes: 10 additions & 0 deletions docs/src/models/functors.md
@@ -15,3 +15,13 @@ Functors.fcollect
Functors.functor
Functors.fmapstructure
```

## Moving models, or data, to the GPU

Flux provides some convenience functions based on `fmap`. Some ([`f16`](@ref Flux.f16), [`f32`](@ref Flux.f32), [`f64`](@ref Flux.f64)) change the precision of all arrays in a model. Others are used for moving a model to or from GPU memory:

```@docs
cpu
gpu(::Any)
gpu(::Flux.DataLoader)
```
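For instance, a short sketch of these functions together (the GPU step assumes a functional CUDA.jl setup; the model is illustrative):

```julia
using Flux

m = Chain(Dense(2 => 3, relu), Dense(3 => 1))
m16 = Flux.f16(m)   # all floating-point parameters become Float16
m64 = Flux.f64(m)   # ... or Float64 (Float32 is Flux's default)
m_gpu = gpu(m)      # parameters become GPU arrays when a GPU backend is available
m_cpu = cpu(m_gpu)  # and back to ordinary CPU Arrays
```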
56 changes: 56 additions & 0 deletions src/functor.jl
@@ -391,3 +391,59 @@ function gpu(::FluxAMDAdaptor, x)
end

function _amd end


"""
gpu(data::DataLoader)

Transforms a given `DataLoader` to apply `gpu` to each batch of data,
when iterated over. (If no GPU is available, this does nothing.)

# Example

```julia-repl
julia> dl = Flux.DataLoader((x = ones(2,10), y='a':'j'), batchsize=3)
4-element DataLoader(::NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}, batchsize=3)
with first element:
(; x = 2×3 Matrix{Float64}, y = 3-element StepRange{Char, Int64})

julia> first(dl)
(x = [1.0 1.0 1.0; 1.0 1.0 1.0], y = 'a':1:'c')

julia> c_dl = gpu(dl)
4-element DataLoader(::MLUtils.MappedData{:auto, typeof(gpu), NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}}, batchsize=3)
with first element:
(; x = 2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element StepRange{Char, Int64})

julia> first(c_dl).x
2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
1.0 1.0 1.0
1.0 1.0 1.0
```

For large datasets, this is preferred over moving all the data to
the GPU before creating the `DataLoader`, like this:

```julia-repl
julia> Flux.DataLoader((x = ones(2,10), y=2:11) |> gpu, batchsize=3)
4-element DataLoader(::NamedTuple{(:x, :y), Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, UnitRange{Int64}}}, batchsize=3)
with first element:
(; x = 2×3 CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element UnitRange{Int64})
```

!!! warning
This only works if `gpu` is applied directly to the `DataLoader`.
While `gpu` acts recursively on Flux models and many basic Julia structs,
it will not work on (say) a tuple of `DataLoader`s.
"""
function gpu(d::MLUtils.DataLoader)
  MLUtils.DataLoader(MLUtils.mapobs(gpu, d.data),
    d.batchsize,
    d.buffer,
    d.partial,
    d.shuffle,
    d.parallel,
    d.collate,
    d.rng,
  )
end

**Member Author:** Instead of writing this out here, we could move it upstream: JuliaML/MLUtils.jl#153
13 changes: 13 additions & 0 deletions test/amd/basic.jl
@@ -101,3 +101,16 @@ end
gpu_autodiff_test(bn, x; atol=1f-3, allow_nothing=true)
end
end

@testset "gpu(::DataLoader)" begin
    X = randn(Float64, 3, 33)
    pre1 = Flux.DataLoader(X |> Flux.gpu; batchsize=13, shuffle=false)
    post1 = Flux.DataLoader(X; batchsize=13, shuffle=false) |> Flux.gpu
    for epoch in 1:2
        for (p, q) in zip(pre1, post1)
            @test p isa ROCArray{Float32}
            @test q isa ROCArray{Float32}
            @test p ≈ q
        end
    end
end
26 changes: 26 additions & 0 deletions test/cuda/cuda.jl
@@ -178,3 +178,29 @@ end
@test cpu(xgpu) isa Vector{A2116}
@test cpu(gpu([CartesianIndex(1)])) isa Vector{CartesianIndex{1}}
end

@testset "gpu(::DataLoader)" begin
    X = randn(Float64, 3, 33)
    pre1 = Flux.DataLoader(X |> gpu; batchsize=13, shuffle=false)
    post1 = Flux.DataLoader(X; batchsize=13, shuffle=false) |> gpu
    for epoch in 1:2
        for (p, q) in zip(pre1, post1)
            @test p isa CuArray{Float32}
            @test q isa CuArray{Float32}
            @test p ≈ q
        end
    end

    Y = Flux.onehotbatch(rand(0:2, 33), 0:2)
    pre2 = Flux.DataLoader((x=X, y=Y) |> gpu; batchsize=7, shuffle=false)
    post2 = Flux.DataLoader((x=X, y=Y); batchsize=7, shuffle=false) |> gpu
    for (p, q) in zip(pre2, post2)
        @test p.x == q.x
        @test_skip p.y == q.y  # https://github.com/FluxML/OneHotArrays.jl/issues/28 -- MethodError: getindex(::OneHotArrays.OneHotMatrix{UInt32, CuArray{UInt32, 1, CUDA.Mem.DeviceBuffer}}, ::Int64, ::Int64) is ambiguous
    end

    @test collect(pre2) isa Vector{<:NamedTuple{(:x, :y)}}
    @test collect(post2) isa Vector{<:NamedTuple{(:x, :y)}}  # collect makes no sense, but check eltype?

    @test_throws Exception gpu(((x = Flux.DataLoader(X), y = Y),))
end