gpu(::DataLoader), take III #2245
Changes from all commits: e35ce8b, 7ab7c86, ae7e0c2, aecb711, 2b7042b, db8aadb, 592badd
@@ -49,7 +49,7 @@ julia> Flux.GPU_BACKEND
 "CUDA"
 ```

-## GPU Usage
+## Basic GPU Usage

 Support for array operations on other hardware backends, like GPUs, is provided by external packages like [CUDA](https://github.com/JuliaGPU/CUDA.jl). Flux is agnostic to array types, so we simply need to move model weights and data to the GPU and Flux will handle it.
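As a quick illustration of that hand-off, here is a minimal sketch (not part of the diff; it assumes CUDA.jl is installed and a GPU is visible, and the `Dense` layer and input are placeholders):

```julia
using Flux, CUDA

model = Dense(2 => 3) |> gpu    # weights become CuArrays
x = rand(Float32, 2, 5) |> gpu  # data moves the same way
y = model(x)                    # runs on the GPU; y is a CuArray
y |> cpu                        # bring results back for inspection
```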
@@ -122,61 +122,48 @@ julia> x |> cpu
  0.7766742
 ```

 ```@docs
 cpu
 gpu
 ```

-## Common GPU Workflows
-
-Some of the common workflows involving the use of GPUs are presented below.
-
-### Transferring Training Data
+## Transferring Training Data

-In order to train the model using the GPU both model and the training data have to be transferred to GPU memory. This process can be done with the `gpu` function in two different ways:
+In order to train the model using the GPU both model and the training data have to be transferred to GPU memory. Moving the data can be done in two different ways:

-1. Iterating over the batches in a [DataLoader](@ref) object transferring each one of the training batches at a time to the GPU.
+1. Iterating over the batches in a [`DataLoader`](@ref) object transferring each one of the training batches at a time to the GPU. This is recommended for large datasets. Done by hand, it might look like this:
    ```julia
-   train_loader = Flux.DataLoader((xtrain, ytrain), batchsize = 64, shuffle = true)
-   # ... model, optimiser and loss definitions
-   for epoch in 1:nepochs
-       for (xtrain_batch, ytrain_batch) in train_loader
-           x, y = gpu(xtrain_batch), gpu(ytrain_batch)
-           gradients = gradient(() -> loss(x, y), parameters)
-           Flux.Optimise.update!(optimiser, parameters, gradients)
+   train_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true)
+   # ... model definition, optimiser setup
+   for epoch in 1:epochs
+       for (x_cpu, y_cpu) in train_loader
+           x = gpu(x_cpu)
+           y = gpu(y_cpu)
+           grads = gradient(m -> loss(m, x, y), model)
+           Flux.update!(opt_state, model, grads[1])
        end
    end
    ```

Review comment (PR author): I've changed this example to use explicit gradient, and to be less verbose.
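To see the new explicit-style loop in context, here is a self-contained sketch; the model, loss, `Adam` settings, and random data are illustrative assumptions, not part of the diff:

```julia
using Flux

# Hypothetical dataset: 10 features × 1000 observations, scalar targets.
X = rand(Float32, 10, 1000)
Y = rand(Float32, 1, 1000)

model = Chain(Dense(10 => 32, relu), Dense(32 => 1)) |> gpu
opt_state = Flux.setup(Adam(1f-3), model)  # explicit-mode optimiser state
loss(m, x, y) = Flux.mse(m(x), y)

train_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true)
for epoch in 1:3
    for (x_cpu, y_cpu) in train_loader
        x, y = gpu(x_cpu), gpu(y_cpu)       # move one batch at a time
        grads = gradient(m -> loss(m, x, y), model)
        Flux.update!(opt_state, model, grads[1])
    end
end
```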
-2. Transferring all training data to the GPU at once before creating the [DataLoader](@ref) object. This is usually performed for smaller datasets which are sure to fit in the available GPU memory. Some possibilities are:
-   ```julia
-   gpu_train_loader = Flux.DataLoader((xtrain |> gpu, ytrain |> gpu), batchsize = 32)
-   ```
-   ```julia
-   gpu_train_loader = Flux.DataLoader((xtrain, ytrain) |> gpu, batchsize = 32)
-   ```
-   Note that both `gpu` and `cpu` are smart enough to recurse through tuples and namedtuples. Another possibility is to use [`MLUtils.mapsobs`](https://juliaml.github.io/MLUtils.jl/dev/api/#MLUtils.mapobs) to push the data movement invocation into the background thread:
-   ```julia
-   using MLUtils: mapobs
-   # ...
-   gpu_train_loader = Flux.DataLoader(mapobs(gpu, (xtrain, ytrain)), batchsize = 16)
-   ```
-
-3. Wrapping the `DataLoader` in [`CUDA.CuIterator`](https://cuda.juliagpu.org/stable/usage/memory/#Batching-iterator) to efficiently move data to GPU on demand:
-   ```julia
-   using CUDA: CuIterator
-   train_loader = Flux.DataLoader((xtrain, ytrain), batchsize = 64, shuffle = true)
-   # ... model, optimiser and loss definitions
-   for epoch in 1:nepochs
-       for (xtrain_batch, ytrain_batch) in CuIterator(train_loader)
-           # ...
-       end
-   end
-   ```
-   Note that this works with a limited number of data types. If `iterate(train_loader)` returns anything other than arrays, approach 1 or 2 is preferred.
+   Rather than write this out every time, you can just call `gpu(::DataLoader)`:
+   ```julia
+   gpu_train_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true) |> gpu
+   # ... model definition, optimiser setup
+   for epoch in 1:epochs
+       for (x, y) in gpu_train_loader
+           grads = gradient(m -> loss(m, x, y), model)
+           Flux.update!(opt_state, model, grads[1])
+       end
+   end
+   ```
+   This is equivalent to `DataLoader(MLUtils.mapobs(gpu, (X, Y)); keywords...)`.
+   Something similar can also be done with [`CUDA.CuIterator`](https://cuda.juliagpu.org/stable/usage/memory/#Batching-iterator), `gpu_train_loader = CUDA.CuIterator(train_loader)`. However, this only works with a limited number of data types: `first(train_loader)` should be a tuple (or `NamedTuple`) of arrays.

Review comment: Here we could hint at using

    train_loader = mapobs(preprocess_transform, train_loader)
    gpu_train_loader = CUDA.CuIterator(train_loader)

Also, mention when CuIterator should be preferred over gpu?

Reply (PR author): Is it preferred? This PR takes the line that it's not... it's tricky about what types, and finalize doesn't matter. So it's mentioned here in case people are already using it. If finalize does matter, then we should do #2240 instead.
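To make the equivalence and the `CuIterator` alternative from the thread above concrete, a sketch (the data here is hypothetical):

```julia
using Flux, MLUtils
using CUDA                     # provides CuIterator

X = rand(Float32, 10, 100)
Y = rand(Float32, 1, 100)

# What gpu(::DataLoader) does under the hood: a lazy mapobs wrapper,
# so each batch is moved only when the loader is iterated.
loader = Flux.DataLoader(MLUtils.mapobs(gpu, (X, Y)), batchsize=16, shuffle=true)

# The CuIterator route mentioned above: batches must be (Named)Tuples of arrays.
cpu_loader = Flux.DataLoader((X, Y), batchsize=16)
for (x, y) in CUDA.CuIterator(cpu_loader)
    # x, y are CuArrays here; CuIterator also frees each previous batch eagerly
end
```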
+2. Transferring all training data to the GPU at once before creating the `DataLoader`. This is usually performed for smaller datasets which are sure to fit in the available GPU memory.
+   ```julia
+   gpu_train_loader = Flux.DataLoader((X, Y) |> gpu, batchsize = 32)
+   # ...
+   for epoch in 1:epochs
+       for (x, y) in gpu_train_loader
+           # ...
+   ```
+   Here `(X, Y) |> gpu` applies [`gpu`](@ref) to both arrays, as it recurses into structures.
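The recursion noted on that last line can be checked directly; a small sketch (output types assume CUDA.jl is loaded and functional):

```julia
using Flux, CUDA

batch = (x = ones(Float32, 2, 3), y = 'a':'c')
moved = gpu(batch)   # recurses into the NamedTuple
typeof(moved.x)      # CuArray — numeric arrays are transferred
typeof(moved.y)      # StepRange{Char, Int64} — non-array data is left alone
```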
-### Saving GPU-Trained Models
+## Saving GPU-Trained Models

Review comment (PR author): This change is just me reducing the number of levels of heading from 3 to 2. The file is a bit of a mess but no need for deep hierarchy.

 After the training process is done, one must always transfer the trained model back to the `cpu` memory scope before serializing or saving to disk. This can be done, as described in the previous section, with:
 ```julia
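The diff excerpt cuts off at the code fence; a hedged sketch of the pattern it describes (BSON.jl is one common serialization choice, assumed here rather than shown in the diff):

```julia
using Flux, BSON

model = cpu(model)               # move trained weights out of GPU memory
BSON.@save "mymodel.bson" model  # serialize the CPU copy to disk
```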
The second file in the diff:
@@ -391,3 +391,59 @@ function gpu(::FluxAMDAdaptor, x)
 end

 function _amd end

+"""
+    gpu(data::DataLoader)
+
+Transforms a given `DataLoader` to apply `gpu` to each batch of data,
+when iterated over. (If no GPU is available, this does nothing.)
+
+# Example
+
+```julia-repl
+julia> dl = Flux.DataLoader((x = ones(2,10), y='a':'j'), batchsize=3)
+4-element DataLoader(::NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}, batchsize=3)
+  with first element:
+  (; x = 2×3 Matrix{Float64}, y = 3-element StepRange{Char, Int64})
+
+julia> first(dl)
+(x = [1.0 1.0 1.0; 1.0 1.0 1.0], y = 'a':1:'c')
+
+julia> c_dl = gpu(dl)
+4-element DataLoader(::MLUtils.MappedData{:auto, typeof(gpu), NamedTuple{(:x, :y), Tuple{Matrix{Float64}, StepRange{Char, Int64}}}}, batchsize=3)
+  with first element:
+  (; x = 2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element StepRange{Char, Int64})
+
+julia> first(c_dl).x
+2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
+ 1.0  1.0  1.0
+ 1.0  1.0  1.0
+```
+
+For large datasets, this is preferred over moving all the data to
+the GPU before creating the `DataLoader`, like this:
+
+```julia-repl
+julia> Flux.DataLoader((x = ones(2,10), y=2:11) |> gpu, batchsize=3)
+4-element DataLoader(::NamedTuple{(:x, :y), Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, UnitRange{Int64}}}, batchsize=3)
+  with first element:
+  (; x = 2×3 CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y = 3-element UnitRange{Int64})
+```
+
+!!! warning
+    This only works if `gpu` is applied directly to the `DataLoader`.
+    While `gpu` acts recursively on Flux models and many basic Julia structs,
+    it will not work on (say) a tuple of `DataLoader`s.
+"""
+function gpu(d::MLUtils.DataLoader)
+  MLUtils.DataLoader(MLUtils.mapobs(gpu, d.data),
+    d.batchsize,
+    d.buffer,
+    d.partial,
+    d.shuffle,
+    d.parallel,
+    d.collate,
+    d.rng,
+  )
+end

Review comment (PR author, on the `d.batchsize` line): Instead of writing this out here, we could move it upstream: JuliaML/MLUtils.jl#153
Review comment: These docstrings moved from a "guide" section to a "reference" section.
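The upstream suggestion in the comment on `d.batchsize` would avoid depending on the field order of `DataLoader`. A hypothetical keyword-based variant (a sketch, not what this PR adds) might look like:

```julia
# Hypothetical alternative: reconstruct the loader via keywords rather than
# positional fields, so reordering DataLoader's fields upstream cannot break it.
function gpu(d::MLUtils.DataLoader)
    MLUtils.DataLoader(MLUtils.mapobs(gpu, d.data);
        batchsize = d.batchsize,
        buffer    = d.buffer,
        partial   = d.partial,
        shuffle   = d.shuffle,
        parallel  = d.parallel,
        collate   = d.collate,
        rng       = d.rng,
    )
end
```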