From 89e1084517cd40506a8522ed6a7fdef2edfd83a3 Mon Sep 17 00:00:00 2001
From: Carlo Lucibello
Date: Mon, 30 Dec 2024 18:45:01 +0100
Subject: [PATCH] move schedulers

---
 docs/src/guide/training/training.md       | 33 +++++++++++++++++++++++
 docs/src/reference/training/optimisers.md | 32 +---------------------
 2 files changed, 34 insertions(+), 31 deletions(-)

diff --git a/docs/src/guide/training/training.md b/docs/src/guide/training/training.md
index 1b79a3e8f4..32576d9193 100644
--- a/docs/src/guide/training/training.md
+++ b/docs/src/guide/training/training.md
@@ -337,6 +337,39 @@ opt_state = Flux.setup(Adam(0.02), bimodel)
 Flux.adjust!(opt_state.layers.enc, 0.03)
 ```
 
+
+## Scheduling Optimisers
+
+In practice, it is fairly common to schedule the learning rate of an optimiser to obtain faster convergence. There are a variety of popular scheduling policies, and you can find implementations of them in [ParameterSchedulers.jl](http://fluxml.ai/ParameterSchedulers.jl/stable). The documentation for ParameterSchedulers.jl provides a more detailed overview of the different scheduling policies, and how to use them with Flux optimisers. Below, we provide a brief snippet illustrating a [cosine annealing](https://arxiv.org/pdf/1608.03983.pdf) schedule with a momentum optimiser.
+
+First, we import ParameterSchedulers.jl and initialize a cosine annealing schedule to vary the learning rate between `1e-4` and `1e-2` every 10 epochs. We also create a new [`Momentum`](@ref Optimisers.Momentum) optimiser.
+```julia
+using ParameterSchedulers
+
+opt_state = Flux.setup(Momentum(), model)
+schedule = Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10)
+for (eta, epoch) in zip(schedule, 1:100)
+  Flux.adjust!(opt_state, eta)
+  # your training code here
+end
+```
+`schedule` can also be indexed (e.g. `schedule(100)`) or iterated like any iterator in Julia.
+
+ParameterSchedulers.jl schedules are stateless (they don't store their iteration state). If you want a _stateful_ schedule, you can use `ParameterSchedulers.Stateful`:
+```julia
+using ParameterSchedulers: Stateful, next!
+
+schedule = Stateful(Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10))
+for epoch in 1:100
+  Flux.adjust!(opt_state, next!(schedule))
+  # your training code here
+end
+```
+
+Finally, a scheduling function can be incorporated into the optimiser's state, advanced at each gradient update step, and possibly passed to the `train!` function. See [this section](https://fluxml.ai/ParameterSchedulers.jl/stable/tutorials/optimizers/#Working-with-Flux-optimizers) of the ParameterSchedulers.jl documentation for more details.
+
+ParameterSchedulers.jl allows for many more scheduling policies including arbitrary functions, looping any function with a given period, or sequences of many schedules. See the [ParameterSchedulers.jl documentation](https://fluxml.ai/ParameterSchedulers.jl/stable) for more info.
+
 ## Freezing layer parameters
 
 To completely disable training of some part of the model, use [`freeze!`](@ref Flux.freeze!).
diff --git a/docs/src/reference/training/optimisers.md b/docs/src/reference/training/optimisers.md
index e70454deda..e441602751 100644
--- a/docs/src/reference/training/optimisers.md
+++ b/docs/src/reference/training/optimisers.md
@@ -67,36 +67,6 @@ It is possible to compose optimisers for some added flexibility.
 Optimisers.OptimiserChain
 ```
 
-## Scheduling Optimisers
-
-In practice, it is fairly common to schedule the learning rate of an optimiser to obtain faster convergence. There are a variety of popular scheduling policies, and you can find implementations of them in [ParameterSchedulers.jl](http://fluxml.ai/ParameterSchedulers.jl/stable). The documentation for ParameterSchedulers.jl provides a more detailed overview of the different scheduling policies, and how to use them with Flux optimisers. Below, we provide a brief snippet illustrating a [cosine annealing](https://arxiv.org/pdf/1608.03983.pdf) schedule with a momentum optimiser.
-
-First, we import ParameterSchedulers.jl and initialize a cosine annealing schedule to vary the learning rate between `1e-4` and `1e-2` every 10 steps. We also create a new [`Momentum`](@ref Optimisers.Momentum) optimiser.
-```julia
-using ParameterSchedulers
-
-opt = Momentum()
-schedule = Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10)
-for (eta, epoch) in zip(schedule, 1:100)
-  opt.eta = eta
-  # your training code here
-end
-```
-`schedule` can also be indexed (e.g. `schedule(100)`) or iterated like any iterator in Julia.
-
-ParameterSchedulers.jl schedules are stateless (they don't store their iteration state). If you want a _stateful_ schedule, you can use `ParameterSchedulers.Stateful`:
-```julia
-using ParameterSchedulers: Stateful, next!
-
-schedule = Stateful(Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10))
-for epoch in 1:100
-  opt.eta = next!(schedule)
-  # your training code here
-end
-```
-
-ParameterSchedulers.jl allows for many more scheduling policies including arbitrary functions, looping any function with a given period, or sequences of many schedules. See the ParameterSchedulers.jl documentation for more info.
-
 ## Decays
 
 Similar to optimisers, Flux also defines some simple decays that can be used in conjunction with other optimisers, or standalone.
@@ -111,7 +81,7 @@ Optimisers.WeightDecay
 
 Gradient clipping is useful for training recurrent neural networks, which have a tendency to suffer from the exploding gradient problem. An example usage is
 ```julia
-opt = OptimiserChain(ClipValue(1e-3), Adam(1e-3))
+opt = OptimiserChain(ClipGrad(1e-3), Adam(1e-3))
 ```
 
 ```@docs
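
As a quick sanity check of the pattern the relocated section documents, the snippets above combine roughly as follows. This is only a sketch: the toy `model`, `data`, and `mse` loss are assumptions chosen for illustration and do not appear in the patch; the `Cos` schedule, `Flux.setup`, `Flux.adjust!`, and `Flux.train!` calls mirror the moved text.

```julia
using Flux, ParameterSchedulers

# Toy model and data -- illustrative assumptions only, not part of the patch.
model = Chain(Dense(4 => 8, relu), Dense(8 => 1))
data = [(rand(Float32, 4, 16), rand(Float32, 1, 16)) for _ in 1:10]

# Set up the optimiser state once, as in the moved section.
opt_state = Flux.setup(Momentum(), model)

# Cosine annealing: learning rate oscillates between 1e-4 and 1e-2 with period 10.
schedule = Cos(λ0 = 1e-4, λ1 = 1e-2, period = 10)

for (eta, epoch) in zip(schedule, 1:100)
  Flux.adjust!(opt_state, eta)            # set this epoch's learning rate on the whole state
  Flux.train!(model, data, opt_state) do m, x, y
    Flux.mse(m(x), y)                     # any loss works here
  end
end
```

The substantive change in the moved text is that, with the `Flux.setup`/`opt_state` interface, the learning rate lives inside the optimiser state, so it is updated with `Flux.adjust!` rather than by mutating `opt.eta` as the removed snippet did.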