In this tutorial we are going to look at the PolyLRScheduler in the timm library.
PolyLRScheduler is very similar to CosineLRScheduler and TanhLRScheduler; the difference is that PolyLRScheduler uses a polynomial function to anneal the learning rate.
It is cyclic, and supports warmup, noise, and k-decay.
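Ignoring cycles, warmup, noise and k-decay, the annealed learning rate roughly follows a polynomial curve; a minimal sketch of that curve (not the library's actual implementation) could look like:
def poly_anneal(t, t_initial, lr_max, lr_min=0.0, power=0.5):
    # Polynomial annealing from lr_max down to lr_min over t_initial epochs.
    frac = min(t / t_initial, 1.0)  # progress through the schedule, in [0, 1]
    return lr_min + (lr_max - lr_min) * (1 - frac) ** power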
The schedule looks something like:
To train models using the PolyLRScheduler we simply pass --sched poly to the training script, alongside the necessary hyperparameters. In this section we will also look at how each of these hyperparameters affects the PolyLRScheduler.
A training command using the PolyLRScheduler looks something like:
python train.py ../imagenette2-320/ --sched poly
Available parameters are:
--epochs - initial number of epochs to train (default: 300)
--lr - learning rate (default: 0.05)
--min-lr - lower lr bound for cyclic schedulers that hit 0 (default: 1e-5)
--lr-k-decay - learning rate k-decay for cosine/poly (default: 1.0)
--decay-rate - polynomial power (default: 0.1)
cycle parameters:
--lr-cycle-limit - learning rate cycle limit, cycles enabled if > 1
--lr-cycle-decay - amount to decay each learning rate cycle (default: 0.5)
--lr-cycle-mul - learning rate cycle len multiplier (default: 1.0)
warmup parameters:
--warmup-lr - warmup learning rate (default: 0.0001)
--warmup-epochs - epochs to warmup LR, if scheduler supports (default: 3)
noise parameters:
--lr-noise - learning rate noise on/off epoch percentages
--lr-noise-pct - learning rate noise limit percent (default: 0.67)
--seed - random seed (default: 42), used to seed the noise generator
Note! PolyLRScheduler is a cyclic scheduler, so the real number of training epochs can differ from the --epochs number!
If we launch the script with default settings, it will train for 310 epochs - 300 from the default --epochs plus 10 from the default --cooldown-epochs.
If we launch the script with --epochs 50 --lr-cycle-limit 2, it will train for 110 epochs - two cycles of 50 epochs plus 10 for cooldown.
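We can check this arithmetic directly with the scheduler's get_cycle_length() method. A small sketch, assuming a dummy model and optimizer created purely for illustration:
import torch
from timm.scheduler.poly_lr import PolyLRScheduler

model = torch.nn.Linear(10, 2)                           # dummy model, only to build an optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

scheduler = PolyLRScheduler(optimizer, t_initial=50, cycle_limit=2)
cooldown_epochs = 10                                     # train script default for --cooldown-epochs
print(scheduler.get_cycle_length() + cooldown_epochs)    # 110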
PolyLRScheduler accepts two required arguments - an optimizer and t_initial - plus a number of hyperparameters which we will look at in detail below.
Basic usage looks like this:
from timm.scheduler.poly_lr import PolyLRScheduler
scheduler = PolyLRScheduler(optimizer, t_initial=num_epoch)
optimizer is an instance of torch.optim.Optimizer.
t_initial - the initial number of epochs to train. Note that the actual number of training epochs will differ from t_initial when cycle arguments are used; see the detailed explanation and examples below.
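The examples below rely on a dummy optimizer and a small plotting helper; neither is part of timm. A minimal sketch, assuming the scheduler is queried once per epoch via get_epoch_values(), might look like:
import matplotlib.pyplot as plt
import torch
from timm.scheduler.poly_lr import PolyLRScheduler

model = torch.nn.Linear(10, 2)                           # dummy model, only to build an optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
t_initial = 50

def calculate_lr(scheduler, num_epoch=None):
    # Ask the scheduler for its per-epoch values without stepping the optimizer.
    num_epoch = num_epoch or scheduler.get_cycle_length() + 10
    return [scheduler.get_epoch_values(epoch) for epoch in range(num_epoch)]

def plot_lr(scheduler, label=None):
    # get_epoch_values returns one value per param group (or None); plot the first group.
    lrs = [v[0] if v is not None else None for v in calculate_lr(scheduler)]
    plt.plot(lrs, label=label)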
Default schedule:
scheduler = PolyLRScheduler(optimizer, t_initial=50)
plot_lr(scheduler)
"Power" of polynomial function, default is 0.5.\
Note, when you start training script, power
sets by --decay-rate
parameter, that default is 0.1\
When power=1
annealing is linear.
Lets look at default and compare with 1. and 2.
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial)
plot_lr(scheduler, label='power=0.5, default')
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, power=1)
plot_lr(scheduler, label='power=1')
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, power=2)
plot_lr(scheduler, label='power=2')
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, power=0.1)
plot_lr(scheduler, label='power=0.1, default from train script')
plt.legend();
lr_min is the lower bound for the learning rate; the default is 0.
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial)
plot_lr(scheduler, label='default')
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, lr_min=0.01)
plot_lr(scheduler, label='lr_min=0.01')
plt.legend();
k_decay is the k-decay rate (default: 1.0). It changes the shape of the annealing curve: larger values keep the learning rate higher for longer before a sharper drop, while smaller values decay it earlier, as the plots below show.
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial)
plot_lr(scheduler, label='default')
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, k_decay=2.)
plot_lr(scheduler, label='k_decay=2.')
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, k_decay=.5)
plot_lr(scheduler, label='k_decay=0.5')
plt.legend();
cycle_limit sets the number of cycles. Note that the full number of training epochs will differ from t_initial when cycle_limit is greater than 1; get_cycle_length() returns the real schedule length, as shown below.
t_initial = 50
print(f"{t_initial=}")
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, cycle_limit=2)
plot_lr(scheduler)
total_epochs = scheduler.get_cycle_length()
print(f"{total_epochs=}")
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, cycle_limit=3)
plot_lr(scheduler)
total_epochs = scheduler.get_cycle_length()
print(f"{t_initial=}")
print(f"{total_epochs=}")
When cycle_decay is greater than 0 and less than 1, the starting learning rate of each new cycle is decayed to lr * cycle_decay. So with cycle_decay=0.5, each cycle starts at half the previous cycle's learning rate: for example, with lr=0.05 the second cycle starts from 0.025 and the third from 0.0125.
The default is 1.0, which means no decay.
t_initial = 50
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, cycle_limit=3)
plot_lr(scheduler, label='default, 1')
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, cycle_limit=3, cycle_decay=0.5)
plot_lr(scheduler, label="cycle_decay=0.5")
plt.legend();
cycle_mul is the cycle length multiplier. So if cycle_mul=2, each cycle is twice as long as the previous one.
t_initial = 50
print(f"{t_initial=}")
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, cycle_limit=2, cycle_mul=2)
total_epochs_2cycles = scheduler.get_cycle_length()
print(f"{total_epochs_2cycles=}")
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, cycle_limit=3, cycle_mul=2)
total_epochs_3cycles = scheduler.get_cycle_length()
print(f"{total_epochs_3cycles=}")
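With cycle_mul greater than 1, the cycle lengths form a geometric series, so (roughly, following get_cycle_length()) the total length over n cycles is about t_initial * (cycle_mul**n - 1) / (cycle_mul - 1): here 50 * (2**2 - 1) / (2 - 1) = 150 epochs for two cycles and 50 * (2**3 - 1) / (2 - 1) = 350 for three.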
t_initial = 50
cycle_limit = 3
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, cycle_limit=cycle_limit)
plot_lr(scheduler, label='default, 1')
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, cycle_limit=cycle_limit, cycle_mul=1.5)
plot_lr(scheduler, label="cycle_mul=1.5")
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, cycle_limit=cycle_limit, cycle_mul=2)
plot_lr(scheduler, label="cycle_mul=2")
plt.legend();
warmup_t defines the number of warmup epochs.
warmup_lr_init is the initial learning rate during warmup; the default is 0.
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial)
plot_lr(scheduler, label='default')
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, warmup_t=2)
plot_lr(scheduler, label='warmup, default warmup_lr_init')
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, warmup_t=2, warmup_lr_init=0.05)
plot_lr(scheduler, label='warmup, warmup_lr_init=0.05')
plt.legend();
As we can see, by setting warmup_t and warmup_lr_init the scheduler first starts at warmup_lr_init, then over warmup_t epochs gradually increases the learning rate up to the scheduled value at epoch warmup_t + 1.
If warmup_prefix is True, the warmup epochs are prepended to the schedule, so annealing starts from the initial LR value after the warmup ends.
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial)
plot_lr(scheduler, label='no warmup')
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, warmup_t=10, warmup_prefix=True)
plot_lr(scheduler, label='warmup_prefix=True')
scheduler = PolyLRScheduler(optimizer, t_initial=t_initial, warmup_t=10)
plot_lr(scheduler, label='warmup_prefix=False')
plt.legend();
noise_range_t: if it is a number, it is the epoch at which noise starts; if it is a list or tuple of two elements, they give the epoch range over which noise is applied.
noise_pct: the percentage of noise to add, which bounds the noise from above and below (default: 0.67).
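The noisy plots use another helper that is not part of timm. Since the noise is applied inside the scheduler's step(), a minimal sketch can step through the epochs and read the resulting learning rate back from the optimizer:
def plot_noisy_lr(scheduler, label=None, num_epoch=None):
    # step() applies the scheduled (and possibly noisy) value to the optimizer's param groups.
    num_epoch = num_epoch or scheduler.get_cycle_length() + 10
    lrs = []
    for epoch in range(num_epoch):
        scheduler.step(epoch)
        lrs.append(scheduler.optimizer.param_groups[0]['lr'])
    plt.plot(lrs, label=label)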
scheduler = PolyLRScheduler(optimizer, t_initial=100, noise_range_t=60)
plot_noisy_lr(scheduler, label='noise_pct=0.67, default')
scheduler = PolyLRScheduler(optimizer, t_initial=100, noise_range_t=[10, 40], noise_pct=0.2)
plot_noisy_lr(scheduler, label='noise_pct=0.2')
plt.legend();
noise_std is the noise standard deviation; it is currently not used.
noise_seed is the seed used for the random noise generator (default: 42).
t_in_epochs: if set to False, the scheduler operates on update steps rather than epochs, and the learning rates returned for epoch t are None.
scheduler = PolyLRScheduler(optimizer, t_initial=5, t_in_epochs=False)
lr_per_epoch = calculate_lr(scheduler)
lr_per_epoch[:5]
initialize: if True, a new field called initial_{field_name} is set inside each param group of the optimizer, where field_name refers to the field in the param group that we are scheduling - typically field_name='lr'.
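As a quick check (with a fresh dummy optimizer, created here purely for illustration), the new field can be inspected directly on the param groups:
opt = torch.optim.SGD(torch.nn.Linear(10, 2).parameters(), lr=0.05)
PolyLRScheduler(opt, t_initial=50)        # initialize=True is the default
print(opt.param_groups[0]['initial_lr'])  # 0.05, copied from the group's 'lr'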