There are some config parameters related to warmup in the solver (e.g. `WARMUP_METHOD`). What is warmup, and how does it work?
The paper *Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour* gives a good explanation of why warmup is needed and describes different warmup strategies.
## Why do we need warmup
Suppose that we use learning rate $\eta$ with batch size $n$ on a single GPU. When we train the network on 8 GPUs, the batch size becomes $8n$, and the learning rate also needs to change to suit the distributed training scenario. The authors find that in practice, linearly scaling the learning rate with the batch size works quite well. For example, if we use an initial learning rate of 0.01 for one GPU, we would use an initial learning rate of $0.01 \times 8 = 0.08$ for distributed training on 8 GPUs.
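As a quick illustration of the rule (the numbers below are just the example values from above):

```python
# Linear scaling rule: scale the single-GPU learning rate by the factor
# by which the effective batch size grew.
base_lr = 0.01   # learning rate tuned for batch size n on a single GPU
num_gpus = 8     # effective batch size becomes 8n
scaled_lr = base_lr * num_gpus
print(scaled_lr)  # 0.08
```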
However, for linear scaling of the learning rate to work, certain conditions have to be met (see section 2.1 of the paper for details). In the initial training stage, the network parameters change rapidly, so the condition that makes linear scaling work no longer holds. Warmup is used to tackle this issue.
The basic idea is to use a smaller learning rate than the value calculated by the linear scaling policy. The paper describes two warmup strategies (sketched in code after this list):
- constant: Use a learning rate lower than 0.08 for the first few epochs.
- gradual: In the first few epochs, the learning rate starts below 0.08 and increases gradually, approaching 0.08 as the epoch number increases.
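Here is a minimal sketch of the two strategies; the function names and signatures are my own for illustration, not maskrcnn-benchmark's API:

```python
def constant_warmup_lr(target_lr, warmup_factor, cur_iter, warmup_iters):
    # Constant warmup: use a fixed, low learning rate during warmup.
    if cur_iter < warmup_iters:
        return target_lr * warmup_factor
    return target_lr


def gradual_warmup_lr(target_lr, warmup_factor, cur_iter, warmup_iters):
    # Gradual warmup: ramp linearly from warmup_factor * target_lr
    # up to target_lr over warmup_iters iterations.
    if cur_iter < warmup_iters:
        alpha = cur_iter / warmup_iters
        return target_lr * (warmup_factor * (1 - alpha) + alpha)
    return target_lr
```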
In maskrcnn-benchmark, a linear warmup strategy is used to control the warmup factor in the initial training stage.
After the warmup phase, the learning rate schedule returns to normal (you can change the learning rate based on the task at hand).
## How does linear warmup work in maskrcnn-benchmark
The warmup method used by maskrcnn-benchmark is implemented in its learning rate scheduler:
```python
def get_lr(self):
    warmup_factor = 1
    if self.last_epoch < self.warmup_iters:
        if self.warmup_method == "constant":
            warmup_factor = self.warmup_factor
        elif self.warmup_method == "linear":
            alpha = float(self.last_epoch) / self.warmup_iters
            warmup_factor = self.warmup_factor * (1 - alpha) + alpha
    return [
        base_lr
        * warmup_factor
        * self.gamma ** bisect_right(self.milestones, self.last_epoch)
        for base_lr in self.base_lrs
    ]
```
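To see the schedule in action, here is a self-contained sketch that wraps the `get_lr` logic above in a minimal scheduler. This is a stripped-down stand-in for the library's scheduler class, with illustrative hyperparameter values:

```python
from bisect import bisect_right

import torch


class WarmupMultiStepLR(torch.optim.lr_scheduler._LRScheduler):
    # Stripped-down stand-in for maskrcnn-benchmark's scheduler,
    # reusing the get_lr logic shown above.
    def __init__(self, optimizer, milestones, gamma=0.1,
                 warmup_factor=1.0 / 3, warmup_iters=500,
                 warmup_method="linear", last_epoch=-1):
        self.milestones = milestones
        self.gamma = gamma
        self.warmup_factor = warmup_factor
        self.warmup_iters = warmup_iters
        self.warmup_method = warmup_method
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        warmup_factor = 1
        if self.last_epoch < self.warmup_iters:
            if self.warmup_method == "constant":
                warmup_factor = self.warmup_factor
            elif self.warmup_method == "linear":
                alpha = float(self.last_epoch) / self.warmup_iters
                warmup_factor = self.warmup_factor * (1 - alpha) + alpha
        return [
            base_lr
            * warmup_factor
            * self.gamma ** bisect_right(self.milestones, self.last_epoch)
            for base_lr in self.base_lrs
        ]


model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.08)  # base lr 0.08
scheduler = WarmupMultiStepLR(optimizer, milestones=[60000, 80000],
                              warmup_iters=500)

for it in range(3):
    # lr starts at 0.08 / 3 ~ 0.0267 and ramps toward 0.08
    print(it, optimizer.param_groups[0]["lr"])
    optimizer.step()
    scheduler.step()
```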
In the above code:

- `self.last_epoch` is the current training iteration (maskrcnn-benchmark uses iterations instead of the usual epochs to measure training progress).
- `self.warmup_iters` is the number of iterations for warmup in the initial training stage.
- `self.warmup_factor` is a constant (0.333 in maskrcnn-benchmark's default config).

Only when the current iteration number is below `self.warmup_iters` will the `warmup_factor` take effect. Otherwise, it is 1 and does not affect the learning rate.
When the current iteration is below `warmup_iters` and the warmup method is `linear`, the warmup factor is calculated as follows (this is just `warmup_factor * (1 - alpha) + alpha` from the code, with `warmup_factor = 0.333` and `alpha = current_iter / warmup_iters`):

warmup_factor = 0.667 * (current_iter / warmup_iters) + 0.333
So as the current iteration approaches `warmup_iters`, the warmup factor gradually approaches 1. As a result, the learning rate used approaches the base learning rate.
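Plugging a few iteration counts into the formula confirms the ramp (using an illustrative `warmup_iters` of 500):

```python
# Quick numeric check of the linear warmup formula; 500 is just an
# example value for warmup_iters.
warmup_iters = 500
for current_iter in (0, 250, 499, 500):
    warmup_factor = 0.667 * (current_iter / warmup_iters) + 0.333
    print(f"iter {current_iter}: warmup_factor = {warmup_factor:.3f}")
# iter 0: warmup_factor = 0.333  (lr starts at a third of the base lr)
# iter 250: warmup_factor = 0.667
# iter 499: warmup_factor = 0.999
# iter 500: warmup_factor = 1.000 (warmup is over; normal schedule resumes)
```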
License CC BY-NC-ND 4.0