# Warmup in Maskrcnn-benchmark and how does it work?

## Contents

In maskrcnn-benchmark,
there is some config parameters about warmup in solver (`WARMUP_FACTOR`

,
`WARMUP_ITERS`

, `WARMUP_METHOD`

). What is warmup, and how does it work?

The paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour gives a good explanation of why warmup is needed and explains different strategies of warmup.

## why do we need warmup

Suppose that we use learning rate $\eta$ on a single GPU with batch size $n$, when we train the network on 8 GPUs, now the batch size becomes $8n$. The learning rate also needs to change to suit the distributed training scenario. The author find that in practice, the linear scaling of learning rate works pretty well. For example, when we use initial learning rate 0.01 for one GPU, we may use an initial learning rate of 0.08 for distributed training, i.e., 0.01*8.

However, to use linear scaling of learning rate, certain condition have to be
met (see section 2.1 of the paper for details). On the initial training stage,
due to the rapid change of network parameters, the condition that makes linear
scaling work does not hold any more. So, in the initial training stage, the
author propose `warmup`

to tackle this issue.

The basic idea is that we should use a small learning rate than the value calculate by linear scaling policy. There are two strategies for warmup:

**constant**: Use a low learning rate than 0.08 for the initial few epochs.**gradual**: In the first few epochs, the learning rate is set to be lower than 0.08 and increased gradually to approach 0.08 as epoch number increases. In maskrcnn, a`linear`

warmup strategy is used for control warmup factor in the initial learning stage.

After the warmup epochs, the learning rate strategy would return to normal (you can change the learning rate based on the task at hand).

## How does linear warmup work in maskrcnn

The warmup method used by maskrcnn-benchmark can be found here:

```
def get_lr(self):
warmup_factor = 1
if self.last_epoch < self.warmup_iters:
if self.warmup_method == "constant":
warmup_factor = self.warmup_factor
elif self.warmup_method == "linear":
alpha = float(self.last_epoch) / self.warmup_iters
warmup_factor = self.warmup_factor * (1 - alpha) + alpha
return [
base_lr
* warmup_factor
* self.gamma ** bisect_right(self.milestones, self.last_epoch)
for base_lr in self.base_lrs
]
```

In the above code, `self.last_epoch`

is the current training iteration (because
maskrcnn-benchmark use iteration instead of the usual epoch to measure the
training process). `self.warmup_iters`

is the number of iterations for warmup
in the initial training stage. `self.warmup_factors`

are a constant (0.333 in
this case).

Only when current iteration number is below `self.warmup_iters`

, will the
`warmup_factor`

be used. Otherwise, it will be 1 and not affect the learning
rate.

When current iteration is below `warmup_iters`

and warmup method is `linear`

.
The warmup factor used is calculated as follows:

```
warmp_factor = 0.667 * (current_iter/warmup_iters) + 0.333
```

So as current iteration approaches `warmup_iters`

, `warmup_factor`

will
gradually approach 1. As a result, the learning rate used will approach base
learning rate.

Ref: