titu1994/keras-adabound

suggestion: allow training 2x or 3x bigger networks on the same VRAM with the TF backend

Closed this issue · 13 comments

Same as my PR keras-team/keras-contrib#478.
Works only with the TF backend.

from keras import backend as K
from keras.optimizers import Optimizer


class AdaBound(Optimizer):
    """AdaBound optimizer.
    Default parameters follow those provided in the original paper.
    # Arguments
        lr: float >= 0. Learning rate.
        final_lr: float >= 0. Final learning rate.
        beta_1: float, 0 < beta < 1. Generally close to 1.
        beta_2: float, 0 < beta < 1. Generally close to 1.
        gamma: float >= 0. Convergence speed of the bound function.
        epsilon: float >= 0. Fuzz factor. If `None`, defaults to `K.epsilon()`.
        decay: float >= 0. Learning rate decay over each update.
        weight_decay: Weight decay weight.
        amsbound: boolean. Whether to apply the AMSBound variant of this
            algorithm.
        tf_cpu_mode: int. Only used with the TensorFlow backend.
              0 - default, no changes.
              1 - allows training a ~2x bigger network on the same VRAM,
                  at the cost of system RAM.
              2 - allows training a ~3x bigger network on the same VRAM,
                  at the cost of ~2x system RAM and extra CPU time.
    # References
        - [Adaptive Gradient Methods with Dynamic Bound of Learning Rate]
          (https://openreview.net/forum?id=Bkg3g2R9FX)
        - [Adam - A Method for Stochastic Optimization]
          (https://arxiv.org/abs/1412.6980v8)
        - [On the Convergence of Adam and Beyond]
          (https://openreview.net/forum?id=ryQu7f-RZ)
    """

    def __init__(self, lr=0.001, final_lr=0.1, beta_1=0.9, beta_2=0.999, gamma=1e-3,
                 epsilon=None, decay=0., amsbound=False, weight_decay=0.0, tf_cpu_mode=0, **kwargs):
        super(AdaBound, self).__init__(**kwargs)

        if not 0. <= gamma <= 1.:
            raise ValueError("Invalid `gamma` parameter. Must lie in [0, 1] range.")

        with K.name_scope(self.__class__.__name__):
            self.iterations = K.variable(0, dtype='int64', name='iterations')
            self.lr = K.variable(lr, name='lr')
            self.beta_1 = K.variable(beta_1, name='beta_1')
            self.beta_2 = K.variable(beta_2, name='beta_2')
            self.decay = K.variable(decay, name='decay')

        self.final_lr = final_lr
        self.gamma = gamma

        if epsilon is None:
            epsilon = K.epsilon()
        self.epsilon = epsilon
        self.initial_decay = decay
        self.amsbound = amsbound

        self.weight_decay = float(weight_decay)
        self.base_lr = float(lr)
        self.tf_cpu_mode = tf_cpu_mode

    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)
        self.updates = [K.update_add(self.iterations, 1)]

        lr = self.lr
        if self.initial_decay > 0:
            lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
                                                      K.dtype(self.decay))))

        t = K.cast(self.iterations, K.floatx()) + 1

        # Applies bounds on actual learning rate
        step_size = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) /
                          (1. - K.pow(self.beta_1, t)))

        final_lr = self.final_lr * lr / self.base_lr
        lower_bound = final_lr * (1. - 1. / (self.gamma * t + 1.))
        upper_bound = final_lr * (1. + 1. / (self.gamma * t))

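        # tf_cpu_mode >= 1: create the slot variables (m, v, vhat) on the host
        # so they do not occupy GPU memory; the device context is entered and
        # exited manually, and is skipped entirely when tf_cpu_mode == 0.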
        e = K.tf.device("/cpu:0") if self.tf_cpu_mode > 0 else None
        if e: e.__enter__()
        ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        if self.amsbound:
            vhats = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        else:
            vhats = [K.zeros(1) for _ in params]
        if e: e.__exit__(None, None, None)
        
        self.weights = [self.iterations] + ms + vs + vhats

        for p, g, m, v, vhat in zip(params, grads, ms, vs, vhats):
            # apply weight decay
            if self.weight_decay != 0.:
                g += self.weight_decay * K.stop_gradient(p)

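            # tf_cpu_mode == 2: also compute the moment updates on the CPU;
            # the bound computation and the parameter update below stay on
            # the default (GPU) device.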
            e = K.tf.device("/cpu:0") if self.tf_cpu_mode == 2 else None
            if e: e.__enter__()                    
            m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
            v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(g)
            if self.amsbound:
                vhat_t = K.maximum(vhat, v_t)
                self.updates.append(K.update(vhat, vhat_t))
            if e: e.__exit__(None, None, None)
            
            if self.amsbound:
                denom = (K.sqrt(vhat_t) + self.epsilon)
            else:
                denom = (K.sqrt(v_t) + self.epsilon)                        

            # Compute the bounds
            step_size_p = step_size * K.ones_like(denom)
            step_size_p_bound = step_size_p / denom
            bounded_lr_t = m_t * K.minimum(K.maximum(step_size_p_bound,
                                                     lower_bound), upper_bound)

            p_t = p - bounded_lr_t

            self.updates.append(K.update(m, m_t))
            self.updates.append(K.update(v, v_t))
            new_p = p_t

            # Apply constraints.
            if getattr(p, 'constraint', None) is not None:
                new_p = p.constraint(new_p)

            self.updates.append(K.update(p, new_p))
        return self.updates

    def get_config(self):
        config = {'lr': float(K.get_value(self.lr)),
                  'final_lr': float(self.final_lr),
                  'beta_1': float(K.get_value(self.beta_1)),
                  'beta_2': float(K.get_value(self.beta_2)),
                  'gamma': float(self.gamma),
                  'decay': float(K.get_value(self.decay)),
                  'epsilon': self.epsilon,
                  'weight_decay': self.weight_decay,
                  'amsbound': self.amsbound}
        base_config = super(AdaBound, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
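
For reference, a minimal usage sketch; the toy model and random data below are illustrative placeholders and not part of this issue:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# tf_cpu_mode=1 keeps the optimizer slot variables in host RAM;
# tf_cpu_mode=2 additionally runs the moment updates on the CPU.
model = Sequential([Dense(64, activation='relu', input_shape=(32,)),
                    Dense(1)])
model.compile(loss='mse',
              optimizer=AdaBound(lr=1e-3, final_lr=0.1, tf_cpu_mode=1))
model.fit(np.random.rand(256, 32), np.random.rand(256, 1),
          batch_size=8, epochs=2)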

While this is an interesting application of device placement for larger models, the cost is in training time.

Your moving-average weights are on the CPU, whereas the gradients of every parameter are on the GPU. With your device blocks, you are effectively shuttling GPU gradients to the CPU, performing the op there, and then shuttling the result back onto the GPU.

This has several issues :

  1. Shuttling gradients between the GPU and CPU for large models with millions of parameters happens every batch. This costs too much time.
  2. The CPU must perform multiple tasks: multiprocess data loading (if the images come from ImageNet or an external source in general), batching, and shuffling, and now it must also incur the cost of synchronizing gradients and performing CPU ops on large matrices. This will bottleneck the IO pipeline, and as IO is generally the major bottleneck anyway, this is highly inefficient.

This is fine when one is willing to pay the price in compute time for larger models, but it is not feasible in the general case.

EDIT:
Why not then just force the entire optimizer onto the CPU device, since you incur the cost of device shuttling anyway? That way, at least the CPU ops can be streamlined.
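
A minimal sketch of that alternative, assuming the TF backend; the CPUAdaBound subclass below is illustrative only and not code from this repo:

import tensorflow as tf

class CPUAdaBound(AdaBound):
    """Illustrative: pin every op this optimizer creates to the CPU."""

    def get_updates(self, loss, params):
        # Slot variables and update ops are all created under /cpu:0,
        # so nothing the optimizer owns occupies GPU memory.
        with tf.device('/cpu:0'):
            return super(CPUAdaBound, self).get_updates(loss, params)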

In addition, gradient checkpointing follows a similar line of thought, recomputing intermediate activations on demand rather than preserving them in GPU RAM. You could look into that if memory is the bottleneck and time is not a consideration.
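
A rough illustration of that idea, assuming a TF version that provides tf.recompute_grad; heavy_block is a hypothetical sub-network, not anything from this issue:

import tensorflow as tf

@tf.recompute_grad
def heavy_block(x, w):
    # Activations produced inside this block are recomputed during the
    # backward pass instead of being kept in GPU memory.
    for _ in range(8):
        x = tf.nn.relu(tf.matmul(x, w))
    return x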

I have already tested it and applied it in my DeepFaceLab project (deepfakes).
batch size 8 - 128x128 is the maximum face model (~500MB of model files) for my 6 GB card with tf_cpu_mode=0
batch size 4 - 256x256 (~1000MB of model files) with tf_cpu_mode=1, ~10% slower, but that is because the model is bigger
batch size 8 - 256x256 (~1000MB of model files) with tf_cpu_mode=2, ~30% slower

So this approach brings deepfakes into a new era.

If you don't like it, just close it :)
I just wanted to share the find.

Trying AdaBound right after Adam: same lr, but final_lr = lr * 100.

History of the last 5k iterations:
NSFW pic

interesting :)

I believe you will find similar results by simply marking the entire optimizer to lie on the CPU, but I'm glad you found a good alternative. I'll probably review the PR on keras-contrib sometime if it gets merged.

I must ask you to remove the image though.

This is fine when one is willing to pay the price in compute time for larger models, but it is not feasible in the general case.

Batch size is a very important parameter for GAN networks. So by moving the optimizer's weights out of VRAM, we can train with a higher batch size, sacrificing 10-20% of the time per iteration.
Also, I cannot see a noticeable performance loss on my Coffee Lake machine with 32 GB of 2400 MHz RAM.

Sure, if one can disregard the additional training time, then your approach is fine. I won't be merging it into this repo, since I keep 1:1 equivalence with Keras proper.

Btw, a slight question: why not place the

            if self.amsbound:
                denom = (K.sqrt(vhat_t) + self.epsilon)
            else:
                denom = (K.sqrt(v_t) + self.epsilon)                        

            # Compute the bounds
            step_size_p = step_size * K.ones_like(denom)

inside the CPU block as well? That would offer even more memory savings, since you don't need a K.ones_like() on the GPU then.
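
Roughly, the loop body would then look like this (same variable names as in get_updates() above; an untested sketch, not a patch):

e = K.tf.device("/cpu:0") if self.tf_cpu_mode == 2 else None
if e: e.__enter__()
m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(g)
if self.amsbound:
    vhat_t = K.maximum(vhat, v_t)
    self.updates.append(K.update(vhat, vhat_t))
    denom = K.sqrt(vhat_t) + self.epsilon
else:
    denom = K.sqrt(v_t) + self.epsilon
# K.ones_like(denom) is now also created on the CPU.
step_size_p = step_size * K.ones_like(denom)
step_size_p_bound = step_size_p / denom
if e: e.__exit__(None, None, None)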

Should be tested, thanks for the tip.

I must ask you to remove the image though.

And why? Does your religion not allow looking at the faces of women? :0

Informed consent. If someone is casually browsing, they should not be shown random NSFW content unless it is behind a link that clearly states the content is NSFW, so that they are implicitly responsible for viewing it at their own discretion.

I did not know that a bunch of ordinary women's faces is not safe for work.