AndreaCodegoni/Tiny_model_4_CD

Not an issue, just asking for help

cl886699 opened this issue · 5 comments

Hi, I want to rewrite this model using TensorFlow, but I am running into some problems: the loss only drops to around 0.34 and the output is all zeros. Here is the model code:

import tensorflow as tf
from typing import List
import tensorflow_addons as tfa
from tensorflow.keras.layers import Conv2D, PReLU, UpSampling2D, Activation
from tensorflow.keras import Sequential, Model
import tensorflow.keras.applications as app


class PixelwiseLinear(Model):
    def __init__(
        self,
        fout: List[int],
        last_activation: Model = None,
    ) -> None:
        super().__init__()
        n = len(fout)
        self._linears = Sequential(
            [
                Sequential(
                    [Conv2D(fout[i], kernel_size=1, use_bias=True, kernel_initializer="he_normal"),
                    PReLU(shared_axes=[0, 1, 2, 3], alpha_initializer=tf.initializers.constant(0.25)) if i < n - 1 or last_activation is None else last_activation
                     ]
                )
                for i in range(n)
            ]
        )

    def call(self, x):
        # Processing the tensor:
        return self._linears(x)


class MixingBlock(Model):
    def __init__(
        self,
        ch_out: int,
    ):
        super().__init__()
        self._convmix = Sequential(
            [Conv2D(ch_out, 3, groups=ch_out, padding="SAME", kernel_initializer="he_normal"),
            PReLU(shared_axes=[0, 1, 2, 3], alpha_initializer=tf.initializers.constant(0.25)),
            tfa.layers.InstanceNormalization(center=False, scale=False, epsilon=1e-5)]
        )

    def call(self, x, y):
        # Packing the tensors and interleaving the channels:

        mixed = tf.stack([x, y], axis=1)
        mixed = tf.reshape(mixed, (tf.shape(x)[0], tf.shape(x)[1], tf.shape(x)[2], -1))

        # Mixing:
        return self._convmix(mixed)


class MixingMaskAttentionBlock(Model):
    """use the grouped convolution to make a sort of attention"""

    def __init__(
        self,
        ch_out: int,
        fout: List[int],
        generate_masked: bool = False,
    ):
        super().__init__()
        self._mixing = MixingBlock(ch_out)
        self._linear = PixelwiseLinear(fout)
        # self._final_normalization = tfa.layers.InstanceNormalization(center=False, scale=False, epsilon=1e-5) if generate_masked else None
        # self._mixing_out = MixingBlock(ch_out) if generate_masked else None

    def call(self, x, y):
        z_mix = self._mixing(x, y)
        z = self._linear(z_mix)
        return z
        # z_mix_out = 0 if self._mixing_out is None else self._mixing_out(x, y)

        # return (
        #     z
        #     if self._final_normalization is None
        #     else self._final_normalization(z_mix_out * z)
        # )


class UpMask(Model):
    def __init__(
        self,
        up_dimension: int,
        nin: int,
        nout: int,
    ):
        super().__init__()
        self._upsample = UpSampling2D(size=(up_dimension, up_dimension), interpolation="bilinear")
        self._convolution = Sequential(
            [Conv2D(nin, 3, 1, groups=nin, padding="SAME", kernel_initializer="he_normal"),
            PReLU(shared_axes=[0, 1, 2, 3], alpha_initializer=tf.initializers.constant(0.25)),
            tfa.layers.InstanceNormalization(center=False, scale=False, epsilon=1e-5),
            Conv2D(nout, kernel_size=1, strides=1, kernel_initializer="he_normal"),
            PReLU(shared_axes=[0, 1, 2, 3], alpha_initializer=tf.initializers.constant(0.25)),
            tfa.layers.InstanceNormalization(center=False, scale=False, epsilon=1e-5)]
        )

    def call(self, x, y=None):
        x = self._upsample(x)
        if y is not None:
            x = x * y
        return self._convolution(x)


class Eb4TinyCd(Model):
    def __init__(self):
        super().__init__()

        efficientb4 = getattr(app, 'EfficientNetB4')(include_top=False)
        outputs = [
            efficientb4.get_layer(ln).output
            for ln in ["stem_activation", "block1b_add", "block2d_add", "block3d_add"]
        ]
        # outputs = [
        #     efficientb4.get_layer(ln).output
        #     for ln in ["stem_activation", "block1b_drop", "block2d_drop", "block3d_drop"]
        # ]
        self._backbones = []
        inputs = [efficientb4.get_layer(ln).input for ln in
                  ['input_1', 'block1a_dwconv', 'block2a_expand_conv', 'block3a_expand_conv']]
        for inx, inout in enumerate(zip(inputs, outputs)):
            inl, out = inout
            self._backbones.append(Model(inputs=inl, outputs=out, name=f"backbone_{inx}"))

        # Initialize mixing blocks:
        self._first_mix = MixingMaskAttentionBlock(3, [10, 5, 1])
        self._mixing_mask1 = MixingMaskAttentionBlock(24, [12, 6, 1])
        self._mixing_mask2 = MixingMaskAttentionBlock(32, [16, 8, 1])
        self._mixing_mask3 = MixingBlock(56)
        self._mixing_mask = []
        self._mixing_mask.append(self._mixing_mask1)
        self._mixing_mask.append(self._mixing_mask2)
        self._mixing_mask.append(self._mixing_mask3)
        # Initialize Upsampling blocks:
        self._up1 = UpMask(2, 56, 64)
        self._up2 = UpMask(2, 64, 64)
        self._up3 = UpMask(2, 64, 32)
        self._up = []
        self._up.append(self._up1)
        self._up.append(self._up2)
        self._up.append(self._up3)

        # Final classification layer:
        self._classify = PixelwiseLinear([16, 8, 1], Activation(tf.nn.sigmoid))

    def call(self, ref, test):
        features = self._encode(ref, test)
        latents = self._decode(features)
        return self._classify(latents)

    def _encode(self, ref, test):
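        # Run both images through the backbone stage by stage and mix the paired feature maps at each scale.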
        features = [self._first_mix(ref, test)]
        for num, layer in enumerate(self._backbones):
            ref, test = layer(ref), layer(test)
            if num != 0:
                features.append(self._mixing_mask[num - 1](ref, test))
        return features

    def _decode(self, features):
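        # Start from the deepest mixed features; at each step upsample and gate with the next shallower mask/feature.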
        upping = features[-1]
        for i, j in enumerate(range(-2, -5, -1)):
            upping = self._up[i](upping, features[j])
        return upping


if __name__ == '__main__':
    inputs = tf.random.normal(shape=(1, 256, 256, 3), dtype=tf.float32)
    model = Eb4TinyCd()
    output = model(inputs, inputs)
    weights = model.trainable_weights
    for w in weights:
        num_params = w.shape.num_elements()  # product of all dimensions
        print(w.shape, '  ', w.name, '  ', num_params)

    model.summary()

The loss function:

self.loss = tf.keras.losses.BinaryCrossentropy(from_logits=False, reduction=tf.keras.losses.Reduction.NONE)
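For reference, a minimal sketch of how a loss built with Reduction.NONE is typically applied (the shapes and the manual mean here are my assumption, not taken from the original training script):

import tensorflow as tf

loss_fn = tf.keras.losses.BinaryCrossentropy(
    from_logits=False, reduction=tf.keras.losses.Reduction.NONE
)

# With Reduction.NONE the call returns one loss value per pixel (the channel
# axis is averaged away), so the final reduction has to be done by hand,
# e.g. when averaging across replicas in a multi-GPU training loop.
y_true = tf.cast(tf.random.uniform((8, 256, 256, 1), maxval=2, dtype=tf.int32), tf.float32)
y_pred = tf.random.uniform((8, 256, 256, 1))

per_pixel = loss_fn(y_true, y_pred)   # shape (8, 256, 256)
loss = tf.reduce_mean(per_pixel)      # manual reduction to a scalar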

I found the bug. Here:

mixed = tf.stack([x, y], axis=1)

should use axis=-1: TensorFlow tensors are channels-last, so the two feature maps have to be stacked along the channel axis for the following reshape to interleave their channels.
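For anyone hitting the same issue, a quick standalone check (not part of the original post) of what the corrected stacking does:

import tensorflow as tf

# Two "feature maps" with 4 channels each, spatial size 1x1 for readability.
x = tf.reshape(tf.range(0, 4, dtype=tf.float32), (1, 1, 1, 4))   # channels 0 1 2 3
y = tf.reshape(tf.range(4, 8, dtype=tf.float32), (1, 1, 1, 4))   # channels 4 5 6 7

# Stack along the channel axis (channels-last), then flatten the last two axes:
mixed = tf.stack([x, y], axis=-1)                                # (1, 1, 1, 4, 2)
mixed = tf.reshape(mixed, (tf.shape(x)[0], tf.shape(x)[1], tf.shape(x)[2], -1))

print(mixed[0, 0, 0].numpy())   # [0. 4. 1. 5. 2. 6. 3. 7.] -> channels interleaved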

Hi,

thank you for pointing out this difference between PyTorch and TensorFlow.

If you find any other differences that may interest those who want to reimplement this work in TensorFlow, please report them here so that it can be useful for everyone.

Thank you.

Asking for help again.
I trained this network with the TensorFlow code above; the best validation F1 is around 0.89, while the training F1 is around 0.93. I have tried my best but cannot reach the PyTorch result. Do you have any ideas that could help me improve it?
Training environment:
4 GPUs, batch size 8,
lr 0.35, CosineDecay.
It is not stable with a large lr, even with 4 GPUs (a rough sketch of this setup is below).
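import tensorflow as tf

# Rough sketch of the setup described above. Whether "batch size 8" is per replica
# or global is not stated here; it is treated as per replica. The optimizer and
# the schedule length are placeholders of mine.
strategy = tf.distribute.MirroredStrategy()                 # expects 4 visible GPUs
global_batch_size = 8 * strategy.num_replicas_in_sync       # 32 with 4 GPUs

schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.35,
    decay_steps=100_000,                                    # placeholder schedule length
)

with strategy.scope():
    model = Eb4TinyCd()
    # Placeholder optimizer; the comment above does not say which one was used.
    optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)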

I trained several times, and I wonder why the losses suddenly rose:
[image: training loss curves]

Hello,

the first thing that comes to mind looking at the graphs you reported is that you may be using the cosine decay strategy with warm restarts (the drop around epoch 45 suggests it to me). I also tried a couple of experiments with warm restarts, and in the end I decided not to use them because I got better performance without them. I would suggest you avoid them.
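If it helps, a minimal sketch of the two schedules in the Keras API (the step counts are placeholder values of mine):

import tensorflow as tf

steps_per_epoch = 1000   # placeholder
epochs = 100             # placeholder

# Plain cosine decay: the learning rate decays once over the whole run.
cosine = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.35,
    decay_steps=epochs * steps_per_epoch,
)

# Cosine decay with warm restarts: the learning rate periodically jumps back up,
# which would explain a sudden rise in the loss (e.g. around epoch 45).
cosine_restarts = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=0.35,
    first_decay_steps=45 * steps_per_epoch,
)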

Secondly, as also reported in the paper, to obtain better results and to evaluate the stability of the model, we decided to run several experiments optimizing the initial parameters of the optimizer.
We did not have many resources available and were lucky enough to find the parameters reported in the training script after a few experiments.
To achieve performance in line with the PyTorch implementation, I would recommend using NNI to optimize the learning rate, weight decay, and amsgrad.
With 4 GPUs at your disposal you could find the best values for these 3 parameters quite quickly and close the 1-2 points that separate you from our best run, or even do better :) (A sketch of such a search is below.)
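A minimal sketch of the trial side of such an NNI search (the default values and the train_and_evaluate helper are placeholders of mine, not part of the repo; the search space file is configured separately in NNI):

import nni
import tensorflow_addons as tfa

# Default values are placeholders; NNI overrides them with the tuner's proposal.
params = {"lr": 0.35, "weight_decay": 1e-2, "amsgrad": False}
params.update(nni.get_next_parameter())

optimizer = tfa.optimizers.AdamW(
    weight_decay=params["weight_decay"],
    learning_rate=params["lr"],
    amsgrad=params["amsgrad"],
)

# Placeholder: build the model, train, and compute the validation F1 with this optimizer.
val_f1 = train_and_evaluate(optimizer)

# Report the metric the tuner should maximize.
nni.report_final_result(val_f1)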

Let me know if these two strategies work!

Have a good time

@AndreaCodegoni Thanks for your advice. I'll try.