Parameter count mismatch when converting `nn.TransformerEncoderLayer`
till-m opened this issue · 6 comments
Hey @AlexanderLutsenko,

My apologies for bugging you again so soon after you resolved my other request.

I noticed that when I convert an `nn.TransformerEncoderLayer`, the parameter counts are mismatched, even though the conversion itself proceeds without issue (all green according to nobuco). The problem seems to come from `linear1`, which for some reason doesn't get constructed with the right dimensions (or as a normal `Dense` layer). The correct number of parameters would be 128 × 256 + 256 = 33,024. However, the resulting TensorFlow model seems to construct a layer of size 512 × 256 = 131,072.

`torchinfo`'s summary:
```
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
TransformerEncoderLayer                  [1, 512, 128]             --
├─MultiheadAttention: 1-1                [1, 512, 128]             66,048
├─Dropout: 1-2                           [1, 512, 128]             --
├─LayerNorm: 1-3                         [1, 512, 128]             256
├─Linear: 1-4                            [1, 512, 256]             33,024
├─Dropout: 1-5                           [1, 512, 256]             --
├─Linear: 1-6                            [1, 512, 128]             32,896
├─Dropout: 1-7                           [1, 512, 128]             --
├─LayerNorm: 1-8                         [1, 512, 128]             256
==========================================================================================
Total params: 132,480
Trainable params: 132,480
Non-trainable params: 0
Total mult-adds (M): 0.07
==========================================================================================
Input size (MB): 0.26
Forward/backward pass size (MB): 2.62
Params size (MB): 0.27
Estimated Total Size (MB): 3.15
==========================================================================================
```
Keras's summary:

```
__________________________________________________________________________________________________
 Layer (type)                                  Output Shape      Param #   Connected to
==================================================================================================
 input_1 (InputLayer)                          [(1, 128, 512)]   0         []
 tf.compat.v1.transpose_2 (TFOpLambda)         (1, 512, 128)     0         ['input_1[0][0]']
 multi_head_attention_2 (MultiHeadAttention)   (1, 512, 128)     66048     ['tf.compat.v1.transpose_2[0][0]',
                                                                            'tf.compat.v1.transpose_2[0][0]',
                                                                            'tf.compat.v1.transpose_2[0][0]']
 dropout_1 (Dropout)                           (1, 512, 128)     0         ['multi_head_attention_2[1][0]']
 tf.compat.v1.transpose_3 (TFOpLambda)         (1, 128, 512)     0         ['dropout_1[0][0]']
 weight_layer_4 (WeightLayer)                  (1, 512, 256)     131072    ['input_1[0][0]']
 tf.__operators__.add (TFOpLambda)             (1, 128, 512)     0         ['input_1[0][0]',
                                                                            'tf.compat.v1.transpose_3[0][0]']
 dropout_2 (Dropout)                           (1, 512, 256)     0         ['weight_layer_4[0][0]']
 tf.compat.v1.transpose_4 (TFOpLambda)         (1, 512, 128)     0         ['tf.__operators__.add[0][0]']
 dense_1 (Dense)                               (1, 512, 128)     32896     ['dropout_2[0][0]']
 layer_normalization (LayerNormalization)      (1, 512, 128)     256       ['tf.compat.v1.transpose_4[0][0]']
 dropout_3 (Dropout)                           (1, 512, 128)     0         ['dense_1[0][0]']
 tf.__operators__.add_1 (TFOpLambda)           (1, 512, 128)     0         ['layer_normalization[0][0]',
                                                                            'dropout_3[0][0]']
 layer_normalization_1 (LayerNormalization)    (1, 512, 128)     256       ['tf.__operators__.add_1[0][0]']
 tf.compat.v1.transpose_5 (TFOpLambda)         (1, 128, 512)     0         ['layer_normalization_1[0][0]']
 tf.identity (TFOpLambda)                      (1, 128, 512)     0         ['tf.compat.v1.transpose_5[0][0]']
==================================================================================================
Total params: 230528 (900.50 KB)
Trainable params: 230528 (900.50 KB)
Non-trainable params: 0 (0.00 Byte)
__________________________________________________________________________________________________
```
To reproduce:

```python
import torch
import torch.nn as nn
import nobuco
from nobuco import ChannelOrder
from torchinfo import summary

pytorch_module = nn.TransformerEncoderLayer(128, 4, dim_feedforward=256, batch_first=True).eval()
#pytorch_module = nn.TransformerEncoderLayer(128, 4, dim_feedforward=256, batch_first=True).linear1.eval()

dummy_image = torch.rand(size=(1, 512, 128))
print(pytorch_module(dummy_image).mean())
print(summary(pytorch_module, dummy_image.shape))

keras_model = nobuco.pytorch_to_keras(
    pytorch_module,
    args=[dummy_image], kwargs=None,
    inputs_channel_order=ChannelOrder.TENSORFLOW,
    outputs_channel_order=ChannelOrder.TENSORFLOW
)
print(keras_model.summary())
```
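For reference, a quick way to compare the two totals directly on top of the script above (just a sketch; `count_params()` is the standard Keras method, the PyTorch side simply sums `numel()`):

```python
# Sanity check: compare raw parameter counts of the two models built above.
# With the summaries shown here, these come out as 132480 (PyTorch) vs 230528 (Keras).
torch_total = sum(p.numel() for p in pytorch_module.parameters())
keras_total = keras_model.count_params()
print(f"torch: {torch_total} | keras: {keras_total}")
```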
Any idea why this could happen?
Whoa, great find. Turns out, `nn.TransformerEncoderLayer` hits a rare corner case I did not foresee. Long story short, because the activation function was passed as a default argument (`activation: Union[str, Callable[[Tensor], Tensor]] = F.relu`), and because `TransformerEncoderLayer` is not part of some third-party package but of `torch` itself, Nobuco failed to trace the activation. Hence, its output was orphaned and included in the graph as a constant tensor (`WeightLayer`). `nn.Linear` is not to blame.

Made a quick fix, should work fine in `v0.11.7`.
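To make the corner case concrete, here is a hypothetical toy module (not Nobuco internals, and not the actual `TransformerEncoderLayer` source) showing the pattern that trips the tracer: the activation lives on the instance as a captured default argument, not as a registered submodule.

```python
import torch.nn as nn
import torch.nn.functional as F

# Toy illustration (hypothetical): the activation is a plain function captured
# from a default argument rather than a registered nn.Module, so a tracer that
# hooks modules and known functional calls can miss it and end up treating its
# output as a constant tensor.
class FeedForwardBlock(nn.Module):
    def __init__(self, d_model=128, d_ff=256, activation=F.relu):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.activation = activation  # just a function reference on the instance

    def forward(self, x):
        return self.linear2(self.activation(self.linear1(x)))
```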
Thanks, that seems to have fixed the problem with the `TransformerEncoderLayer`.

However, I'm now running into a small issue where buffers are registered as trainable parameters. Should I handle buffers differently?

Simple example:
```python
import torch
import torch.nn as nn
import nobuco
from nobuco import ChannelOrder
from torchinfo import summary
import math

class Rotate2D(nn.Module):
    def __init__(self, theta=0.3) -> None:
        super().__init__()
        a = torch.Tensor([[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]])
        self.register_buffer('a', a)

    def forward(self, x):
        return torch.einsum('ij, ... j -> ... i', self.a, x)

pytorch_module = Rotate2D().eval()

dummy_image = torch.rand(size=(1, 100, 2))
print(pytorch_module(dummy_image).mean())
print(summary(pytorch_module, dummy_image.shape))

keras_model = nobuco.pytorch_to_keras(
    pytorch_module,
    args=[dummy_image], kwargs=None,
    inputs_channel_order=ChannelOrder.PYTORCH,
    outputs_channel_order=ChannelOrder.PYTORCH
)
print(keras_model.summary())
```
Try `constants_to_variables=False`. In the future, I should somehow check whether the tensor is trainable or not.
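For reference, applied to the snippet above that would look something like this (a sketch, assuming `constants_to_variables` is accepted as a keyword argument of `pytorch_to_keras`; I haven't dug into what else it affects):

```python
# Sketch of the suggested workaround: convert with constants_to_variables=False
# so the registered buffer ends up embedded as a constant, not a trainable variable.
keras_model = nobuco.pytorch_to_keras(
    pytorch_module,
    args=[dummy_image], kwargs=None,
    inputs_channel_order=ChannelOrder.PYTORCH,
    outputs_channel_order=ChannelOrder.PYTORCH,
    constants_to_variables=False,
)
```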
Thanks! I'm very sorry, but I think I may have discovered another issue with Transformer layers. It looks like a stack of transformer layers constructed with `TransformerEncoder` only retains the `Dense` layers of the first `TransformerEncoderLayer` in the stack.

Keras:
Total params: 199040 (777.50 KB)
Trainable params: 199040 (777.50 KB)

PyTorch:
trainable params: 264960 || all params: 264960

Difference: 264,960 - 199,040 = 65,920, which is exactly the size of the inverse-bottleneck dense layers: 33,024 + 32,896 = 65,920. For n=3 layers, the parameter difference becomes 65,920 × 2 = 131,840, and so on.
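A quick check of that arithmetic (plain Python restating the numbers above); the script below then reproduces the mismatch with a two-layer stack:

```python
# Parameter count of one layer's feed-forward (inverse bottleneck) pair.
d_model, d_ff = 128, 256
linear1 = d_model * d_ff + d_ff      # 33,024
linear2 = d_ff * d_model + d_model   # 32,896
print(linear1 + linear2)             # 65,920 -- the missing amount per extra layer
```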
```python
import torch
import torch.nn as nn
import nobuco
from nobuco import ChannelOrder

pytorch_module = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(128, 4, dim_feedforward=256, batch_first=True).eval(),
    2)

dummy_image = torch.rand(size=(1, 512, 128))
print(pytorch_module(dummy_image).mean())

def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )

keras_model = nobuco.pytorch_to_keras(
    pytorch_module,
    args=[dummy_image], kwargs=None,
    inputs_channel_order=ChannelOrder.TENSORFLOW,
    outputs_channel_order=ChannelOrder.TENSORFLOW
)
print(keras_model.summary())
print_trainable_parameters(pytorch_module)
```
I appreciate your perseverance! You just discovered a major bug: Nobuco couldn't handle `deepcopy`. Like almost everything else, it traces the `forward` method of `nn.Module` via decorating. Decorated methods are stored in the class instance, so `copy.deepcopy` deep-copies them too. That means each copy's `forward` is bound to the original object, so both of them do the same computation. The bug is especially insidious as it breaks the original PyTorch model, and the debug log does not show any problems (all green). Anyways, fixed in `v0.12.0`. Good riddance.
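For the curious, a hypothetical toy sketch (not Nobuco's actual code) of why a wrapper stored on the instance keeps pointing at the original object after `copy.deepcopy`:

```python
import copy

class Module:
    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return x * self.scale

original = Module(2)
# Emulate tracing-by-decoration: the wrapper stored on the instance closes
# over `original` itself, not over whatever object it is later attached to.
original.forward = lambda x: Module.forward(original, x)

clone = copy.deepcopy(original)
clone.scale = 10

print(original.forward(3))  # 6
print(clone.forward(3))     # also 6 -- the copied wrapper still runs the original module
```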
Thanks @AlexanderLutsenko, the parameter counts match now. Fingers crossed that the results are the same, too! Thanks for all the help! :) I will close this issue now.
> I appreciate your perseverance!

I'm just very eager to avoid having to do anything myself.