Parameter count mismatch when converting `nn.TransformerEncoderLayer`
till-m opened this issue · 6 comments
Hey @AlexanderLutsenko,

My apologies for bugging you again so soon after you resolved my other request.

I noticed that when I convert an `nn.TransformerEncoderLayer`, the parameter counts are mismatched, even though the conversion itself proceeds without issue (all green according to nobuco). The problem seems to come from `linear1`, which for some reason doesn't get constructed with the right dimensions (or as a normal `Dense` layer). The correct number of parameters would be 128 × 256 + 256 = 33,024. However, the resulting TensorFlow model seems to construct a layer of size 512 × 256 = 131,072.

`torchinfo`'s summary:
```
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
TransformerEncoderLayer                  [1, 512, 128]             --
├─MultiheadAttention: 1-1                [1, 512, 128]             66,048
├─Dropout: 1-2                           [1, 512, 128]             --
├─LayerNorm: 1-3                         [1, 512, 128]             256
├─Linear: 1-4                            [1, 512, 256]             33,024
├─Dropout: 1-5                           [1, 512, 256]             --
├─Linear: 1-6                            [1, 512, 128]             32,896
├─Dropout: 1-7                           [1, 512, 128]             --
├─LayerNorm: 1-8                         [1, 512, 128]             256
==========================================================================================
Total params: 132,480
Trainable params: 132,480
Non-trainable params: 0
Total mult-adds (M): 0.07
==========================================================================================
Input size (MB): 0.26
Forward/backward pass size (MB): 2.62
Params size (MB): 0.27
Estimated Total Size (MB): 3.15
==========================================================================================
```
Keras's summary:

```
__________________________________________________________________________________________________
 Layer (type)                                  Output Shape      Param #   Connected to
==================================================================================================
 input_1 (InputLayer)                          [(1, 128, 512)]   0         []
 tf.compat.v1.transpose_2 (TFOpLambda)         (1, 512, 128)     0         ['input_1[0][0]']
 multi_head_attention_2 (MultiHeadAttention)   (1, 512, 128)     66048     ['tf.compat.v1.transpose_2[0][0]',
                                                                            'tf.compat.v1.transpose_2[0][0]',
                                                                            'tf.compat.v1.transpose_2[0][0]']
 dropout_1 (Dropout)                           (1, 512, 128)     0         ['multi_head_attention_2[1][0]']
 tf.compat.v1.transpose_3 (TFOpLambda)         (1, 128, 512)     0         ['dropout_1[0][0]']
 weight_layer_4 (WeightLayer)                  (1, 512, 256)     131072    ['input_1[0][0]']
 tf.__operators__.add (TFOpLambda)             (1, 128, 512)     0         ['input_1[0][0]',
                                                                            'tf.compat.v1.transpose_3[0][0]']
 dropout_2 (Dropout)                           (1, 512, 256)     0         ['weight_layer_4[0][0]']
 tf.compat.v1.transpose_4 (TFOpLambda)         (1, 512, 128)     0         ['tf.__operators__.add[0][0]']
 dense_1 (Dense)                               (1, 512, 128)     32896     ['dropout_2[0][0]']
 layer_normalization (LayerNormalization)      (1, 512, 128)     256       ['tf.compat.v1.transpose_4[0][0]']
 dropout_3 (Dropout)                           (1, 512, 128)     0         ['dense_1[0][0]']
 tf.__operators__.add_1 (TFOpLambda)           (1, 512, 128)     0         ['layer_normalization[0][0]',
                                                                            'dropout_3[0][0]']
 layer_normalization_1 (LayerNormalization)    (1, 512, 128)     256       ['tf.__operators__.add_1[0][0]']
 tf.compat.v1.transpose_5 (TFOpLambda)         (1, 128, 512)     0         ['layer_normalization_1[0][0]']
 tf.identity (TFOpLambda)                      (1, 128, 512)     0         ['tf.compat.v1.transpose_5[0][0]']
==================================================================================================
Total params: 230528 (900.50 KB)
Trainable params: 230528 (900.50 KB)
Non-trainable params: 0 (0.00 Byte)
__________________________________________________________________________________________________
```
To reproduce:

```python
import torch
import torch.nn as nn
import nobuco
from nobuco import ChannelOrder
from torchinfo import summary

pytorch_module = nn.TransformerEncoderLayer(128, 4, dim_feedforward=256, batch_first=True).eval()
#pytorch_module = nn.TransformerEncoderLayer(128, 4, dim_feedforward=256, batch_first=True).linear1.eval()

dummy_image = torch.rand(size=(1, 512, 128))
print(pytorch_module(dummy_image).mean())
print(summary(pytorch_module, dummy_image.shape))

keras_model = nobuco.pytorch_to_keras(
    pytorch_module,
    args=[dummy_image], kwargs=None,
    inputs_channel_order=ChannelOrder.TENSORFLOW,
    outputs_channel_order=ChannelOrder.TENSORFLOW
)
print(keras_model.summary())
```
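For reference, a quick way to compare the two totals directly on top of the script above (just a sketch; `count_params()` is the standard Keras method, the PyTorch side simply sums `numel()`):

```python
# Sanity check: compare raw parameter counts of the two models built above.
# With the summaries shown here, these come out as 132480 (PyTorch) vs 230528 (Keras).
torch_total = sum(p.numel() for p in pytorch_module.parameters())
keras_total = keras_model.count_params()
print(f"torch: {torch_total} | keras: {keras_total}")
```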
Any idea why this could happen?
Whoa, great find. Turns out, `nn.TransformerEncoderLayer` hits a rare corner case I did not foresee. Long story short, because the activation function was passed as a default argument (`activation: Union[str, Callable[[Tensor], Tensor]] = F.relu`), and because `TransformerEncoderLayer` is not part of some third-party package but of `torch` itself, Nobuco failed to trace the activation. Hence, its output was orphaned and included in the graph as a constant tensor (`WeightLayer`). `nn.Linear` is not to blame.

Made a quick fix, should work fine in `v0.11.7`.
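To make the corner case concrete, here is a hypothetical toy module (not Nobuco internals, and not the actual `TransformerEncoderLayer` source) showing the pattern that trips the tracer: the activation lives on the instance as a captured default argument, not as a registered submodule.

```python
import torch.nn as nn
import torch.nn.functional as F

# Toy illustration (hypothetical): the activation is a plain function captured
# from a default argument rather than a registered nn.Module, so a tracer that
# hooks modules and known functional calls can miss it and end up treating its
# output as a constant tensor.
class FeedForwardBlock(nn.Module):
    def __init__(self, d_model=128, d_ff=256, activation=F.relu):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.activation = activation  # just a function reference on the instance

    def forward(self, x):
        return self.linear2(self.activation(self.linear1(x)))
```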
Thanks, that seems to have fixed the problem with the `TransformerEncoderLayer`.

However, I'm now running into a small issue where buffers are registered as trainable parameters. Should I handle buffers differently?

Simple example:
```python
import torch
import torch.nn as nn
import nobuco
from nobuco import ChannelOrder
from torchinfo import summary
import math

class Rotate2D(nn.Module):
    def __init__(self, theta=0.3) -> None:
        super().__init__()
        a = torch.Tensor([[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]])
        self.register_buffer('a', a)

    def forward(self, x):
        return torch.einsum('ij, ... j -> ... i', self.a, x)

pytorch_module = Rotate2D().eval()

dummy_image = torch.rand(size=(1, 100, 2))
print(pytorch_module(dummy_image).mean())
print(summary(pytorch_module, dummy_image.shape))

keras_model = nobuco.pytorch_to_keras(
    pytorch_module,
    args=[dummy_image], kwargs=None,
    inputs_channel_order=ChannelOrder.PYTORCH,
    outputs_channel_order=ChannelOrder.PYTORCH
)
print(keras_model.summary())
```
Try `constants_to_variables=False`. In the future, I should somehow check whether the tensor is trainable or not.
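For reference, applied to the snippet above that would look something like this (a sketch, assuming `constants_to_variables` is accepted as a keyword argument of `pytorch_to_keras`; I haven't dug into what else it affects):

```python
# Sketch of the suggested workaround: convert with constants_to_variables=False
# so the registered buffer ends up embedded as a constant, not a trainable variable.
keras_model = nobuco.pytorch_to_keras(
    pytorch_module,
    args=[dummy_image], kwargs=None,
    inputs_channel_order=ChannelOrder.PYTORCH,
    outputs_channel_order=ChannelOrder.PYTORCH,
    constants_to_variables=False,
)
```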
Thanks! I'm very sorry, but I think I may have discovered another issue with Transformer layers. It looks like a stack of transformer layers constructed with `TransformerEncoder` only retains the `Dense` layers of the first `TransformerEncoderLayer` in the stack.

Keras:
Total params: 199040 (777.50 KB)
Trainable params: 199040 (777.50 KB)

PyTorch:
trainable params: 264960 || all params: 264960

Difference: 264,960 - 199,040 = 65,920, which is exactly the size of the inverse-bottleneck dense layers: 33,024 + 32,896 = 65,920. For n=3 layers, the parameter difference becomes 65,920 × 2 = 131,840, and so on.
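A quick check of that arithmetic (plain Python restating the numbers above); the script below then reproduces the mismatch with a two-layer stack:

```python
# Parameter count of one layer's feed-forward (inverse bottleneck) pair.
d_model, d_ff = 128, 256
linear1 = d_model * d_ff + d_ff      # 33,024
linear2 = d_ff * d_model + d_model   # 32,896
print(linear1 + linear2)             # 65,920 -- the missing amount per extra layer
```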
```python
import torch
import torch.nn as nn
import nobuco
from nobuco import ChannelOrder

pytorch_module = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(128, 4, dim_feedforward=256, batch_first=True).eval(),
    2)

dummy_image = torch.rand(size=(1, 512, 128))
print(pytorch_module(dummy_image).mean())

def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )

keras_model = nobuco.pytorch_to_keras(
    pytorch_module,
    args=[dummy_image], kwargs=None,
    inputs_channel_order=ChannelOrder.TENSORFLOW,
    outputs_channel_order=ChannelOrder.TENSORFLOW
)
print(keras_model.summary())
print_trainable_parameters(pytorch_module)
```
I appreciate your perseverance! You just discovered a major bug: Nobuco couldn't handle `deepcopy`. Like almost everything else, it traces the `forward` method of `nn.Module` via decorating. Decorated methods are stored in the class instance, so `copy.deepcopy` deep-copies them too. That means each copy's `forward` is bound to the original object, so both of them do the same computation. The bug is especially insidious as it breaks the original PyTorch model, and the debug log does not show any problems (all green). Anyways, fixed in `v0.12.0`. Good riddance.
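For the curious, a hypothetical toy sketch (not Nobuco's actual code) of why a wrapper stored on the instance keeps pointing at the original object after `copy.deepcopy`:

```python
import copy

class Module:
    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return x * self.scale

original = Module(2)
# Emulate tracing-by-decoration: the wrapper stored on the instance closes
# over `original` itself, not over whatever object it is later attached to.
original.forward = lambda x: Module.forward(original, x)

clone = copy.deepcopy(original)
clone.scale = 10

print(original.forward(3))  # 6
print(clone.forward(3))     # also 6 -- the copied wrapper still runs the original module
```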
Thanks @AlexanderLutsenko, the parameter counts match now. Fingers crossed that the results are the same, too! Thanks for all the help! :) I will close this issue now.
> I appreciate your perseverance!

I'm just very eager to avoid having to do anything myself.