locuslab/TCN

Zero padding - possibly incorrect behavior?

iceboundflame opened this issue · 1 comment

Thank you for sharing the code and paper; they have been very helpful. I think I may have found a subtle issue with the padding scheme and would appreciate another opinion.

Conceptually, we'd like every input before the first time step of the sequence to be zero. But I noticed that the implementation pads the input of every Conv1d with zeros, not just the first one. In my opinion, this is incorrect behavior for every layer beyond the first.

Here is a diagram of the issue.

[Diagram: the padded inputs at each layer of the TCN]

The triangles represent padded inputs. The bottom row (the sequence input) is padded with 0, which is correct. However, the first layer's outputs are also padded with 0 (red triangles) before being fed to the next layer. I think we should instead pad with a constant vector: the result of convolving an all-zero receptive field (essentially conv1's bias term).

Similarly, the next layer up should be padded with its own constant vector, whose value is the result of convolving a receptive field filled with the previous layer's constant padding value.
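Here is a rough sketch of what I mean for the first layer (a minimal example, not the repo's API; conv1, relu1, pad_len, and zero_resp are illustrative names, and I am assuming the padding would be applied explicitly rather than via the Conv1d padding argument):

import torch
import torch.nn as nn

# Sketch: compute what an all-zero history actually looks like after the first
# conv + ReLU, then use that per-channel constant (rather than 0) to left-pad the
# input of the next layer.
conv1 = nn.Conv1d(1, 2, kernel_size=2, dilation=1)  # no implicit padding here
relu1 = nn.ReLU()
pad_len = 4  # however much left context the next dilated conv needs

with torch.no_grad():
    # For an all-zero receptive field the conv output is just its bias; the last
    # position of a long-enough all-zero input gives that constant per channel.
    zero_resp = relu1(conv1(torch.zeros(1, 1, pad_len + 1)))[:, :, -1:]  # shape (1, 2, 1)

h = torch.randn(1, 2, 10)  # stand-in for the first layer's output on real inputs
h_padded = torch.cat([zero_resp.expand(-1, -1, pad_len), h], dim=-1)  # constant padding, not zeros

The layer above would do the same thing, except its zero response is computed from a receptive field filled with this constant rather than with zeros.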

Impact: a network with receptive field $r$ will produce incorrect results prior to the $r$-th input. "Incorrect" here means at least inconsistent with the network's behavior in the steady state, far from the beginning of the input. This might matter most when the receptive field is long and sequences are similar in length to it, because then a substantial portion of each training example depends on these wrong padding values.
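For reference, a back-of-the-envelope receptive field calculation, assuming the stock layout of two dilated convolutions per block with dilation 2**i at block i (the helper below is mine, not part of the repo):

def tcn_receptive_field(kernel_size, num_levels):
    # block i contributes two convs, each adding (kernel_size - 1) * 2**i steps of context
    return 1 + 2 * (kernel_size - 1) * (2 ** num_levels - 1)

print(tcn_receptive_field(kernel_size=2, num_levels=2))  # 7, matching the test output below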

Here's a simple test case that demonstrates that prepending a sequence of zeros to the input changes the output.

import torch
import torch.nn as nn

import tcn  # tcn.py from this repo (locuslab/TCN)


def test_tcn():
    torch.manual_seed(42)

    def init_weights(m):
        if isinstance(m, nn.Conv1d):
            if hasattr(m, 'weight_g'):
                # weight_norm was applied to this layer
                torch.nn.init.uniform_(m.weight_g)
                torch.nn.init.uniform_(m.weight_v)
                # XXX: not sure if this is the correct way to initialize
            else:
                torch.nn.init.uniform_(m.weight)
            torch.nn.init.uniform_(m.bias)

    with torch.no_grad():
        net = tcn.TemporalConvNet(num_inputs=1, num_channels=[2, 1], kernel_size=2, dropout=0)
        net.apply(init_weights)
        print("Receptive field", net.receptive_field_size)

        # Prepend i zeros before a single 1 and watch how the outputs change
        for i in range(8):
            print(f"Padding with {i} zeros:",
                  net(torch.Tensor([[[0] * i + [1]]])))

        # Steady-state response to an all-zero input
        print("Zero input response:", net(torch.Tensor([[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]])))


test_tcn()

Output:
Receptive field 7
Padding with 0 zeros: tensor([[[2.1018]]])
Padding with 1 zeros: tensor([[[1.3458, 2.2364]]])
Padding with 2 zeros: tensor([[[1.3458, 1.4805, 2.4149]]])
Padding with 3 zeros: tensor([[[1.3458, 1.4805, 1.6590, 2.4309]]])
Padding with 4 zeros: tensor([[[1.3458, 1.4805, 1.6590, 1.6749, 2.4466]]])
Padding with 5 zeros: tensor([[[1.3458, 1.4805, 1.6590, 1.6749, 1.6907, 2.4550]]])
Padding with 6 zeros: tensor([[[1.3458, 1.4805, 1.6590, 1.6749, 1.6907, 1.6991, 2.4550]]])
Padding with 7 zeros: tensor([[[1.3458, 1.4805, 1.6590, 1.6749, 1.6907, 1.6991, 1.6991, 2.4550]]])

Zero input response: tensor([[[1.3458, 1.4805, 1.6590, 1.6749, 1.6907, 1.6991, 1.6991, 1.6991,
          1.6991, 1.6991, 1.6991, 1.6991]]])

Clearly this TCN implementation is still able to achieve great results, so I am not yet sure of the practical impact. I'll experiment with changing it for my application.

After each convolution in the TCN, we eliminate the first p elements of all channels, where p is the number of padding elements. As a result, I don't believe this should be a problem as long as you use the 1D convolutions in the context of the provided TCN architecture. I.e., within one temporal residual block, the layers

self.conv1 = weight_norm(nn.Conv1d(n_inputs, n_outputs, kernel_size,
                                   stride=stride, padding=padding, dilation=dilation))
self.chomp1 = Chomp1d(padding)
self.relu1 = nn.ReLU()
self.dropout1 = nn.Dropout(dropout)

are applied in this order, so we apply Chomp1d before doing another Conv1d. So, even though some "incorrect" elements appear at intermediate stages, we are continuously removing them (and re-introducing them with the next padded convolution), and they are ultimately removed before the final result is output.
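For concreteness, here is a minimal shape check for one conv + chomp stage (a sketch, assuming padding = (kernel_size - 1) * dilation as in the block above; the plain slice below only stands in for Chomp1d, and the variable names are mine):

import torch
import torch.nn as nn

# One conv stage of the block above: with p = (kernel_size - 1) * dilation zeros of
# padding on both ends, the raw Conv1d output is p elements longer than the input;
# trimming p elements (which is what Chomp1d(p) is there for) brings it back to length L.
L, k, d = 10, 2, 4
p = (k - 1) * d
conv = nn.Conv1d(1, 1, kernel_size=k, padding=p, dilation=d)
x = torch.randn(1, 1, L)
y = conv(x)
print(y.shape)             # torch.Size([1, 1, 14]) == (1, 1, L + p)
print(y[:, :, :-p].shape)  # torch.Size([1, 1, 10]) == (1, 1, L) after trimming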