why temporal pad at all?
david-waterworth opened this issue · 6 comments
Nice work! I'm researching time series regression using machine learning, so I'm looking at LSTM-, TCN- and Transformer-based models, and I'm getting good results with your model.
One general question: I'm not sure I understand why we pad each layer of a TCN at all. I understand that it ensures each layer produces a sequence of the same length, so there's a benefit in that your predictions are aligned with your inputs. But it's very similar to initialising an AR(p) model with a vector of zeros when you predict forward - the initial predictions will all be "wrong" until the effect of the initial state has decayed out. LSTMs also have this issue - most applications seem to set the initial state per batch to zero, which results in transient errors at the start of the batch (some authors train a separate model to estimate the initial state, which I've had good success with). I would assume this affects training as well, and it seems to make sense to mask out the start of the output sequence when calculating the loss, or the model may try to adapt to "fix" the impact of the wrong initial condition.
Certainly when I train a regression-based TCN I can observe transient errors at the start of the prediction - i.e. the diagram below under-predicts for the first 96 samples (that's 1 day of 15-minute electricity consumption), then over-predicts for the rest of the first week before settling down. Interested in your thoughts.
Also, one general observation - the predictions from the TCN seem noisier than from the LSTM; I thought the long AR window might filter out more noise than it has. It's also quite sensitive to the learning rate - a low learning rate produces a very noisy output sequence.
Thanks for the interest in our work! You are absolutely right (and that's what I did) - you can discard the first few tokens of the sequence when predicting with a TCN/LSTM, as these elements typically come with insufficient context information (and therefore contribute a biased/noisy loss). However, that doesn't mean we don't need to pad the sequence, because otherwise the sequence gets shorter and shorter as you go deeper in the architecture. What you want instead is to pad so the sequence length is preserved, and then discard the first few time steps of the final output (the ones with short history).
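For concreteness, here is a minimal sketch of a length-preserving causal convolution in PyTorch (the class name and layout are my own illustration, not the repo's code): left-pad by `(kernel_size - 1) * dilation` so the output has the same length as the input, and no filter ever sees the future.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that preserves sequence length by padding the past only."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation  # history needed per step
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              dilation=dilation)

    def forward(self, x):                  # x: (batch, channels, T)
        x = F.pad(x, (self.left_pad, 0))   # zero-pad the left (past) side only
        return self.conv(x)                # output: (batch, out_channels, T)

y = CausalConv1d(1, 8, kernel_size=3, dilation=2)(torch.randn(4, 1, 100))
assert y.shape[-1] == 100  # length preserved regardless of kernel/dilation
```

Because every layer preserves T, stacking k of these keeps the output aligned with the input, and you decide separately how many leading steps to discard in the loss.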
Thanks @jerrybai1995. My thinking, though, was that the reduction in sequence length at each hidden layer only drops values that are computed using at least one of the padded values from the previous layer anyway. I.e. if instead of padding with 0 you used NaN (if PyTorch allowed it), would anything change other than in the first n points of the final output, where n is the length of the receptive field (maybe +/- 1), which will be dropped anyway? I probably need to step through it on the whiteboard again tomorrow...
Your understanding is correct only for 1-layer networks. Think of a causal convolutional layer whose kernel covers the current step plus the previous n steps. At first you have sequence length T. You are right that the output sequence (without padding) will have length T-n. However, if your TCN has k layers, then the hidden sequences get shorter and shorter: T-n, T-2n, T-3n, ..., T-kn. So you will have shorter and shorter sequences...
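You can see the shrinkage directly with a toy example (the sizes here are my own, just for illustration):

```python
import torch
import torch.nn as nn

T, n, k = 100, 4, 5  # sequence length, past steps per kernel, number of layers
x = torch.randn(1, 1, T)
for layer in range(k):
    # each kernel covers the current step plus n past steps -> kernel_size = n + 1
    x = nn.Conv1d(1, 1, kernel_size=n + 1)(x)  # no padding
    print(f"after layer {layer + 1}: length {x.shape[-1]}")
# prints lengths T-n, T-2n, ..., T-kn: 96, 92, 88, 84, 80
```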
I'm basing my understanding on this blog post https://theblog.github.io/post/convolution-in-autoregressive-neural-networks/
I understand that not padding reduces the hidden sequence lengths - but in the image, I believe not padding would mean the hidden layer loses the first 3 items and the output the first 6. All of those values are partially computed from the padded zeros anyway. So if you didn't pad and then computed the loss after dropping the first 6 ground-truth values, isn't that the same as padding and then dropping the first 6 items from both the predictions and the ground truths?
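Here's a quick numerical sketch of what I mean (toy sizes of my own choosing: two layers of kernel size 4, so 3 steps lost per layer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
conv1 = nn.Conv1d(1, 1, kernel_size=4)  # each layer drops 3 steps unpadded
conv2 = nn.Conv1d(1, 1, kernel_size=4)
x = torch.randn(1, 1, 50)

# no padding: output is 6 steps shorter than the input
unpadded = conv2(torch.relu(conv1(x)))

# left zero-padding at each layer: output keeps the input length
pad = lambda t: F.pad(t, (3, 0))
padded = conv2(torch.relu(pad(conv1(pad(x)))))

# after discarding the first 6 padded outputs, the two agree
assert torch.allclose(padded[..., 6:], unpadded, atol=1e-6)
```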
They are equivalent indeed, but my point is that this "amount of dropping" is then directly tied to your number of layers: as I said, after k layers you are effectively dropping the first k*n elements. This is something we want to avoid, because:
- We don't want to lose too much data unnecessarily. We only want to drop a reasonable amount, and that amount should be independent of our choice of model (which includes the choice of k).
- More importantly, we drop these first few tokens because we believe they "don't have enough context information." But what decides a "sufficient context"? It should be the domain/task. For example, in NLP/language tasks, perhaps the last 20 steps are important; in speech, perhaps the last 100. What we don't want is to have this quantity tied to the depth of the network (i.e., the k). It should be something we can control, even if the same model (same k) is used on very different applications.
Therefore, if you do pad the sequence, you can specify in your loss 1) how you want to weigh/drop these insufficient-context tokens, and 2) by how much. Padding is what gives you that flexibility.
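As a sketch of what that looks like in practice (the function names and the linear ramp are my own illustration, not from the repo), the warm-up length becomes a task-level hyperparameter rather than a function of depth:

```python
import torch

def masked_mse(pred, target, warmup):
    """MSE that simply drops the first `warmup` steps of both sequences.

    `warmup` is a task-level choice (e.g. 96 steps = 1 day of 15-minute
    readings), decoupled from network depth because padding preserved length.
    """
    return ((pred[..., warmup:] - target[..., warmup:]) ** 2).mean()

def ramped_mse(pred, target, warmup):
    """MSE with weights ramping linearly from 0 to 1 over the warm-up window."""
    T = pred.shape[-1]
    w = torch.clamp(torch.arange(T, dtype=pred.dtype) / max(warmup, 1), max=1.0)
    return (w * (pred - target) ** 2).mean()
```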
@jerrybai1995 ah yes, that makes eminent sense. My initial exploratory analysis with time series regression using your model is very promising. I did note that although I used a depth of 10 - giving a receptive field of n ~ 1024 - the predictions and ground truths converged in far fewer than n steps. Your idea of weighting makes a lot of sense. Perhaps using mean/variance rather than min/max scaling (with a tanh activation) would also reduce the transient. My plan this week is to use Optuna to investigate the best hyperparameters.
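For reference, a small helper for the receptive-field arithmetic above (my own sketch, assuming dilations that double per level: 1, 2, 4, ..., 2**(levels-1)):

```python
def receptive_field(kernel_size, levels, convs_per_level=1):
    """Receptive field of a TCN whose dilation doubles at each level.

    With kernel_size=2, levels=10 and one conv per level this gives 1024,
    matching the n ~ 1024 quoted above. Variants that stack two convs per
    residual block roughly double the figure.
    """
    return 1 + convs_per_level * (kernel_size - 1) * (2 ** levels - 1)
```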