lmoroney/dlaicourse

Course 4 Week 4 Exercise Answer - windowed_dataset

srwight opened this issue · 4 comments

The cell that defines the function windowed_dataset in S+P Week 4 Exercise Answer is as follows:

import tensorflow as tf

def windowed_dataset(series, window_size, batch_size, shuffle_buffer):
    series = tf.expand_dims(series, axis=-1)  # add a features axis: (T,) -> (T, 1)
    ds = tf.data.Dataset.from_tensor_slices(series)
    ds = ds.window(window_size + 1, shift=1, drop_remainder=True)  # sliding windows
    ds = ds.flat_map(lambda w: w.batch(window_size + 1))  # window datasets -> tensors
    ds = ds.shuffle(shuffle_buffer)
    ds = ds.map(lambda w: (w[:-1], w[1:]))  # (inputs, targets)
    return ds.batch(batch_size).prefetch(1)
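
For concreteness, here is a quick sketch of what this function returns (the toy series and parameter values are mine, not from the notebook):

import numpy as np

toy_series = np.arange(10, dtype=np.float32)  # hypothetical series, just to inspect shapes
ds = windowed_dataset(toy_series, window_size=5, batch_size=2, shuffle_buffer=1)
for x, y in ds.take(1):
    print(x.shape, y.shape)  # (2, 5, 1) (2, 5, 1): the target is a shifted window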

The second-to-last line,

    ds = ds.map(lambda w: (w[:-1], w[1:]))

seems wrong. Should it not be the following?

    ds = ds.map(lambda w: (w[:-1], w[-1]))

As it stands, if the window is

1 2 3 4 5 6

then we're getting a tuple:

([1 2 3 4 5],[2 3 4 5 6])
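
This is easy to verify in isolation; a minimal sketch (the example tensor is mine):

import tensorflow as tf

w = tf.constant([1, 2, 3, 4, 5, 6])
print(w[:-1].numpy(), w[1:].numpy())  # [1 2 3 4 5] [2 3 4 5 6] -- the current behavior
print(w[:-1].numpy(), w[-1].numpy())  # [1 2 3 4 5] 6 -- the proposed fix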

However, it appears there's another quirk that allows the whole thing to train without a shape error: the data isn't flattened after the convolutional layer, so the network outputs a full sequence. Now, to my mind, that means the network still has to figure out the 6, but it also gets to claim to have 'predicted' 2-5, values it was already given as input. Wouldn't that falsely inflate accuracy?

wngaw commented

I find that using ds = ds.map(lambda w: (w[:-1], w[1:])) actually produces a better MAE (around 14) than ds = ds.map(lambda w: (w[:-1], w[-1:])) (around 50+). Perhaps this is due to @srwight's point: the MAE is inflated because the model already knows about [2, 3, 4, 5] and is asked to 'predict' [2, 3, 4, 5] as well, resulting in data leakage.
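
One way to test whether the metric is inflated would be to score only the final time step of each predicted window, i.e. the genuinely unseen value. A rough sketch, assuming a trained model and a validation dataset val_ds built with the sequence targets (both names are placeholders, not from the notebook):

import tensorflow as tf

mae = tf.keras.metrics.MeanAbsoluteError()
for x, y in val_ds:  # y has shape (batch, window_size, 1)
    preds = model(x, training=False)
    mae.update_state(y[:, -1, :], preds[:, -1, :])  # score the last step only
print(mae.result().numpy())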

I am also confused about how the model fits with the 'mse' loss: we want the target to be one value, but train_y is a window_size-length vector.
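
For what it's worth, Keras's 'mse' handles this case without complaint: when the target and the prediction are both sequences of shape (batch, window_size, 1), the squared error is simply averaged over the time steps too. A small sketch with made-up tensors:

import tensorflow as tf

y_true = tf.ones((2, 5, 1))   # (batch, window_size, 1)
y_pred = tf.zeros((2, 5, 1))
loss = tf.keras.losses.MeanSquaredError()(y_true, y_pred)
print(loss.numpy())  # 1.0 -- averaged over batch and time steps alike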

In the notebook, the last LSTM layer is defined as follows:

  tf.keras.layers.LSTM(64, return_sequences=True),
  tf.keras.layers.Dense(30, activation="relu"),

It has return_sequences=True.

If you check the model summary, the output shape is (None, None, 1), so the model is expected to output a whole series; that's why ds = ds.map(lambda w: (w[:-1], w[1:])) is correct.
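
A minimal sketch of a model in this spirit that reproduces the (None, None, 1) output shape (the Conv1D settings here are my guess, not the notebook's exact values):

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv1D(32, kernel_size=5, padding="causal",
                           activation="relu", input_shape=[None, 1]),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.summary()  # final output shape: (None, None, 1) -- one value per time step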

So, a year later, I've learned that, for reasons we don't fully understand, time-series models seem to work better when they predict the whole sequence. My guess is that the loss from initially mispredicting the earlier steps informs the weights that eventually predict the next item (since the network is fully connected, after all). So this is a better network structure than one that only predicts the next item.

I maintain that the model structure is weak (because the dense layers should probably be time distributed), and I suggest that the course is weakened by leaving out the explanation. This notebook differs from the course video and caused me (and likely others) a fair amount of confusion.
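
For reference, wrapping the dense layers as suggested would look something like this (a sketch, not the notebook's code; note that in Keras a Dense layer applied to 3-D input already acts per time step, so the wrapper mainly makes the intent explicit):

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=[None, 1]),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(30, activation="relu")),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
])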

That said, I'm closing this issue.