DnanaDev/Covid19-India-Analysis-and-Forecasting

[Data Leakage] Growth Ratio/Factor transformation before train-test split leaks data.

Closed this issue · 0 comments

A fairly subtle form of leakage in the Growth Ratio and Growth Factor sections:

I have the confirmed cases time-series for n days:

Then I calculate the Growth Ratio for nth day:

and the plan is to model growth ratio and predict growth ratio on day n+1

which can then be simply transformed into the predicted cases:
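In symbols, assuming the usual definition of the growth ratio as the ratio of consecutive cumulative case counts (the notation $C_i$ for confirmed cases on day $i$ and $Gr_i$ for the growth ratio on day $i$ is chosen here for illustration), the three steps above would read:

$$
C_1, C_2, \dots, C_n
$$

$$
Gr_n = \frac{C_n}{C_{n-1}}
$$

$$
\hat{C}_{n+1} = \widehat{Gr}_{n+1} \cdot C_n
$$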

The problem arises when Gr is calculated on the entire dataset before the train-test split.

For example, suppose you calculate Gr on the full series, then use the first 10 days for training and day 11 onward for validation. Gr on the 10th day:

and for the validation set, Gr on the 11th day:

There is overlap, and if you now use lag features, the leakage is amplified.
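Written out for the 10/11 split, with $C_i$ denoting confirmed cases on day $i$ and assuming $Gr_i = C_i / C_{i-1}$ (both the notation and the definition are assumptions for illustration):

$$
Gr_{10} = \frac{C_{10}}{C_{9}}, \qquad Gr_{11} = \frac{C_{11}}{C_{10}}
$$

$C_{10}$ appears in both the last training value and the first validation value. With a lag-1 feature, the validation row for day 11 also carries $Gr_{10}$ directly, so values from the training period sit on both sides of the split.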

[This is not a problem if the time series is for a variable that is not a ratio (i.e. one that does not encode earlier values). In that case, calculating lags and doing a recursive multi-step forward forecast is fine.]

Solution:
Split the original series into train and test first, then calculate the growth ratio and create the lags on each part separately. This loses 'number of lags' samples from the start of the validation set but should be leakage-free.
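A minimal sketch of the split-then-transform order with pandas. The helper `growth_ratio_with_lags`, the toy case counts, and the choice of two lags are all hypothetical and not taken from the repository; the growth ratio is assumed to be cases on day n divided by cases on day n-1.

```python
import pandas as pd


def growth_ratio_with_lags(cases: pd.Series, n_lags: int) -> pd.DataFrame:
    """Compute growth ratio and its lags using only the values in `cases`."""
    gr = cases / cases.shift(1)  # assumed definition: C_n / C_(n-1)
    frame = pd.DataFrame({"gr": gr})
    for k in range(1, n_lags + 1):
        frame[f"gr_lag_{k}"] = gr.shift(k)
    # Rows without a full history inside this split are dropped.
    return frame.dropna()


# Toy cumulative case counts for days 1..15 (made-up numbers).
cases = pd.Series(
    [100, 120, 150, 180, 225, 270, 330, 400, 480, 580,
     700, 850, 1000, 1200, 1450],
    index=range(1, 16),
    dtype=float,
)

split_day = 10
n_lags = 2

# 1) Split the raw series first.
train_cases = cases.loc[:split_day]
valid_cases = cases.loc[split_day + 1:]

# 2) Transform each part independently, so no validation row
#    is built from a training-period value.
train = growth_ratio_with_lags(train_cases, n_lags)
valid = growth_ratio_with_lags(valid_cases, n_lags)

print(train.index.tolist())
print(valid.index.tolist())
```

In this sketch the validation frame starts at day 14 rather than day 11: the ratio itself needs one earlier value and each lag needs one more, so `n_lags + 1` rows are dropped from the start of the validation set.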

Fixed for growth ratio, skipped for growth factor.