Regarding scaling of data
KarthikaKP commented
I have seen that `standardscaler.fit(X)` is being used, which scales the entire dataset. But the usual practice is to fit on the training data only and apply the same fitted scaler to the testing and validation sets. I am new to this field and don't know how to preprocess time series data. Kindly reply.
Seanny123 commented
You are absolutely correct and this is an embarrassing mistake which should be corrected.
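For reference, the correct pattern looks roughly like this (a minimal sketch assuming scikit-learn's `StandardScaler` and a simple chronological split; the array `X` and the 80/20 split fraction are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# placeholder feature array of shape (n_samples, n_features)
X = np.random.rand(1000, 82)

# split chronologically so the scaler never sees future data
train_size = int(0.8 * len(X))
X_train, X_test = X[:train_size], X[train_size:]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the train set only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```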
mikel-brostrom commented
I'll leave this piece of code here in case somebody needs to solve this issue and wants to reuse the output scaler to inverse-transform the predictions:
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def preprocess_data(dat, col_names, train_percentage):
    # read dataset. Shape: (40560, 82)
    proc_dat = dat.to_numpy()

    # create one dedicated scaler for the input data
    # and one for the output data
    in_data_scaler = MinMaxScaler()
    out_data_scaler = MinMaxScaler()

    # separate target from features: (40560, 1) | (40560, 81)
    mask = np.ones(proc_dat.shape[1], dtype=bool)
    dat_cols = list(dat.columns)
    for col_name in col_names:
        mask[dat_cols.index(col_name)] = False
    feats = proc_dat[:, mask]
    targs = proc_dat[:, ~mask]

    # fit the scalers on the train set only, so no statistics
    # leak from the test/validation portion
    train_size = int(train_percentage * len(dat))
    in_data_scaler.fit(feats[:train_size, :])
    out_data_scaler.fit(targs[:train_size, :])

    # transform features and targets for model training
    feats = in_data_scaler.transform(feats)
    targs = out_data_scaler.transform(targs)

    return feats, targs, in_data_scaler, out_data_scaler
```
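Usage could then look roughly like this (hypothetical sketch: `df` is the raw DataFrame, `'target'` is the column to predict, and `preds` stands in for the model's scaled predictions):

```python
feats, targs, in_scaler, out_scaler = preprocess_data(df, ['target'], train_percentage=0.8)

# ... train a model on feats/targs and predict, yielding `preds`
# in scaled space with the same shape as targs ...

# map the predictions back to the original units of the target column
preds_unscaled = out_scaler.inverse_transform(preds)
```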