Regarding scaling of data
KarthikaKP commented
I have seen that `standardscaler.fit(X)` is being used, which scales the entire dataset. But the usual practice is to fit on the training data only and apply the same fitted scaler to the testing and validation sets. I am new to this field and don't know how to preprocess time series data. Kindly reply.
Seanny123 commented
You are absolutely correct and this is an embarrassing mistake which should be corrected.
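For reference, the correct pattern looks roughly like this (a minimal sketch assuming scikit-learn's `StandardScaler` and a simple chronological split; the array `X` and the 80/20 split fraction are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# placeholder feature array of shape (n_samples, n_features)
X = np.random.rand(1000, 82)

# split chronologically so the scaler never sees future data
train_size = int(0.8 * len(X))
X_train, X_test = X[:train_size], X[train_size:]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the train set only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```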
mikel-brostrom commented
I'll leave this piece of code here in case somebody needs to solve this issue and wants to reuse the output scaler to inverse-transform the predictions:
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def preprocess_data(dat, col_names, train_percentage):
    # read dataset. Shape: (40560, 82)
    proc_dat = dat.to_numpy()

    # create one dedicated scaler for the input data
    # and one for the output data
    in_data_scaler = MinMaxScaler()
    out_data_scaler = MinMaxScaler()

    # separate target from features: (40560, 1) | (40560, 81)
    mask = np.ones(proc_dat.shape[1], dtype=bool)
    dat_cols = list(dat.columns)
    for col_name in col_names:
        mask[dat_cols.index(col_name)] = False
    feats = proc_dat[:, mask]
    targs = proc_dat[:, ~mask]

    # fit the scalers on the train set only, so no statistics
    # leak from the test/validation portion
    train_size = int(train_percentage * len(dat))
    in_data_scaler.fit(feats[:train_size, :])
    out_data_scaler.fit(targs[:train_size, :])

    # transform features and targets for model training
    feats = in_data_scaler.transform(feats)
    targs = out_data_scaler.transform(targs)

    return feats, targs, in_data_scaler, out_data_scaler
```
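Usage could then look roughly like this (hypothetical sketch: `df` is the raw DataFrame, `'target'` is the column to predict, and `preds` stands in for the model's scaled predictions):

```python
feats, targs, in_scaler, out_scaler = preprocess_data(df, ['target'], train_percentage=0.8)

# ... train a model on feats/targs and predict, yielding `preds`
# in scaled space with the same shape as targs ...

# map the predictions back to the original units of the target column
preds_unscaled = out_scaler.inverse_transform(preds)
```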