maxim5/time-series-machine-learning

why shuffling data?

ochoch opened this issue · 3 comments

Hello,
Nice and interesting work, I learned a lot.
While building the train and test datasets, why do you shuffle the data? I thought that for time series we should not shuffle the data.

data_utils.py

import numpy as np

def split_dataset(dataset, ratio=None):
  size = dataset.size
  if ratio is None:
    ratio = _choose_optimal_train_ratio(size)

  # Build a boolean mask with `train_size` True entries, then shuffle it,
  # so training rows are drawn from random positions across the whole series.
  mask = np.zeros(size, dtype=np.bool_)
  train_size = int(size * ratio)
  mask[:train_size] = True
  np.random.shuffle(mask)

  train_x = dataset.x[mask, :]
  train_y = dataset.y[mask]

  # The complement of the mask becomes the test set.
  mask = np.invert(mask)
  test_x = dataset.x[mask, :]
  test_y = dataset.y[mask]

  return DataSet(train_x, train_y), DataSet(test_x, test_y)

Regards,

Hi @ochoch I think you're right. At the time I thought shuffling the data was a good idea, but now I'd say it leads to overfitting and forward-looking (look-ahead) bias.
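For reference, a minimal sketch of a chronological split that avoids the look-ahead problem; `split_dataset_chronological` is a hypothetical name, and it assumes the same `DataSet(x, y)` container as the snippet above rather than being code from this repository:

def split_dataset_chronological(dataset, ratio=0.8):
  # Keep rows in time order and make a single cut, so the test set
  # always lies strictly after the training set in time.
  train_size = int(dataset.size * ratio)
  train_x, test_x = dataset.x[:train_size, :], dataset.x[train_size:, :]
  train_y, test_y = dataset.y[:train_size], dataset.y[train_size:]
  return DataSet(train_x, train_y), DataSet(test_x, test_y)

With a split like this, plain k-fold cross-validation should also give way to walk-forward (expanding-window) validation, so that every evaluation fold only sees data from the past.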

Hi @ochoch sorry for the delay.

Unfortunately that's the way it is: there is so much noise and so little signal in financial data. If you can find a reliable signal that is right more than 50% of the time, it's good enough and you can make money.

In terms of features: that's the key question. All ML algorithms that make money boil down to features. I haven't worked much on crypto data since then. Do you have any ideas in mind?