maxim5/time-series-machine-learning

why shuffling data?

ochoch opened this issue · 3 comments

Hello,
Nice and interesting work, I learned a lot.
While building the train and test datasets, why do you shuffle the data? I thought that for time series we should not shuffle the data.

data_utils.py

import numpy as np

def split_dataset(dataset, ratio=None):
  size = dataset.size
  if ratio is None:
    ratio = _choose_optimal_train_ratio(size)

  # Build a boolean mask with `train_size` True entries, then shuffle it,
  # so training rows are drawn from random positions across the whole series.
  mask = np.zeros(size, dtype=np.bool_)
  train_size = int(size * ratio)
  mask[:train_size] = True
  np.random.shuffle(mask)

  train_x = dataset.x[mask, :]
  train_y = dataset.y[mask]

  # The complement of the mask becomes the test set.
  mask = np.invert(mask)
  test_x = dataset.x[mask, :]
  test_y = dataset.y[mask]

  return DataSet(train_x, train_y), DataSet(test_x, test_y)

Regards,

Hi @ochoch I think you're right. At the time I thought shuffling the data was a good idea, but now I'd say it leads to overfitting and forward-looking (look-ahead) bias.
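For reference, a minimal sketch of a chronological split that avoids the look-ahead problem; `split_dataset_chronological` is a hypothetical name, and it assumes the same `DataSet(x, y)` container as the snippet above rather than being code from this repository:

def split_dataset_chronological(dataset, ratio=0.8):
  # Keep rows in time order and make a single cut, so the test set
  # always lies strictly after the training set in time.
  train_size = int(dataset.size * ratio)
  train_x, test_x = dataset.x[:train_size, :], dataset.x[train_size:, :]
  train_y, test_y = dataset.y[:train_size], dataset.y[train_size:]
  return DataSet(train_x, train_y), DataSet(test_x, test_y)

With a split like this, plain k-fold cross-validation should also give way to walk-forward (expanding-window) validation, so that every evaluation fold only sees data from the past.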

Hi @ochoch sorry for the delay.

Unfortunately that's the way it is: there is so much noise and so little signal in financial data. If you can find a reliable signal that is right more than 50% of the time, it's good enough and you can make money.

In terms of features: that's the key question. All ML algorithms that make money boil down to features. I haven't worked much on crypto data since then. Do you have any ideas in mind?