VivekPa/AIAlpha

Does Wavelet leak future price information into your input data?

liusida opened this issue · 6 comments

I noticed that most software uses Moving Average to smooth data, and the simple moving average obviously has a lag, so I am wondering if Wavelet leaks future price information into training input and test input data.
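One way to probe this question is to change only the last candle of a series and see whether the smoothed values for earlier candles move. The sketch below is a toy check, not the repo's code: `trailing_mean` and `haar_denoise` are hypothetical helpers, and the Haar step is hand-rolled in NumPy (one level, soft threshold at each band's standard deviation) rather than taken from PyWavelets, so sign conventions may differ from `pywt`.

```python
import numpy as np

def trailing_mean(x, w=4):
    """Causal moving average: out[t] uses only x[t-w+1 .. t]."""
    x = np.asarray(x, dtype=float)
    return np.array([x[max(0, t - w + 1): t + 1].mean() for t in range(len(x))])

def haar_denoise(x):
    """One-level Haar transform over the whole series, soft-threshold, inverse."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]          # assumes an even-length series
    ca = (even + odd) / np.sqrt(2)        # approximation coefficients
    cd = (even - odd) / np.sqrt(2)        # detail coefficients

    def soft(c):
        # The threshold is the std of ALL coefficients, so it depends on every candle.
        return np.sign(c) * np.maximum(np.abs(c) - np.std(c), 0.0)

    ca, cd = soft(ca), soft(cd)
    out = np.empty_like(x)
    out[0::2] = (ca + cd) / np.sqrt(2)
    out[1::2] = (ca - cd) / np.sqrt(2)
    return out

prices = 100 + 0.5 * np.arange(20)        # toy upward-trending series
shocked = prices.copy()
shocked[-1] += 5.0                        # change only the final (future) candle

past = slice(0, 10)                       # the first ten candles
ma_changed = not np.allclose(trailing_mean(prices)[past], trailing_mean(shocked)[past])
wav_changed = not np.allclose(haar_denoise(prices)[past], haar_denoise(shocked)[past])
print("moving average past values changed:", ma_changed)      # False
print("wavelet-denoised past values changed:", wav_changed)   # True
```

In this toy setup the trailing average of the first ten candles is untouched, while the denoised values shift because the threshold is computed from the full series. Whether that matters in practice depends on whether the transform is applied to the whole series or only to a window that ends before the candle being predicted.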

I would tend to agree -- the predictions seem too good to be true, especially in the case of (very) large variations that are correctly predicted. Any model would fail at these points. Not sure it comes from wavelets though, I'm currently investigating.
Anyway, AIAlpha is an amazing resource. Cheers!

I agree on this point; the results are pretty good for long variations as well.
This is one of my results:
[Figure: figure_1]

> I noticed that most software uses Moving Average to smooth data, and the simple moving average obviously has a lag, so I am wondering if Wavelet leaks future price information into training input and test input data.

I have made this mistake in so many ways, so many times... this is always the reason when I get good results (leaking future prices into "present") =(

I noticed that in the preprocessing.py file the loop takes the next 10 candles to perform the Haar wavelet transform and calculate the MACD:
x = np.array(self.stock_data.iloc[i: i + 11, j])
This is basically looking into the future.

Not sure about the wavelet transform, but the MACD should instead look back over the last 10 candles.
Something like:
if (i > 11): x = np.array(self.stock_data.iloc[i - 11: i, j])
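When writing a look-back window, the order of the slice bounds matters: a slice whose start is greater than its stop is silently empty rather than an error. A minimal sketch with a NumPy array standing in for one price column (the names `candles`, `window`, and the three result variables are made up for illustration; `iloc` row slicing follows the same start/stop rules):

```python
import numpy as np

candles = np.arange(100)   # stand-in for one price column; each value equals its row index
i, window = 50, 11

look_ahead = candles[i: i + window]        # rows 50..60: includes candles after row i
reversed_bounds = candles[i: i - window]   # start > stop: empty result, no error raised
look_back = candles[i - window: i]         # rows 39..49: strictly before row i

print(look_ahead.max(), reversed_bounds.size, look_back.max())   # 60 0 49
```

For `i` smaller than `window`, the start index goes negative and counts from the end of the array, so the window is not the one intended. The loop therefore needs to start at `i >= window`, which is the purpose of a guard like the `if` above.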

@JohnieBraaf

I found the same thing today. Medium featured his blog post in my daily digest, so I tried running this repo.

You know, the best feature engineering for prediction is using future information to predict the future.

This is a fraud. I decided to unsubscribe from my liberal fake news Medium mailing list.

The updated function is:

# Module-level imports this method relies on:
import numpy as np
import pandas as pd
import pywt

def make_wavelet_train(self):
        train_data = []
        test_data = []
        log_train_data = []
        for i in range(22,(len(self.stock_data)//10)*10 - 11):
            train = []
            log_ret = []
            for j in range(1, 6):
                # Look back only: rows i-11 .. i-1, so no candle at or after row i is used.
                x = np.array(self.stock_data.iloc[i-11:i,j])
                (ca, cd) = pywt.dwt(x, "haar")
                cat = pywt.threshold(ca, np.std(ca), mode="soft")
                cdt = pywt.threshold(cd, np.std(cd), mode="soft")
                tx = pywt.idwt(cat, cdt, "haar")
                log = np.diff(np.log(tx))*100
                macd = np.mean(x[5:]) - np.mean(x)
                # ma = np.mean(x)
                sd = np.std(x)
                log_ret = np.append(log_ret, log)
                x_tech = np.append(macd*10, sd)
                train = np.append(train, x_tech)
            train_data.append(train)
            log_train_data.append(log_ret)
        trained = pd.DataFrame(train_data)
        trained.to_csv("preprocessing/indicators.csv")
        log_train = pd.DataFrame(log_train_data, index=None)
        log_train.to_csv("preprocessing/log_train.csv")
        # auto_train = pd.DataFrame(train_data[0:800])
        # auto_test = pd.DataFrame(train_data[801:1000])
        # auto_train.to_csv("auto_train.csv")
        # auto_test.to_csv("auto_test.csv")
        rbm_train = pd.DataFrame(log_train_data[0:int(self.split*self.feature_split*len(log_train_data))], index=None)
        rbm_train.to_csv("preprocessing/rbm_train.csv")
        rbm_test = pd.DataFrame(log_train_data[int(self.split*self.feature_split*len(log_train_data))+1:
                                               int(self.feature_split*len(log_train_data))])
        rbm_test.to_csv("preprocessing/rbm_test.csv")
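        # Targets: log return of the candle that follows each look-back window
        # (row i-1 -> row i), over the same range of i as the feature loop above,
        # so features and targets have the same length and stay aligned.
        for i in range(22, (len(self.stock_data) // 10) * 10 - 11):
            y = 100*np.log(self.stock_data.iloc[i, 5] / self.stock_data.iloc[i - 1, 5])
            test_data.append(y)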
        test = pd.DataFrame(test_data)
        test.to_csv("preprocessing/test_data.csv")
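A quick way to check a windowed feature builder like this for look-ahead is a tamper test: alter every candle from row `i` onward and confirm the features for row `i` do not change. The sketch below is only an illustration under assumptions. `features_lookback`, `features_lookahead`, and `leaks_future` are hypothetical helpers, the two features copy the `macd` and `sd` expressions from the function above, and the wavelet step is left out so that only NumPy is needed.

```python
import numpy as np

WINDOW = 11

def _indicators(x):
    # Same expressions as the macd and sd lines in the function above.
    return np.array([np.mean(x[5:]) - np.mean(x), np.std(x)])

def features_lookback(close, i):
    """Features for row i from rows i-WINDOW .. i-1 only."""
    return _indicators(close[i - WINDOW: i])

def features_lookahead(close, i):
    """Features for row i from rows i .. i+WINDOW-1 (the original slicing)."""
    return _indicators(close[i: i + WINDOW])

def leaks_future(feature_fn, close, i):
    """True if the features for row i change when rows >= i are altered."""
    tampered = close.copy()
    tampered[i:] = tampered[i:] * 2.0 + 1.0   # rewrite everything from row i onward
    return not np.allclose(feature_fn(close, i), feature_fn(tampered, i))

close = 100 + np.sin(np.arange(60) / 3.0)     # toy closing prices

print("look-back leaks:", leaks_future(features_lookback, close, 30))    # False
print("look-ahead leaks:", leaks_future(features_lookahead, close, 30))  # True
```

The same check can be pointed at the full pipeline by comparing the row produced from the complete data with the row produced from data truncated at `i`. It treats the candle at row `i` as unknown at prediction time, which matches a target defined as the return from row `i-1` to row `i`.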