jankrepl/deepdow

raw_to_Xy producing weird results

turmeric-blend opened this issue · 5 comments

Hi, I am using a dataframe that looks something like this when printed in jupyter notebook:

Screenshot from 2020-11-19 10-10-58

which passed the assert checks and has

assert isinstance(df, pd.DataFrame) # True
assert isinstance(df.index, pd.DatetimeIndex) # True
assert isinstance(df.columns, pd.MultiIndex) # True

n_timesteps = len(df)
n_channels = len(df.columns.levels[1])
n_assets = len(df.columns.levels[0])

print(n_timesteps, n_channels, n_assets)

2188 timesteps, 5 channels, and 20 assets.

The dataframe has no NaN values: df.isnull().values.any() # False.

However, after passing it through raw_to_Xy() function,

lookback, gap, horizon = 5, 2, 4

X, timestamps, y, asset_names, indicators = raw_to_Xy(df, lookback=lookback,
                                                      gap=gap, freq='B',
                                                      horizon=horizon)

n_samples = n_timesteps - lookback - horizon - gap + 1

print('n_samples:', n_samples)
print('X.shape:', X.shape)
print('timestamps.shape:', timestamps.shape)
print('y.shape:', y.shape)
print('len(asset_names):', len(asset_names))
print('indicators', indicators)

returns

n_samples: 2178
X.shape: (2321, 5, 5, 19)
timestamps.shape: (2321,)
y.shape: (2321, 5, 4, 19)
len(asset_names): 19
indicators ['Close', 'High', 'Low', 'Open', 'Volume']

Two issues with what was returned:

  1. I somehow have more samples (2321, as shown in the X, timestamps, and y variables) than timestamps (2188). The expected number of samples is 2178, as computed by the n_samples variable.
  2. It returns fewer assets (19) than were given (20). See len(asset_names).

Thank you for the question:)

  1. The raw_to_Xy resamples the data so that the specified frequency freq is matched. That means you probably have some business days missing in your original dataframe, and the function forward fills them in the background. See how the new index is created:
    index = pd.date_range(start=raw_data.index[0], end=raw_data.index[-1], freq=freq)

With that being said, the variable n_samples was the expected number of samples for the dataset used in the documentation. It is not a general formula that will work for every dataset.
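To make this concrete, here is a sketch with a synthetic index constructed so the counts match this issue (the positions of the missing days are made up): once the missing business days are filled back in, the index, and hence the sample count, grows.

```python
import numpy as np
import pandas as pd

# Synthetic business-day index with 143 interior days removed, chosen so
# the counts match this issue: 2188 raw timesteps, 2331 after resampling.
full_index = pd.date_range("2012-01-02", periods=2331, freq="B")
missing = np.linspace(1, 2329, 143).astype(int)  # made-up "holiday" positions
raw_index = full_index.delete(missing)

# raw_to_Xy rebuilds a complete index over the same span and forward fills:
index = pd.date_range(start=raw_index[0], end=raw_index[-1], freq="B")

lookback, gap, horizon = 5, 2, 4
n_samples_raw = len(raw_index) - lookback - horizon - gap + 1        # 2178
n_samples_resampled = len(index) - lookback - horizon - gap + 1      # 2321

print(len(raw_index), len(index))  # 2188 2331
```

So the 2321 samples in X correspond to the resampled index of 2331 business days, not to the 2188 raw timestamps.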

  2. My guess is that one of your assets did not make it through the check below:
    is_valid = np.all(np.isfinite(new[a])) and np.all(new[a] > 0)

    That is, the channels of this asset contain at least one element equal to:
  • np.nan (but you probably excluded this)
  • infinity
  • a nonpositive number
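As a quick diagnostic, you can run the same check manually to see which asset gets dropped and why. A toy sketch (asset and indicator names are made up; new stands in for the resampled frame):

```python
import numpy as np
import pandas as pd

# Toy frame with the same (asset, indicator) column layout as in the issue;
# asset "B" has one zero Volume entry.
cols = pd.MultiIndex.from_product([["A", "B"], ["Close", "Volume"]])
new = pd.DataFrame(np.ones((4, 4)), columns=cols)
new.loc[2, ("B", "Volume")] = 0.0

to_exclude = []
for a in new.columns.levels[0]:
    # the same per-asset validity check used in raw_to_Xy
    is_valid = np.all(np.isfinite(new[a])) and np.all(new[a] > 0)
    if not is_valid:
        to_exclude.append(a)

print(to_exclude)  # ['B']: the zero volume fails the strict > 0 check
```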

The raw_to_Xy is a one-size-fits-all dataset preparation tool. I wrote it so that first-time users do not have to spend too much time on data preparation. I would definitely encourage you to adjust it for your needs.

The raw_to_Xy resamples the data so that the specified frequency freq is matched. That means that you probably have some business days missing in your original dataframe and the function forward fills them in the background

Filling every Monday to Friday via freq='B', even when a given date is a public holiday, is unusual to me: most trading repositories I come across use only the dates on which the market was actually open. Anyway, I just thought this was an interesting approach.

For the check is_valid = np.all(np.isfinite(new[a])) and np.all(new[a] > 0), it was the np.all(new[a] > 0) part that failed, because some of my trading volumes are 0. I think it would be appropriate to make it np.all(new[a] >= 0) instead: if no one trades a certain stock on a given day, its trading volume is 0.

On the other hand, maybe you would not even need the np.all(new[a] > 0) check at all, since it is possible to have a daily news sentiment score as an input channel (feature) as well, and those scores can be negative.

My guess is that np.all(new[a] > 0) was meant to make sure all prices are > 0, as they should be, but I think the probability of encountering a nonpositive price is very low, since most people download their price data from a source that ultimately gets it from the exchange, where prices are positive. Just food for thought :)

EDIT 1
I now realise that raw_to_Xy converts the inputs purely to returns. Does this mean the current setup of deepdow does not deal with other forms of features, e.g. volume or daily news sentiment?

EDIT 2
I have another suggestion for potentially making raw_to_Xy more useful. Maybe this line

asset_names = sorted(list(set(asset_names) - set(to_exclude)))

which excludes the assets that do not pass the validity check, should come after

absolute = new.iloc[:, new.columns.get_level_values(0).isin(asset_names)][asset_names] # sort
This way, if it is known beforehand that the 'High' price has some missing/bad values for, say, asset A, and I set included_indicators to exclude 'High' as shown:

X, timestamps, y, asset_names, indicators = raw_to_Xy(df, lookback=lookback, gap=gap, freq='B', horizon=horizon,  included_indicators=['Open', 'Low', 'Close'])

the returned X and y won't drop the entire asset A. This would give included_indicators a kind of dual purpose:

  1. excluding indicators that are not required
  2. excluding indicators with known errors

The same goes for included_assets. More food for thought :)
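A minimal sketch of the suggested reordering (all names here are illustrative, not the actual raw_to_Xy internals): slicing the indicators first means an asset with bad values only in an excluded channel survives the validity check.

```python
import numpy as np
import pandas as pd

# Toy data: asset "A" has a bad (negative) value only in the "High" channel.
cols = pd.MultiIndex.from_product([["A", "B"], ["Open", "High", "Close"]])
new = pd.DataFrame(np.ones((4, 6)), columns=cols)
new.loc[1, ("A", "High")] = -1.0

included_indicators = ["Open", "Close"]  # user excludes the known-bad channel

# Suggested order: slice indicators first, then run the validity check.
sliced = new.loc[:, new.columns.get_level_values(1).isin(included_indicators)]

asset_names = ["A", "B"]
to_exclude = [a for a in asset_names
              if not (np.all(np.isfinite(sliced[a])) and np.all(sliced[a] > 0))]

print(to_exclude)  # []: with "High" excluded, asset "A" is no longer dropped
```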

Regarding the timestamps, the motivation behind implementing it this way was that I was afraid that first-time users would feed in data with big gaps and irregularities along the time dimension. Creating the rolling window features in that case would not really make much sense. If you are confident that your datetime index is the right one, I encourage you to just drop this resampling in the code.

As you correctly pointed out, raw_to_Xy converts absolute quantities in the initial DataFrame to relative quantities (log(x_t / x_{t-1}) if use_log=True). It does so for all channels. That is why the positivity condition is present in the code (division by zero is not great). So if in your case the starting indicators are:

  • Open price
  • High price
  • Low price
  • Close price
  • Volume

They are converted to

  • Open return
  • High return
  • Low return
  • Close return
  • Volume return (percent change)
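As a minimal sketch of that conversion for a single channel (using the log(x_t / x_{t-1}) form mentioned above; this is not the actual raw_to_Xy code):

```python
import numpy as np
import pandas as pd

# Illustrative absolute values for one channel of one asset
prices = pd.Series([100.0, 102.0, 101.0, 103.0])

# use_log=False style: simple returns x_t / x_{t-1} - 1
simple = prices.pct_change().dropna()

# use_log=True style: log returns log(x_t / x_{t-1})
log_ret = np.log(prices / prices.shift(1)).dropna()

# This is why the positivity check exists: a zero or negative value
# would break the division / logarithm.
print(simple.values, log_ret.values)
```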

If you wonder why this is done, then there are 2 reasons:

  • The loss function machinery assumes that one of the channels of the target tensor y represents single-period price returns. You can use the returns_channel parameter present in the constructor of all relevant losses (see https://github.com/jankrepl/deepdow/blob/master/deepdow/losses.py) to tell the loss function which channel it is. The remaining channels do not matter for the computation of most final losses.
  • It is common practice in a lot of time series applications to perform first order differencing to obtain a time series that is more stationary. Returns are definitely closer to being stationary than absolute quantities (prices, volumes). Note that the same argument can also be seen from the ML point of view: having standardized features is a good thing. Features made out of returns will be closer to being correctly standardized than features made out of prices.

With that being said, nothing prevents you from having channels with absolute values. To do this you would simply adjust raw_to_Xy such that the return computation is skipped. Additionally, deepdow is able to do standardization
(see the example https://deepdow.readthedocs.io/en/latest/auto_examples/end_to_end/getting_started.html#sphx-glr-auto-examples-end-to-end-getting-started-py) so you do not have to worry that much about gradient descent not working correctly.

As I said, the raw_to_Xy is making a lot of design choices so that the first-time user does not have to worry about them:) Anyway, I really appreciate your feedback and I am definitely open to making things simpler and more transparent:)

Hmm, yes, after thinking about it more I agree that raw_to_Xy should just act as a quick start to the deepdow library. Input features could vary a lot depending on the network type as well as what the user generally feels like putting in (e.g. for experimentation), from returns to raw prices, volume, technical indicators, or even news-based data, and so the data preprocessing step could be drastically different as well.

By the way, I also have an EDIT 2 in my previous response that you might want to take a look at ;)

Regarding the EDIT 2, I do see your point:) You are more than welcome to create a pull request with your suggested change:)