raw_to_Xy producing weird results
turmeric-blend opened this issue · 5 comments
Hi, I am using a dataframe that looks something like this when printed in a Jupyter notebook. It passes the following assert checks:

```python
assert isinstance(df, pd.DataFrame)            # True
assert isinstance(df.index, pd.DatetimeIndex)  # True
assert isinstance(df.columns, pd.MultiIndex)   # True

n_timesteps = len(df)
n_channels = len(df.columns.levels[1])
n_assets = len(df.columns.levels[0])
print(n_timesteps, n_channels, n_assets)
```

There are 2188 timesteps, 5 channels, and 20 assets. The dataframe has no NaN values (`df.isnull().values.any()` returns `False`).
However, after passing it through the `raw_to_Xy()` function:

```python
lookback, gap, horizon = 5, 2, 4
X, timestamps, y, asset_names, indicators = raw_to_Xy(df, lookback=lookback,
                                                      gap=gap, freq='B',
                                                      horizon=horizon)

n_samples = n_timesteps - lookback - horizon - gap + 1
print('n_samples:', n_samples)
print('X.shape:', X.shape)
print('timestamps.shape:', timestamps.shape)
print('y.shape:', y.shape)
print('len(asset_names):', len(asset_names))
print('indicators', indicators)
```

it returns

```
n_samples: 2178
X.shape: (2321, 5, 5, 19)
timestamps.shape: (2321,)
y.shape: (2321, 5, 4, 19)
len(asset_names): 19
indicators ['Close', 'High', 'Low', 'Open', 'Volume']
```
There are two issues in what was returned:

- I somehow have more samples (2321, as shown in the `X`, `timestamps`, and `y` variables) than timesteps (2188). The expected number of samples is 2178, as computed in the `n_samples` variable.
- It returns fewer assets (19) than given (20). See `len(asset_names)`.
Thank you for the question:)
- The `raw_to_Xy` function resamples the data so that the specified frequency `freq` is matched. That means you probably have some business days missing in your original dataframe, and the function forward fills them in the background. See how the new index is created here: Line 257 in f123a82. With that being said, the `n_samples` variable was the expected number of samples for the dataset in the documentation. However, it is not a general formula that will work for every dataset.
- My guess is that one of your assets did not make it through the below check: Line 263 in f123a82. That is, the channels of this asset contain at least one element equal to:
  - `np.nan` (but you probably excluded this)
  - infinity
  - a nonpositive number
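As a minimal illustration of this check (a sketch, not the library's actual code; the arrays and the `is_valid` helper here are hypothetical):

```python
import numpy as np

# is_valid mirrors the condition referenced above:
# all entries finite AND strictly positive.
def is_valid(x):
    return bool(np.all(np.isfinite(x)) and np.all(x > 0))

clean    = np.array([101.2, 99.8, 100.5])    # passes
zero_vol = np.array([1200., 0., 800.])       # fails: contains a 0 (nonpositive)
broken   = np.array([101.2, np.inf, 100.5])  # fails: contains infinity

print([is_valid(a) for a in (clean, zero_vol, broken)])  # [True, False, False]
```

An asset failing this check for any of its channels is dropped entirely, which would explain going from 20 assets to 19.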
The `raw_to_Xy` function is a one-size-fits-all dataset preparation tool. I wrote it so that first-time users do not have to spend too much time on data preparation. I would definitely encourage you to adjust it to your needs.
> The `raw_to_Xy` resamples the data so that the specified frequency `freq` is matched. That means that you probably have some business days missing in your original dataframe and the function forward fills them in the background.
Making sure the input has Monday to Friday filled using `freq='B'`, even though there might be a public holiday on a given date, is unusual to me, as most trading repositories I come across just use the dates when the market is open. Anyway, I just thought this was an interesting approach.
For the check `is_valid = np.all(np.isfinite(new[a])) and np.all(new[a] > 0)`, it was the `np.all(new[a] > 0)` part that was failing, because some of my trading volumes are 0. I think it would be appropriate to make it `np.all(new[a] >= 0)` instead, since if no one trades a certain stock on a given day, its trading volume is 0.
On the other hand, maybe you wouldn't even need the `np.all(new[a] > 0)` check at all, as it is possible to have a daily news sentiment score as an input channel (feature), and those scores can be less than 0. My guess is that `np.all(new[a] > 0)` was meant to make sure all prices are > 0, as they should be, but I think the probability of encountering a negative price is very low, as most people download their price data from a source that ultimately gets it from the exchange, where prices are positive. Just food for thought :)
EDIT 1
I now realise that `raw_to_Xy` converts inputs purely to returns. Does this mean the current setup of `deepdow` does not deal with other forms of features, e.g. volume or daily news sentiment?
EDIT 2
I have another suggestion for potentially making `raw_to_Xy` more useful. Maybe this line, which excludes assets that do not meet the assert checks (Line 267 in f123a82), should come after Line 269 in f123a82 and Line 269 in f123a82. This way, if it is known beforehand that the 'High' price has some missing/bad values for, say, asset A, and I set `included_indicators` to exclude 'High' as shown:

```python
X, timestamps, y, asset_names, indicators = raw_to_Xy(df, lookback=lookback, gap=gap, freq='B', horizon=horizon, included_indicators=['Open', 'Low', 'Close'])
```

the returned `X` and `y` won't drop the entire asset A. This gives `included_indicators` a kind of dual purpose:

- excluding indicators that are not required
- excluding indicators with known errors

The same goes for `included_assets`. Another food for thought :)
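The effect of filtering indicators before the validity check can be sketched with a toy MultiIndex frame (the column names and the deliberately bad 'High' channel are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: 2 assets x 2 indicators, with a known-bad 'High' channel for asset A.
cols = pd.MultiIndex.from_product([['A', 'B'], ['Close', 'High']])
df = pd.DataFrame(np.ones((3, 4)), columns=cols)
df[('A', 'High')] = np.nan

# Selecting only the wanted indicators first means the bad channel never
# reaches the validity check, so asset A survives.
filtered = df.loc[:, (slice(None), ['Close'])]
print(filtered.columns.get_level_values(0).unique().tolist())  # ['A', 'B']
```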
Regarding the timestamps, the motivation behind implementing it this way was that I was afraid that first time users would feed data with big gaps and irregularities along the time dimension. Creating the rolling window features this way would not really make much sense. If you are confident that your datetime index is the right one I encourage you to just drop this resampling in the code.
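For reference, the reindex-and-forward-fill behaviour described above can be sketched with plain pandas (toy dates; this is a sketch, not the library's actual implementation):

```python
import pandas as pd

# Toy index missing one business day (2021-01-06, e.g. a public holiday).
idx = pd.DatetimeIndex(['2021-01-04', '2021-01-05', '2021-01-07', '2021-01-08'])
close = pd.Series([100., 101., 103., 104.], index=idx)

# Reindex to the full business-day range and forward fill the gap.
full_idx = pd.date_range(idx[0], idx[-1], freq='B')
resampled = close.reindex(full_idx).ffill()

print(len(close), len(resampled))  # 4 5
print(resampled['2021-01-06'])     # 101.0 (forward filled from Jan 5)
```

This is how the number of rows, and hence the number of samples, can grow after resampling even though no raw data was added.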
As you correctly pointed out, `raw_to_Xy` converts absolute quantities in the initial DataFrame to relative quantities (`log(x_t / x_{t-1})` if `use_log=True`). It does so for all channels. That is why the positivity condition is present in the code (division by zero is not great). So if in your case the starting indicators are:
- Open price
- High price
- Low price
- Close price
- Volume
They are converted to
- Open return
- High return
- Low return
- Close return
- Volume return (percent change)
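The per-channel conversion can be sketched as follows (a minimal example of `log(x_t / x_{t-1})`, not deepdow's internal code; the prices are made up):

```python
import numpy as np

close = np.array([100.0, 102.0, 101.0, 105.0])

# Single-period log returns: log(x_t / x_{t-1}). This requires strictly
# positive inputs, which is where the positivity check comes from.
log_returns = np.log(close[1:] / close[:-1])
print(log_returns.shape)  # (3,)
```

Note that the resulting series is one element shorter than the input, since the first timestep has no predecessor.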
If you wonder why this is done, there are 2 reasons:

- The loss function machinery assumes that one of the channels of the target tensor `y` represents the single-period price returns. You can use the `returns_channel` parameter present in the constructor of all relevant losses (see https://github.com/jankrepl/deepdow/blob/master/deepdow/losses.py) to tell the loss function which channel it is. The remaining channels do not matter for the computation of most final losses.
- It is common practice in a lot of time series applications to perform first-order differencing to obtain a time series that is more stationary. Returns are definitely closer to being stationary than absolute quantities (prices, volumes). Note that the same argument can also be seen from the ML point of view: having standardized features is a good thing. Features made out of returns will be closer to being correctly standardized than features made out of prices.
With that being said, nothing prevents you from having channels with absolute values. To do this, you would simply adjust `raw_to_Xy` so that the return computation is skipped. Additionally, `deepdow` is able to do standardization (see the example https://deepdow.readthedocs.io/en/latest/auto_examples/end_to_end/getting_started.html#sphx-glr-auto-examples-end-to-end-getting-started-py), so you do not have to worry that much about gradient descent not working correctly.
As I said, `raw_to_Xy` makes a lot of design choices so that the first-time user does not have to worry about them :) Anyway, I really appreciate your feedback and I am definitely open to making things simpler and more transparent :)
Hmm, yes, after thinking about it more I agree that `raw_to_Xy` should just act as a quick start to the `deepdow` library. Input features could vary a lot depending on the network type as well as what the user generally feels like putting in (e.g. for experimentation), from returns to raw prices, volume, technical indicators, or even news-based data, so the data preprocessing step could be drastically different as well.
By the way, I also have an EDIT 2 in my previous response that you might want to take a look at ;)
Regarding the EDIT 2, I do see your point:) You are more than welcome to create a pull request with your suggested change:)