ZacKeskin/PyCausality

An Error Reported

Closed this issue · 8 comments

I get the following error when I run "TE.nonlinear_TE(pdf_estimator = 'kernel', n_shuffles=100)":

self.add_results({'TE_XY': np.array(TEs)[:,0],

IndexError: too many indices for array

Thanks for raising this; I'll look into it. As a first guess, without investigating the codebase, it seems the TEs list containing the TE values for each window isn't lining up with the results, which probably means they've failed to be captured correctly.

Can you provide more information on the data you were using, and in particular the indexing/timeframes/windowing?

Edit: In fact, if possible, could you share code/data which reproduces the issue? Also, please check whether this is covered by another open issue, as the only times I've seen the add_results() method fail is when the same TransferEntropy object is used multiple times (a known issue with a simple work-around: just initialise multiple TransferEntropy objects).
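
For reference, here's a minimal sketch of that work-around (the DataFrame df and the column names 'X'/'Y' are placeholders, not code from this repo):

from PyCausality.TransferEntropy import TransferEntropy

# Work-around for the known issue: don't reuse a TransferEntropy object across
# calls; initialise a fresh one for each analysis.
te_first = TransferEntropy(DF=df, endog='Y', exog='X', lag=2)
te_first.nonlinear_TE(pdf_estimator='kernel', n_shuffles=100)

te_second = TransferEntropy(DF=df, endog='Y', exog='X', lag=2)   # fresh object for the second run
te_second.nonlinear_TE(pdf_estimator='kernel', n_shuffles=100)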

Having the same issue. Here are the details:

from PyCausality.TransferEntropy import *

display(df.tail())

output:


| close | volume | utime | returns | price | pcomplexity | pcxint | pcxema |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 10124.65 | 0.002593 | 1.569001e+09 | 0.59 | 10124.65 | NaN | 1.536162 | 1.466953 |
| 10124.41 | 0.003789 | 1.569001e+09 | -0.24 | 10124.41 | NaN | 1.531244 | 1.466953 |
| 10124.06 | 0.003873 | 1.569001e+09 | -0.35 | 10124.06 | NaN | 1.526326 | 1.466953 |
| 10124.16 | 0.421435 | 1.569001e+09 | 0.10 | 10124.16 | NaN | 1.521408 | 1.466953 |
| NaN | NaN | NaN | NaN | NaN | 1.51649 | 1.516490 | 1.483832 |

Then:

TE = TransferEntropy(DF = df,
        endog = 'pcxema',     # Dependent Variable
        exog = 'close',      # Independent Variable
        lag = 2,
        window_size = {'MS': 6},
        window_stride = {'W': 2}
)

TE.nonlinear_TE(pdf_estimator='kernel', n_shuffles=10)

which gives:

<ipython-input-37-9b4629af6516> in <module>
     11 )
     12 
---> 13 TE.nonlinear_TE()
     14 
     15 ## Display TE_XY, TE_YX and significance values

~/miniconda3/lib/python3.6/site-packages/PyCausality/TransferEntropy.py in nonlinear_TE(self, df, pdf_estimator, bins, bandwidth, gridpoints, n_shuffles)
    436 
    437         ## Store Transfer Entropy from X(t)->Y(t) and from Y(t)->X(t)
--> 438         self.add_results({'TE_XY' : np.array(TEs)[:,0],
    439                           'TE_YX' : np.array(TEs)[:,1],
    440                           'p_value_XY' : None,

IndexError: too many indices for array

I am unable to replicate your error (5 rows of data are not enough to perform the same analysis), but I can see a number of serious issues in your methodology which will invalidate the technique, and which probably explain the errors:

  • You still have NaNs in your data; you need to cleanse your data first.
  • You are using non-stationary data; you need to apply some differencing as a minimum.
  • You are passing the windowing parameters, but these only work if your data is indexed by datetime (which yours is not).

Each of these will cause real errors before you even get to the stage of analysing the results and tweaking your parameters to improve them; a rough sketch of the fixes is below.
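
As a rough illustration of those three fixes (a sketch only, not PyCausality's own code; the 'd_' column names are made up):

import pandas as pd
from PyCausality.TransferEntropy import TransferEntropy

df = df.dropna()                                   # 1. remove NaNs rather than imputing them
df.index = pd.to_datetime(df['utime'], unit='s')   # 2. index on datetime so the windowing parameters apply
df['d_close'] = df['close'].diff()                 # 3. difference the non-stationary series
df['d_pcxema'] = df['pcxema'].diff()
df = df.dropna()                                   # the first row of a differenced series is NaN

TE = TransferEntropy(DF=df,
                     endog='d_pcxema',
                     exog='d_close',
                     lag=2,
                     window_size={'MS': 6},
                     window_stride={'W': 2})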

Please take a look at the attached. I believe I need to improve the Readme and documentation for the package as well.

PyCausality - Attempt at Issue Replication.pdf

Hey Zac.
Yep, my data is indexed on time; each row is a second in a pandas datetime index (it didn't show in the snippet above, sorry for that).
[Screenshot: DataFrame with pandas datetime index]
On the other hand, I supposed NaNs might cause conflicts, so I filled the NaNs with the mean just to test it and see what happens. It didn't work either; same error.
Agree with improving the Readme though. This seems to be a really useful library.
Oh, and I'm not quite sure I get this: "You need to apply some differencing as a minimum"?

Okay, thanks. Filling with the mean is not something I'd advise for non-stationary data, so you need to look at the stationarity first. If you look at the attached PDF, I apply a differencing method to look at the change in price at each timestep, rather than the nominal price. This literally means creating a new series by taking the difference between S(t+1) and S(t) for all t. You might otherwise consider a percentage change, a log-difference, or even a first- or second-derivative approximation, but I don't think you can get results from the nominal data.
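
For concreteness, those transformations look like this in pandas (just a sketch; 'close' stands in for whichever series you analyse):

import numpy as np

price = df['close']               # nominal, non-stationary series

diffed = price.diff()             # S(t) - S(t-1): simple differencing
pct_chg = price.pct_change()      # percentage change
log_diff = np.log(price).diff()   # log-difference (log-returns)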

The important point is that, in calculating the transfer entropy, the distribution of results is observed as a probability density function, estimated using a histogram, kernel, or other nonparametric density estimator; if the underlying series is non-stationary, that estimated distribution isn't stable across the sample, which is why stationarity matters here.
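
To illustrate what such a nonparametric estimate looks like (using plain NumPy/SciPy here, not PyCausality's internals):

import numpy as np
from scipy.stats import gaussian_kde

returns = df['close'].diff().dropna().values       # differenced (stationary) series

hist, bin_edges = np.histogram(returns, bins=20, density=True)   # histogram estimate of the density

kde = gaussian_kde(returns)                        # Gaussian kernel estimate of the same density
grid = np.linspace(returns.min(), returns.max(), 200)
density = kde(grid)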

Try that. If you can replicate my PDF, just using your (cleansed, stationary) data, then you shouldn't see any errors; if you do, then I'll probably need to see your code/data to debug it.

OK Zac, the differencing transformation fixed the error. Thanks.
For me (but maybe I'm wrong), since entropy is a measure that can be computed on any signal, transfer entropy should also be computable for any pair (or set) of signals.

> The important point is that, in calculating the transfer entropy, the distribution of results is observed as a probability density function

Can you point me to some papers to help me understand this? Also, how can I interpret the results?
[Screenshot: results output]
Thanks a lot for your time.

I would need to check the maths again to be 100% sure; however, I have only ever used stationary time series, and the code is designed and tested against such data. The question you want to answer is 'how much does a change in A cause/predict a move in B?', which implies differenced data; if you use nominal data you have lots of numbers around 10,000 for 'close' and lots of numbers around 0.92 for 'pcxema', but... 'so what?', right?

For interpretation: TE values are notoriously difficult to interpret in absolute terms (see https://link.springer.com/article/10.1140/epjb/e2002-00379-2), which is the purpose of the shuffles: they provide a null distribution against which significance can be measured. The p-value corresponds to the traditional p-value, assuming normally-distributed statistics in the atemporal case (shuffled realisations). I tend to use the z-score, which (like the p-value above) can be treated much like a typical z-score in a standard normal distribution; i.e. z > 3 means roughly 99% of shuffled results will be less than this. I tend to think z > 5 implies a real coupling relation; you can tweak the parameters of the analysis (e.g. KDE bandwidth, windowing) to identify where the significance is greatest.
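
To make that concrete, the shuffle-based z-score is roughly the following (a sketch of the idea, not the package's exact internal code):

import numpy as np

def shuffle_z_score(te_observed, te_shuffled):
    """z-score of the observed TE against TEs computed on shuffled (atemporal) surrogates."""
    te_shuffled = np.asarray(te_shuffled, dtype=float)
    return (te_observed - te_shuffled.mean()) / te_shuffled.std(ddof=1)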

I also meant to add that the most relevant paper for this is my own: https://arxiv.org/abs/1906.05740

Great. Thank you man.