SugiharaLab/rEDM

version 1.15 dealing with NaN

ecosan327 opened this issue · 4 comments

Hello!

I have a few questions about how missing data (NaN) are handled in EDM functions such as Simplex, CCM and S-map.
I understand that for Takens' theorem to work, the continuity of the data is important for reconstructing the shadow manifold.
Unfortunately, my data have some gaps (missing data points) between trials, so it would be helpful to know how I could handle this in rEDM.

Questions:
(1) The note for rEDM version 1.15 mentions that:
"SMap() ignoreNan parameter added. If ignoreNan is TRUE (default) the library is redefined to ignore embedding vectors with nan.
If ignoreNan is FALSE no change is made, the user can manually specify library segments in lib."

I also found a note in rEDM version 1.2.3 that says:
"Missing data can be recorded using either of the standard NA or NaN values. The program will automatically ignore such missing values when appropriate. For instance, simplex projection will not select nearest neighbors if any of the state vector coordinates is missing or if the corresponding target value is missing."

I am wondering whether the S-map ignoreNan option in version 1.15 works the same way as version 1.2.3 did, i.e. by simply not selecting nearest neighbors if any of the state vector coordinates is missing or if the corresponding target value is missing?

(2) Does rEDM version 1.15 also ignore NaN (like version 1.2.3) for Simplex, EmbedDimension and CCM?

Perhaps a little background for context could be helpful.

In Simplex and its derivatives (EmbedDimension, CCM, ...), nan in the data are passed through the entire | embedding : knn : projection | pipeline. As such, any nan in the data are automatically rendered in the library, excluded in prediction, and properly represented in the output.

SMap embeds the data, then creates a linear system matrix solved with a LAPACK/BLAS SVD. LAPACK does not allow nan. In versions 1.14 and earlier, time series rows that contained nan were removed prior to the SVD. This effectively prevented any library vectors with nan, but also created gaps in the output and raised the question of whether Takens embedding remains theoretically valid.
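
To make the LAPACK constraint concrete, here is a minimal base R sketch (not rEDM's internal code). R's own svd() refuses non-finite input, and dropping the affected rows before solving, roughly the approach described for versions 1.14 and earlier, lets the solve go through but leaves fewer rows than time steps. The matrix A and vector b are made-up illustration data.

```r
# Toy linear system A %*% coef = b with one NaN in A (illustrative data only)
A <- cbind(1, c(0.10, 0.20, NaN, 0.40, 0.50))
b <- c(1.2, 1.4, 1.6, 1.8, 2.0)

# base::svd() stops with an error when the input contains NaN
svd_result <- tryCatch(svd(A), error = function(e) conditionMessage(e))
print(svd_result)

# Row-removal workaround: drop rows containing NaN, then solve via SVD
keep <- complete.cases(A) & !is.na(b)
s    <- svd(A[keep, ])
coef <- s$v %*% ((t(s$u) %*% b[keep]) / s$d)   # least squares via pseudo-inverse
print(drop(coef))    # intercept and slope of the toy system
print(sum(keep))     # fewer rows than time steps, i.e. a gap
```

The solve succeeds, but row 3 is simply gone, which is the kind of gap in the output mentioned above.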

The SMap ignoreNan parameter in version 1.15 is new; it adjusts the library to ignore all embedding vectors with nan. This should properly represent the output with nan as appropriate, rather than the previous method that returned gaps in the output.

So the answer to the first question is no, version 1.15 SMap does not handle nan in the same way as versions 1.14 and earlier.

The answer to the second question is yes, Simplex based functions ignore nan. However, not by redefining the library. Since the numerical computations are all internal and nan are carried through | embedding : knn : projection |, any projection influenced by a nan will return nan.
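
As a rough illustration of how a nan is carried through to the projection, here is a toy simplex-style weighted average in base R. The project() helper and the numbers are hypothetical and only mimic the idea of an exponentially weighted mean of neighbor targets; this is not the cppEDM code.

```r
# Toy simplex-style projection: exponentially weighted mean of neighbor targets
# (hypothetical helper for illustration; not the cppEDM implementation)
project <- function(dist, target) {
  w <- exp(-dist / min(dist))    # closer neighbors get larger weights
  sum(w * target) / sum(w)
}

dist    <- c(0.1, 0.2, 0.3)                     # distances to E + 1 = 3 neighbors
clean   <- project(dist, c(0.48, 0.50, 0.52))   # all neighbor targets present
tainted <- project(dist, c(0.48, NaN, 0.52))    # one neighbor target is NaN

print(clean)     # a finite value between 0.48 and 0.52
print(tainted)   # NaN: the missing value propagates into the projection
```

No library redefinition is needed for this to happen; the arithmetic alone turns the affected projection into nan.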


On a related note, one can also consider handling missing data with "bundle embedding":
An empirical dynamic modeling framework for missing or irregular samples

Thank you so much for clarifying how the different versions handle NaN in Simplex and in S-map.
Could you explain in more detail how the ignoreNan parameter adjusts the library?
I am curious how the library is adjusted in the state space (does it change the shadow manifold?) in order to cope with the gap issue.
I think this is interesting and really helpful. =)

It is a bit complex since E, Tp and tau all influence the availability of valid embedding vectors in response to a NaN. Simply put, when a NaN is present, no prediction should be made with a library vector that has a NaN neighbor (a function of E, tau) or where Tp would include a vector with a NaN component. Recall that projections are made by taking neighbors projected Tp time steps ahead (behind) in Simplex, while all neighbors are used in SMap.
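
A small base R sketch of the bookkeeping may help. The nan_rows() helper is hypothetical (it is not part of rEDM) and assumes the embedding vector at time t is (x[t], x[t + tau], ..., x[t + (E - 1) tau]) with a negative tau; the series is a toy stand-in, not the circle data.

```r
# Find time indices whose embedding vector contains a NaN.
# Hypothetical helper for illustration; rEDM handles this internally.
nan_rows <- function(x, E, tau = -1) {
  lag  <- abs(tau)
  n    <- length(x)
  # Column k holds x shifted back by k * lag steps; padding uses NA, not NaN
  cols <- lapply(0:(E - 1), function(k) {
    shift <- k * lag
    c(rep(NA_real_, shift), x[seq_len(n - shift)])
  })
  emb <- do.call(cbind, cols)
  which(apply(emb, 1, function(r) any(is.nan(r))))
}

x     <- sin(seq(0, by = 0.0631, length.out = 50))   # toy series
x[10] <- NaN
bad   <- nan_rows(x, E = 2)

print(bad)        # embedding vectors at times 10 and 11 contain the NaN
print(bad + 1)    # Tp =  1: projections land on times 11 and 12
print(bad - 1)    # Tp = -1: projections land on times 9 and 10
```

With a larger E or |tau| the set of affected vectors widens accordingly, which is why the adjustment depends on all three parameters.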

Perhaps some examples can illustrate.

Insert a NaN into observation x[10]:

library( rEDM )
df = circle
dim( df )
[1] 200   3

head( df, 2 )
  Time      x     y
1    1 0.0000 1.000
2    2 0.0631 0.998

df $ x[10] = NaN
df[ 8:12, ]
   Time      x      y
8     8 0.4278 0.9039
9     9 0.4840 0.8751
10   10    NaN 0.8428
11   11 0.5903 0.8072
12   12 0.6401 0.7683

Simplex

Simplex prediction with E=2, Tp=1 and library including NaN observation.

Note that Time 11 & 12 do not have a prediction, since Tp = 1, E = 2. The NaN prediction at Time 9 is likely due to a neighbor that included a component of the NaN in its embedding vector.

> Simplex( dataFrame = df, lib = '1 50', pred = '5 15',
           columns = 'x', target = 'x', E = 2, Tp = 1 )
   Time Observations Predictions Pred_Variance
1     5       0.2499         NaN           NaN
2     6       0.3105      0.2451      0.011215
3     7       0.3699      0.3056      0.010833
4     8       0.4278      0.3648      0.010375
5     9       0.4840         NaN           NaN
6    10          NaN      0.4183      0.002957
7    11       0.5903         NaN           NaN
8    12       0.6401         NaN           NaN
9    13       0.6873      0.6162      0.008243
10   14       0.7318      0.6816      0.006449
11   15       0.7733      0.7260      0.005701
12   16       0.8118      0.7676      0.004952

In the case of Tp = -1, we expect Time 9 & 10 to not have a prediction with E = 2:

> Simplex( dataFrame = df, lib = '1 50', pred = '5 15',
           columns = 'x', target = 'x', E = 2, Tp = -1 )
   Time Observations Predictions Pred_Variance
1     4       0.1883      0.2034     0.0030564
2     5       0.2499      0.2648     0.0029724
3     6       0.3105      0.3251     0.0028645
4     7       0.3699      0.3841     0.0027358
5     8       0.4278      0.4500     0.0040664
6     9       0.4840         NaN           NaN
7    10          NaN         NaN           NaN
8    11       0.5903      0.6460     0.0003792
9    12       0.6401      0.6518     0.0018654
10   13       0.6873      0.6983     0.0016638
11   14       0.7318      0.7420     0.0014626
12   15       0.7733         NaN           NaN

Perhaps this is clearer in the case where the observation (target) does not have a NaN, but the library still does. Here we use target = 'y' and see no predictions at Time 11 & 12:

Simplex( dataFrame = df, lib = '1 50', pred = '5 15',
         columns = 'x', target = 'y', E = 2, Tp = 1 )
   Time Observations Predictions Pred_Variance
1     5       0.9683         NaN           NaN
2     6       0.9506      0.3156        0.8291
3     7       0.9291      0.3042        0.8033
4     8       0.9039      0.2916        0.7716
5     9       0.8751      0.2779        0.7345
6    10       0.8428     -0.2784        0.7446
7    11       0.8072         NaN           NaN
8    12       0.7683         NaN           NaN
9    13       0.7264     -0.2757        0.5360
10   14       0.6815      0.1936        0.4916
11   15       0.6340      0.1740        0.4369
12   16       0.5839      0.1539        0.3822

SMap

SMap is a bit different since all library vectors are processed (but localized with theta), and the SVD solver does not allow NaN. ignoreNan (default TRUE) redefines the library to exclude the appropriate vectors (gaps) in the library according to E, Tp, tau.

The cross mapping example with SMap (columns = 'x', target = 'y'):

> SMap( dataFrame = df, lib = '1 50', pred = '5 15',
        columns = 'x', target = 'y', theta = 2, E = 2, Tp = 1 ) [['predictions']]
   Time Observations Predictions Pred_Variance
1     5       0.9683         NaN           NaN
2     6       0.9506      0.9506        1.9172
3     7       0.9291      0.9289        1.9033
4     8       0.9039      0.9044        1.8924
5     9       0.8751      0.8750        1.8894
6    10       0.8428      0.8428        1.7217
7    11       0.8072         NaN           NaN
8    12       0.7683         NaN           NaN
9    13       0.7264      0.7270        1.3985
10   14       0.6815      0.6811        1.1857
11   15       0.6340      0.6346        1.0133
12   16       0.5839      0.5829        0.8618

Prior to version 1.15 and ignoreNan, one could achieve a similar result by explicitly specifying validLib to exclude NaN.

Create a validLib vector. Recall df $ x[10] is nan, so the initial validLib has FALSE in row 10. Add FALSE to row 11 for the E = 2, Tp = 1 example:

> validLib = !is.nan(df $ x)
> validLib[11] = FALSE
> validLib[5:15]
[1] 1 1 1 1 1 0 0 1 1 1 1

Now using validLib = validLib, ignoreNan = FALSE:

> SMap( dataFrame = df, lib = '1 50', pred = '5 15', 
        columns = 'x', target = 'y', theta = 2, E = 2, Tp = 1, 
        validLib = validLib, ignoreNan = FALSE ) [['predictions']]
   Time Observations Predictions Pred_Variance
1     5       0.9683         NaN           NaN
2     6       0.9506      0.9506        1.7717
3     7       0.9291      0.9289        1.7709
4     8       0.9039      0.9044        1.7573
5     9       0.8751      0.8750        1.7315
6    10       0.8428      0.8428        1.7055
7    11       0.8072         NaN           NaN
8    12       0.7683         NaN           NaN
9    13       0.7264      0.7270        1.3448
10   14       0.6815      0.6811        1.1575
11   15       0.6340      0.6346        0.9989
12   16       0.5839      0.5829        0.8551
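
The manual validLib construction above can be generalized to other E and tau with a few lines of base R. This is only a sketch: make_valid_lib() is a hypothetical helper, not an rEDM function, and it considers NaN in the embedded column only. If the target column also contained NaN, rows whose Tp-shifted target is missing would presumably need to be excluded as well.

```r
# Build a logical validLib that is FALSE wherever the embedding vector
# (x[t], x[t - lag], ..., x[t - (E - 1) * lag]) would contain a NaN.
# Hypothetical helper for illustration; assumes tau is negative as in the examples.
make_valid_lib <- function(x, E, tau = -1) {
  lag   <- abs(tau)
  ok    <- !is.nan(x)
  valid <- ok
  for (k in seq_len(E - 1)) {
    # ok shifted forward by k * lag steps; the first rows are padded with TRUE
    shifted <- c(rep(TRUE, k * lag), head(ok, -k * lag))
    valid   <- valid & shifted
  }
  valid
}

x     <- seq(0, 1, length.out = 20)   # toy series
x[10] <- NaN

validLib <- make_valid_lib(x, E = 2)
print(which(!validLib))                    # rows 10 and 11, as set by hand above
print(which(!make_valid_lib(x, E = 3)))    # rows 10, 11 and 12
```

The resulting vector could then be passed as validLib = validLib together with ignoreNan = FALSE, as in the SMap call above.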

In contrast, if one uses ignoreNan = FALSE with no validLib, all predictions are NaN, since all neighbors (library vectors) are used, including the embedding vectors affected by the NaN in row 10.

SMap( dataFrame = df, lib = '1 50', pred = '5 15',
      columns = 'x', target = 'y', theta = 2, E = 2, Tp = 1,
      ignoreNan = FALSE ) [['predictions']]
   Time Observations Predictions Pred_Variance
1     5       0.9683         NaN           NaN
2     6       0.9506         NaN           NaN
3     7       0.9291         NaN           NaN
4     8       0.9039         NaN           NaN
5     9       0.8751         NaN           NaN
6    10       0.8428         NaN           NaN
7    11       0.8072         NaN           NaN
8    12       0.7683         NaN           NaN
9    13       0.7264         NaN           NaN
10   14       0.6815         NaN           NaN
11   15       0.6340         NaN           NaN
12   16       0.5839         NaN           NaN
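
One way to see why a single NaN affects every prediction: S-map style weights depend on the distances to all library vectors. The base R sketch below uses the commonly cited weighting exp(-theta * d / mean(d)) on made-up numbers. It is an illustration under that assumption, not the cppEDM code.

```r
# Library of E = 2 embedding vectors built from the toy series
# 0.0, 0.1, 0.2, NaN, 0.4, 0.5 (one missing observation)
lib <- rbind(c(0.1, 0.0),
             c(0.2, 0.1),
             c(NaN, 0.2),
             c(0.4, NaN),
             c(0.5, 0.4))
query <- c(0.25, 0.15)   # an arbitrary prediction point with no NaN
theta <- 2

# Euclidean distance from the query to every library vector
d <- sqrt(rowSums(sweep(lib, 2, query)^2))
w <- exp(-theta * d / mean(d))     # mean(d) is NaN, so every weight is NaN
print(w)

# Excluding the affected library vectors restores finite weights
ok   <- !is.nan(d)
w_ok <- exp(-theta * d[ok] / mean(d[ok]))
print(w_ok)
```

Because every library vector contributes, the NaN reaches every prediction point, whereas in the nearest-neighbor case only predictions that draw on the affected vectors are lost.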

For a peek under the hood, the code that actually creates the library vector is here:
https://github.com/SugiharaLab/cppEDM/blob/c41f7f5b16d3b13895523f0ae4b541b45babdcb2/src/Parameter.cc#L394

While the SMap code to adjust lib if NaN are found is here:
https://github.com/SugiharaLab/cppEDM/blob/c41f7f5b16d3b13895523f0ae4b541b45babdcb2/src/API.cc#L449


Perhaps the warning issued by SMap, "Time delay embedding presumption violated.", is a bit extreme, as it is not clear-cut whether or not the embedding violates the Takens presumption for a specific prediction.

Thank you so much, especially the examples!