jogonmar/MVBatch

Missing data imputation in MVBatch


Note that MVBatch does not currently integrate any routine to impute missing data during model building.

  1. We might load a data set containing missing values, which might jeopardize the outcome of the batch synchronization.
    Sol.:
    a) Use the PCA model building routine to impute missing data. Unfortunately, we cannot add lags to the data matrix because the batch data are not synchronized. We would therefore have to unfold the data array variable-wise and use a considerable number of PCs (selected by ckf).
    b) Alternatively, state in the user manual that the user should treat missing values with Abel and Francisco's tool prior to loading the data set into MVBatch.
  2. If the Multisynchro algorithm is used for synchronization, incomplete batches may produce shorter warping profiles than those derived from complete batches. We would then end up with a three-way array containing missing data, which must be taken into account when building and exploiting the model. Recall that we build the model every time we project a new sample.
    Sol.: When calling the monitoring window, we detect whether the data array contains missing values and, if so, impute them with the PCA model building routine, this time using a batch-wise calibration. The drawback is that we cannot rely on ckf to determine the optimal number of PCs for missing data imputation, due to the downward trend of CUMPRESS.

Please let me know your opinion on this issue so that I can continue.
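
To make the two unfoldings discussed above concrete, here is a minimal sketch in Python rather than in the toolbox's own code. The function names and the nested-list data layout (one K_i x J list of rows per batch) are assumptions made for illustration, not MVBatch routines, and the column order of the batch-wise unfolding is only one of the conventions in use. It shows why unsynchronized batches can only be unfolded variable-wise: batch-wise unfolding needs every batch to have the same number of samples.

```python
def unfold_variable_wise(batches):
    """Stack batches (each a K_i x J list of rows) into one (sum of K_i) x J matrix.

    Works for unsynchronized batches because only the number of
    variables J has to match, not the batch length K_i.
    """
    return [list(row) for batch in batches for row in batch]


def unfold_batch_wise(batches):
    """Turn synchronized batches (each K x J) into an I x (K*J) matrix.

    Each batch becomes a single row. The columns are ordered time-major
    here (all variables at time 1, then time 2, ...), which is an
    assumption of this sketch.
    """
    lengths = {len(batch) for batch in batches}
    if len(lengths) != 1:
        raise ValueError("batch-wise unfolding needs batches of equal length")
    return [[value for row in batch for value in row] for batch in batches]
```

With two batches of 3 and 2 samples, `unfold_variable_wise` returns a 5 x J matrix, whereas `unfold_batch_wise` raises an error until the batches have been synchronized to a common length.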

What about missTSR3D? It is already integrated in the toolbox, in the modelling part, and it supports different unfoldings.

missTSR3D can be used as well. My main concern is the impact on the imputation when the data are not synchronized and the non-stationary behavior has not been removed. However, the procedure mentioned in my previous email is the only one we can use in that situation, as is also the case when the data are not equalized.

For now, I suggest that I integrate the routine, with some code in the monitoring window, to handle missing values. We can then decide whether to implement missing data imputation at the early stages of the bilinear modeling cycle.

missTSR3D cannot be used in the early modeling steps because the batch data are not synchronized. To overcome this limitation, I have designed and implemented a user interface that imputes missing data for each batch separately. This new module is invoked when at least one batch contains missing values, and it is called after the screening GUI and before the Alignment GUI.
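
The condition that triggers the new module can be sketched as follows. This is an illustrative Python version, not the code in MVBatch: the function names are hypothetical, and it assumes each batch is a list of rows with missing entries stored as NaN.

```python
import math


def batches_with_missing(batches):
    """Return the indices of the batches that contain at least one NaN."""
    return [
        i
        for i, batch in enumerate(batches)
        if any(math.isnan(value) for row in batch for value in row)
    ]


def needs_imputation(batches):
    """True when at least one batch contains missing values."""
    return len(batches_with_missing(batches)) > 0
```

Returning the indices, rather than a single flag, lets the imputation step treat only the affected batches, one at a time.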

Unfortunately, at this step of the bilinear modeling we can only use the within-run dynamics to impute missing values, which is not ideal. Another drawback of this strategy is that the non-stationary behavior of the process variables remains and might affect the imputation quality. To mitigate this effect, the user is advised to add lags to the two-way matrix (close to BW) or, when only a few lags are added, to use a substantial number of PCs.
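
For reference, adding lags to the two-way matrix of a single batch can be sketched as below. This is a plain Python illustration under assumptions: `add_lags` is a hypothetical name, the batch is a K x J list of rows, and the column order (current sample first, then increasingly older samples) is just one possible convention. Missing entries (NaN) are simply carried into the lagged matrix, where the imputation routine would then deal with them.

```python
def add_lags(matrix, n_lags):
    """Build the lagged matrix of a K x J batch.

    Row k of the result is [x(k), x(k-1), ..., x(k-n_lags)], so the
    output has K - n_lags rows and J * (n_lags + 1) columns.
    """
    if n_lags < 0:
        raise ValueError("n_lags must be non-negative")
    if n_lags >= len(matrix):
        raise ValueError("n_lags must be smaller than the number of samples")
    lagged = []
    for k in range(n_lags, len(matrix)):
        row = []
        for lag in range(n_lags + 1):
            row.extend(matrix[k - lag])
        lagged.append(row)
    return lagged
```

Each added lag costs one row and adds J columns, so a number of lags close to the batch length leaves very few rows, which is the trade-off behind choosing between many lags and many PCs.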