Problem with parallel = TRUE and large datasets, when running TCA
I am having some difficulties making use of your integration of the parallel R package when running TCA on a multi-core machine with a large dataset (N > 1400). The code seems to get stuck in the iteration process for this large dataset, but only when setting parallel = TRUE, not when setting parallel = FALSE. Could this be an error in the code, or am I perhaps missing something? I am able to run it smoothly on smaller datasets with either parallel = TRUE or parallel = FALSE.
Thanks in advance!
Can you please attach the exact command you're running before it gets stuck and the output log file after setting debug = TRUE?
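For reference, a call along the following lines is the kind of thing we'd like to see (a hypothetical sketch; my_X and my_W are placeholders for your own matrices):

```r
library(TCA)

# Placeholder objects; replace with your own data.
# my_X: methylation matrix (sites by samples); my_W: cell-type proportions (samples by cell types).
res <- tca(X = my_X,
           W = my_W,
           parallel = TRUE,  # the setting under which the run gets stuck
           debug = TRUE)     # writes detailed progress messages to the log file
```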
The fact that you can run the same code on smaller data suggests that your execution does not have sufficient memory resources (although the error in the log does not reflect that).
Note that the TCA model assumes independence between features (sites). You can therefore always break your data matrix into chunks, each having a different set of features (e.g., in the case of methylation you can break it into chromosomes), and run TCA on each chunk - this will require much less memory for a given chunk.
One thing to keep in mind: breaking your data into chunks as described above makes sense only if you are not learning W (i.e., if you set refit_W = FALSE). If refit_W = TRUE, then we create dependencies between the different features in the learning process, and we therefore cannot break the data into chunks. We can overcome this if we first learn W from a small subset of features that are informative with respect to W (e.g., in the case of methylation, we can typically use a couple of hundred features that are highly informative for learning the cell-type composition W); see the parameters refit_W.features and refit_W.sparsity for more details. Given the new estimate of W, we can then break the data into chunks and use the new W while setting refit_W = FALSE for each chunk.
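As a rough illustration of this two-step workflow (a sketch only; the subset of sites, the number of chunks, and the assumption that the refitted W is returned as the W element of the output are all illustrative):

```r
library(TCA)

# Step 1: re-estimate W once from a small, informative subset of sites.
# refit_W.sparsity asks for a few hundred informative sites to be used;
# refit_W.features could alternatively name specific sites.
fit_w <- tca(X = X_subset, W = W_initial,
             refit_W = TRUE,
             refit_W.sparsity = 500)
W_new <- fit_w$W  # assuming the refitted W is returned as the W element

# Step 2: split the full matrix into chunks of sites and fit each chunk
# independently, keeping W fixed (refit_W = FALSE removes the dependency
# between features that learning W would introduce).
chunk_id <- cut(seq_len(nrow(X_full)), breaks = 10, labels = FALSE)
fits <- lapply(split(rownames(X_full), chunk_id), function(sites) {
  tca(X = X_full[sites, , drop = FALSE], W = W_new, refit_W = FALSE)
})
```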
Dear Professor Rahmani,
Following your comment above, I'm attempting to write a wrapper to perform tca() over chunks of the data matrix X (CpG sites). I noticed that in ./R/model_fit.R/tca.fit_means_vars, the estimation of tau_hat uses information from X. Hence, each fit of tca() to a chunk of the data matrix X would return its own estimate of tau_hat. Does this mean there are dependencies between the sites induced by tau_hat?
I used test_data with test_data(30, 200000, 6, 1, 1, 0.01) and compared a sequential run against a parallel run in which X is chunked into 21 chunks. The parallel run returned 21 values of tau_hat, all very close to the true value of 0.01. Other than tau_hat, the estimates are correlated > 0.999 between the two methods. However, I'm not sure whether I can chunk the data in this way with real data.
Thank you so much, and I look forward to hearing back from you.
Sincerely,
Hung
tau models a component of iid variation that affects all CpGs in the data. The dependency between all features in the data therefore comes from using all CpGs for estimating tau; however, we probably don't need a very large number of CpGs in order to estimate just a single variable, so in cases where splitting the data is needed for practical reasons I think it's fine to neglect this dependency. If you want to be extra cautious, I would randomly split the CpGs into chunks (rather than by chromosome number or by any other arbitrary criterion). You can also use the tau argument to provide the models of all chunks with the same tau if you want to guarantee consistency between the models (and set it, for example, to tau_est from the first model), but I don't think this is necessary or better.
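In code, that would look roughly as follows (an illustrative sketch; the chunk count and object names are placeholders, and I'm assuming the estimate is returned as tau_hat):

```r
# Randomly assign CpGs (rows of X) to chunks rather than splitting by chromosome.
set.seed(1)
sites <- sample(rownames(X))
chunks <- split(sites, cut(seq_along(sites), breaks = 20, labels = FALSE))

# Optionally, estimate tau once and pass it to the remaining chunks via the
# tau argument so that all models share the same value.
fit1 <- tca(X = X[chunks[[1]], , drop = FALSE], W = W)
fits <- lapply(chunks[-1], function(s) {
  tca(X = X[s, , drop = FALSE], W = W,
      tau = fit1$tau_hat)  # assuming the estimate is returned as tau_hat
})
```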
Dear Professor Rahmani,
I've written an initial wrapper for running tca() over chunks of X at my fork of TCA. In combination with some optimizations, I achieved a 13.3x speed-up of tca(), and the returned estimates have cor > 0.99 with those of a sequential run.
The optimizations are as follows; more details are in this report.
- Replacing lm() and anova() calls with RcppEigen::fastLm() and a minimal implementation of the partial F-test (a sketch follows this list). lm.fit() can be used instead of RcppEigen::fastLm() if minimal dependencies are desired.
- Replacing pracma::repmat() calls with MESS::repmat() calls. MESS::repmat() is significantly faster and more resource-efficient; this adds up because repmat() is called repeatedly.
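For illustration, the partial F-test mentioned above can be computed directly from two lm.fit() fits along these lines (a generic sketch, not the exact code in the fork):

```r
# Partial F-test: does the full design explain significantly more variance
# than the reduced (nested) design? Computed from residual sums of squares.
partial_f_test <- function(y, X_reduced, X_full) {
  fit_r <- lm.fit(X_reduced, y)
  fit_f <- lm.fit(X_full, y)
  rss_r <- sum(fit_r$residuals^2)
  rss_f <- sum(fit_f$residuals^2)
  df_r  <- length(y) - fit_r$rank   # residual degrees of freedom, reduced model
  df_f  <- length(y) - fit_f$rank   # residual degrees of freedom, full model
  fstat <- ((rss_r - rss_f) / (df_r - df_f)) / (rss_f / df_f)
  pf(fstat, df1 = df_r - df_f, df2 = df_f, lower.tail = FALSE)  # p-value
}
```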
The wrappers are as follows; details about the behavior of tca_split() are in this report.
- split_input() takes the X matrix, splits it into roughly equal chunks, and shuffles the rows as you suggested. I also added a print method so that RStudio does not hang trying to print a list of huge matrices, as well as a dim method.
- tca_split() is a wrapper around tca() that replaces the X input with the output of split_input(). The parallel back end is from the {future} package, to facilitate possible parallelization over multiple nodes on HPC with {future.batchtools}; it can also be changed to {parallel}. A condensed sketch of the combined behavior follows below.
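Roughly, the two wrappers together behave like the following condensed sketch (the function names here are illustrative, not the fork's actual split_input() and tca_split()):

```r
library(future.apply)                # parallel lapply; backend chosen via future::plan()
future::plan(future::multisession)   # or a future.batchtools plan on an HPC cluster

# Split the rows (sites) of X into roughly equal, shuffled chunks.
split_rows <- function(X, n_chunks) {
  idx <- sample(nrow(X))
  lapply(split(idx, cut(seq_along(idx), breaks = n_chunks, labels = FALSE)),
         function(i) X[i, , drop = FALSE])
}

# Fit tca() on each chunk in parallel, passing any other arguments through.
fits <- future_lapply(split_rows(X, n_chunks = 20),
                      function(Xc) TCA::tca(X = Xc, W = W, refit_W = FALSE),
                      future.seed = TRUE)
```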
Thank you so much for your guidance. Reading your code has been a learning experience for me. I'm still relatively new to R and open-source software, so please excuse any messiness. I can write PRs if needed, since I also have most of the unit tests written up.
Sincerely,
This sounds great! I'm traveling this week. Let me take a look at your fork when I'm back (next week).