pagnani/PlmDCA.jl

Memory usage


Hello,

Thanks a lot for the amazing, easy-to-install software; I got it running basically on the first try. One quick question/issue:

Running plmDCA on an MSA (N = 1786, M = 8500) ran the first step fine (calculating the pl's), but then an "OutOfMemory" error occurred. I fear this is due to the enormous size of my alignment/protein, and to the fact that I'm running it on a common laptop. To mitigate the error: can I estimate the memory requirements, or is there a way of distributing the calculations so that this type of error cannot occur?

Kind regards and thanks a lot in advance,

Jakob

Hi Jakob

I tried to run plmDCA on a randomly generated dataset

julia> N = 1786; M = 8500; Z = rand(1:21, N, M); W = ones(M)/M;
julia> plmdca(Z,W)
site = 746	 pl = 1.0779	 time = 239.7688	exit status = FTOL_REACHED
site = 448	 pl = 1.0798	 time = 260.3371	exit status = FTOL_REACHED
site = 1639	 pl = 1.0802	 time = 278.4344	exit status = FTOL_REACHED
site = 1044	 pl = 1.0783	 time = 279.5671	exit status = FTOL_REACHED
site = 1	 pl = 1.0788	 time = 279.8741	exit status = FTOL_REACHED
site = 299	 pl = 1.0787	 time = 288.6054	exit status = FTOL_REACHED
site = 1342	 pl = 1.0775	 time = 289.1239	exit status = FTOL_REACHED
site = 895	 pl = 1.0798	 time = 289.7156	exit status = FTOL_REACHED
site = 1491	 pl = 1.0777	 time = 289.8558	exit status = FTOL_REACHED

on my MacBook Pro (16 GB of RAM). It's slow, but it is running (it occupies 12.7 GB of RAM).

So you need to find a computer with at least 13 GB of RAM available for your process.
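If it helps for sizing things up front, you can check the available RAM from within Julia itself (standard Base calls, nothing specific to PlmDCA):

julia> Sys.total_memory() / 2^30   # total physical RAM, in GiB
julia> Sys.free_memory() / 2^30    # RAM currently free, in GiB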

Other distribution strategies are not possible with the present code.

Closing as not actionable.

Hey Andrea,

Perfect, thanks a lot for the estimate and the quick reply; this helped a lot. This means that as long as I stay within these boundaries, calculations like this should be fine. I was worried that some extensive amount of RAM was required; I'll just manage my available memory a bit more carefully then.

Just for confirmation: I think the error occurred exactly after finishing the pl's. Did you run this long, too?

Kind regards,

Jakob

Hi Jakob

> Just for confirmation: I think the error occurred exactly after finishing the pl's. Did you run this long, too?

I just ran the snippet I sent you. The "pl" computation is indeed taken care of by DCAUtils.jl, and it might be that this step requires more memory to run; the relevant part of the code is in that package.

You could try running the code with

plmdca("file.fasta", theta=0) 

which should skip the sequence-reweighting step and possibly reduce the memory required.
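For context, what theta=0 skips is essentially an O(M^2) pairwise pass over the sequences. A simplified sketch of the idea (not the actual DCAUtils implementation):

# Simplified sketch of sequence reweighting (not the actual DCAUtils code):
# a sequence's weight is 1 / (number of sequences within Hamming distance theta*N of it)
function reweight(Z::Matrix{Int8}, theta::Real)
    N, M = size(Z)
    theta == 0 && return ones(M)      # theta = 0: uniform weights, no O(M^2) pass
    counts = ones(M)                  # each sequence counts itself
    for i in 1:M-1, j in i+1:M
        d = count(k -> Z[k, i] != Z[k, j], 1:N) / N
        if d < theta
            counts[i] += 1
            counts[j] += 1
        end
    end
    return 1 ./ counts
end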

Hi Jakob
It's not entirely clear what your problem is now. Before, it was the pre-processing step that computes the weights (which can be removed by setting theta=0).

For such large datasets, I advise splitting the problem into two parts, as in the sketch below:

  1. Read the FASTA file to generate the alignment and the weights (Z, W).
  2. Run plmdca using the method plmdca(Z, W).
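Something along these lines should work; a sketch assuming DCAUtils's read_fasta_alignment/compute_weights, with a max-gap fraction of 0.9 and the usual theta = 0.2 (adjust both to your data):

using DCAUtils, PlmDCA

# Step 1: build the alignment and the weights separately, so the memory-hungry
# pre-processing can be monitored on its own
Z = read_fasta_alignment("file.fasta", 0.9)   # 0.9 = max gap fraction per sequence (assumption)
W, Meff = compute_weights(Z, 0.2)             # theta = 0.2; returns unnormalized weights
W ./= sum(W)                                  # normalize, matching the ones(M)/M convention above

# Step 2: fit the model on the precomputed (Z, W)
res = plmdca(Z, W)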

If the pre-processing is not the problem, then consider that:

  1. the memory of the (symmetrized) model is $L(L-1)/2 \cdot q^2 + L \cdot q$, where $L$ is the number of residues in the alignment and $q = 21$ for proteins;
  2. there is an $M$ (number of sequences) dependence only in storing the dataset, but the memory is typically dominated by the model's parameters.

A simple computation shows that, fixing $L = 2000$ and $q = 21$, the model should be ~14 GB.
Indeed, the number of parameters actually stored is $L^2 q^2 + L q$ (each site fits its own copy of the couplings before symmetrization), which comes to $\approx 1.76 \times 10^9$; at 8 bytes per Float64 that is ~14 GB.
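The arithmetic, spelled out in plain Julia:

julia> L = 2000; q = 21;
julia> nparams = L^2 * q^2 + L * q    # full coupling tensor plus the fields
1764042000
julia> nparams * 8 / 1e9              # 8 bytes per Float64 parameter, in GB
14.112336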

I tried the following

julia> L = 2000; q = 21; M = 2_000;
julia> Z = rand(Int8(1):Int8(q), L, M); W = ones(M)/M;
julia> res = plmdca(Z, W);

By inspecting the running memory, I get ~15 GB of memory consumption (a bit above my estimate).
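To attribute the memory more precisely, one can also inspect the objects directly; Base.summarysize is standard Julia, though what res contains depends on the PlmDCA version:

julia> Base.summarysize(Z) / 2^20     # the Int8 alignment: L*M bytes, ~3.8 MiB here
julia> Base.summarysize(res) / 2^30   # the fitted output, in GiB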