pagnani/PlmDCA.jl

Memory usage


Hello,

Thanks a lot for the amazing, easy-to-install software; I got it running basically on the first try. One quick question/issue:

Running plmDCA on an MSA (N = 1786, M = 8500) ran the first step fine (calculating the pl's), but then an "OutOfMemory" error occurred. I fear this is due to the enormous size of my alignment/protein, and to the fact that I'm running it on a common laptop. To mitigate the error: can I estimate the memory requirements, or is there a way of distributing the calculations so that this type of error cannot occur?

Kind regards and thanks a lot in advance,

Jakob

Hi Jakob

I tried to run plmDCA on a randomly generated dataset

julia> N = 1786; M = 8500; Z = rand(1:21, N, M); W = ones(M)/M;
julia> plmdca(Z,W)
site = 746	 pl = 1.0779	 time = 239.7688	exit status = FTOL_REACHED
site = 448	 pl = 1.0798	 time = 260.3371	exit status = FTOL_REACHED
site = 1639	 pl = 1.0802	 time = 278.4344	exit status = FTOL_REACHED
site = 1044	 pl = 1.0783	 time = 279.5671	exit status = FTOL_REACHED
site = 1	 pl = 1.0788	 time = 279.8741	exit status = FTOL_REACHED
site = 299	 pl = 1.0787	 time = 288.6054	exit status = FTOL_REACHED
site = 1342	 pl = 1.0775	 time = 289.1239	exit status = FTOL_REACHED
site = 895	 pl = 1.0798	 time = 289.7156	exit status = FTOL_REACHED
site = 1491	 pl = 1.0777	 time = 289.8558	exit status = FTOL_REACHED

on my MacBook Pro (16 GB of RAM). It's slow, but it is running (it occupies 12.7 GB of RAM).

So you need to find a computer with at least 13 GB of RAM available for your process.
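If it helps for sizing things up front, you can check the available RAM from within Julia itself (standard Base calls, nothing specific to PlmDCA):

julia> Sys.total_memory() / 2^30   # total physical RAM, in GiB
julia> Sys.free_memory() / 2^30    # RAM currently free, in GiB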

Other distribution strategies are not possible with the present code.

Closing as not actionable.

Hey Andrea,

Perfect, thanks a lot for the estimate and the quick reply; this helped a lot. This means that as long as I stay within these boundaries, calculations like this should be fine. I was worried that some extensive amount of RAM was required; I'll just manage my available memory a bit more carefully then.

Just for confirmation: I think the error occurred exactly after finishing the pl's. Did you run this long, too?

Kind regards,

Jakob

Hi Jakob

> Just for confirmation: I think the error occurred exactly after finishing the pl's. Did you run this long, too?

I just ran the snippet I sent you. The "pl" computation is indeed taken care of by DCAUtils.jl, and it might be that this step requires more memory to run; the relevant part of the code is in that package.

You could try running the code with

plmdca("file.fasta", theta=0) 

which should skip the sequence-reweighting step and possibly reduce the memory required.
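For context, what theta=0 skips is essentially an O(M^2) pairwise pass over the sequences. A simplified sketch of the idea (not the actual DCAUtils implementation):

# Simplified sketch of sequence reweighting (not the actual DCAUtils code):
# a sequence's weight is 1 / (number of sequences within Hamming distance theta*N of it)
function reweight(Z::Matrix{Int8}, theta::Real)
    N, M = size(Z)
    theta == 0 && return ones(M)      # theta = 0: uniform weights, no O(M^2) pass
    counts = ones(M)                  # each sequence counts itself
    for i in 1:M-1, j in i+1:M
        d = count(k -> Z[k, i] != Z[k, j], 1:N) / N
        if d < theta
            counts[i] += 1
            counts[j] += 1
        end
    end
    return 1 ./ counts
end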

Hi Jakob
It's not entirely clear what your problem is now. Before, it was the pre-processing step that computes the weights (which can be removed by setting theta=0).

For such large datasets, I advise splitting the problem into two parts, as in the sketch below:

  1. Read the FASTA file to generate the alignment and the weights (Z, W).
  2. Run plmdca using the method plmdca(Z, W).
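Something along these lines should work; a sketch assuming DCAUtils's read_fasta_alignment/compute_weights, with a max-gap fraction of 0.9 and the usual theta = 0.2 (adjust both to your data):

using DCAUtils, PlmDCA

# Step 1: build the alignment and the weights separately, so the memory-hungry
# pre-processing can be monitored on its own
Z = read_fasta_alignment("file.fasta", 0.9)   # 0.9 = max gap fraction per sequence (assumption)
W, Meff = compute_weights(Z, 0.2)             # theta = 0.2; returns unnormalized weights
W ./= sum(W)                                  # normalize, matching the ones(M)/M convention above

# Step 2: fit the model on the precomputed (Z, W)
res = plmdca(Z, W)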

If the pre-processing is not the problem, then consider that:

  1. the memory of the (symmetrized) model is $L(L-1)/2 \cdot q^2 + L \cdot q$, where $L$ is the number of residues in the alignment and $q = 21$ for proteins;
  2. there is an $M$ (number of sequences) dependence only in storing the dataset, but the memory is typically dominated by the model's parameters.

A simple computation shows that, fixing $L = 2000$ and $q = 21$, the model should be ~14 GB.
Indeed, the number of parameters actually stored is $L^2 q^2 + L q$ (each site fits its own copy of the couplings before symmetrization), which comes to $\approx 1.76 \times 10^9$; at 8 bytes per Float64 that is ~14 GB.
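The arithmetic, spelled out in plain Julia:

julia> L = 2000; q = 21;
julia> nparams = L^2 * q^2 + L * q    # full coupling tensor plus the fields
1764042000
julia> nparams * 8 / 1e9              # 8 bytes per Float64 parameter, in GB
14.112336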

I tried the following

julia> L = 2000; q = 21; M = 2_000;
julia> Z = rand(Int8(1):Int8(q), L, M); W = ones(M)/M;
julia> res = plmdca(Z, W);

By inspecting the running memory, I get ~15 GB of memory consumption (a bit above my estimate).
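To attribute the memory more precisely, one can also inspect the objects directly; Base.summarysize is standard Julia, though what res contains depends on the PlmDCA version:

julia> Base.summarysize(Z) / 2^20     # the Int8 alignment: L*M bytes, ~3.8 MiB here
julia> Base.summarysize(res) / 2^30   # the fitted output, in GiB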