miRge3 killed

Question

miRge3 killed

Opened this issue 8 months ago · 5 comments

When I tried to process 37 FASTQ files, the collapsing time kept increasing and miRge3 was killed after 20 files had been collapsed. Can I process the files separately, or is there another way to process multiple files?
This is my command:
miRge3.0 -s $(ls 2.CleanFq/*gz | sed ':label;N;s/\n/,/;b label') -lib ~/paper/reference/annotation/miRge -on human -db miRBase -o 3.Results/tmp -nmir -ai -cpu 2 -ie -gff -tcf -spl -NX

Answer 1 · 2024-01-10T09:53:27.000Z

In addition, what is the difference between processing multiple samples separately and processing them all at once?

Answer 2 · 2024-01-10T16:33:52.000Z

Hi @voluptatis,

The process was killed because it ran out of memory. At each step of the collapsing process, miRge3.0 combines the collapsed reads and read counts from each sample in a Pandas dataframe, so memory use grows as more samples are added.

After a miRge3.0 run you get the miRNA counts and RPM values (the expression matrix) for each sample. If you run samples individually, you end up with one expression matrix per sample. If you run several samples together (let's say 10), you get the expression matrix of all 10 samples in one file. That is the one advantage for secondary analysis; however, you can also combine individual samples later in an Excel file.
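As an alternative to combining the per-sample tables by hand, the merge can be scripted. The following is only a minimal sketch, not part of miRge3.0: it assumes each per-sample table is a CSV whose first column holds the miRNA name and whose remaining columns are sample columns, which may not match the real output layout exactly. The function names `merge_count_tables` and `write_merged` are hypothetical.

```python
import csv


def merge_count_tables(paths):
    """Outer-join per-sample tables on the first (miRNA name) column.

    Assumed layout (not confirmed against miRge3.0 output): a header row,
    then one row per miRNA with the name first and one value per sample.
    """
    merged = {}   # miRNA name -> {sample name: value as text}
    samples = []  # sample columns, in the order they were read
    for path in paths:
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            columns = next(reader)[1:]
            for column in columns:
                if column in samples:
                    raise ValueError(f"duplicate sample column: {column}")
            samples.extend(columns)
            for row in reader:
                if not row:
                    continue  # skip blank lines
                merged.setdefault(row[0], {}).update(zip(columns, row[1:]))
    return samples, merged


def write_merged(samples, merged, out_path):
    """Write one combined table; miRNAs absent from a sample get 0."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["miRNA"] + samples)
        for name in sorted(merged):
            writer.writerow([name] + [merged[name].get(s, "0") for s in samples])
```

Only one input file is open at a time, and the result is a single table with one column per sample, similar in shape to what a combined run would produce.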

Also, I don't understand the sed command; I hope it is not interrupting the run. Can you try one sample and let me know how it goes?

Thank you,
Arun.

Answer 3 · 2024-01-11T03:44:28.000Z

The result of the sed command is the list of samples. If I want to analyse isomiRs and novel miRNAs, should I process multiple samples together to keep the results consistent?


Answer 4 · 2024-01-11T04:47:49.000Z

In addition, I have a suggestion: when many files are processed at the same time, could you write an intermediate file of the Pandas dataframe after each file is processed and merge them afterwards? That would free up memory and might actually increase efficiency. Alternatively, you could split the pipeline into steps that can be run either all at once or one at a time.


Answer 5 · 2024-01-11T22:54:36.000Z

Hi @voluptatis,

Whether you can run multiple samples together depends on the system's memory. If you run individual samples and then merge them later, the results will still be consistent.
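One way to stay within memory is to split the samples into smaller batches and start one run per batch. The sketch below only builds the command lines and does not run them. It reuses the `-s`, `-lib`, `-on`, `-db` and `-o` options from the command in the question; the batch size, the `batch_NN` output folder naming and the function names are assumptions made for illustration.

```python
def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_commands(fastq_files, batch_size, lib_dir, out_root):
    """Return one miRge3.0 argument list per batch of samples.

    Samples are joined with commas for -s, as the sed command in the
    question does. Output folders (batch_01, batch_02, ...) are a
    hypothetical naming scheme.
    """
    commands = []
    batches = chunk(sorted(fastq_files), batch_size)
    for index, batch in enumerate(batches, start=1):
        commands.append([
            "miRge3.0",
            "-s", ",".join(batch),
            "-lib", lib_dir,
            "-on", "human",
            "-db", "miRBase",
            "-o", f"{out_root}/batch_{index:02d}",
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(command, check=True)`, one after another, so that only one batch is held in memory at a time. The remaining options from the original command can be appended to each list as needed, and a batch size of 1 corresponds to running every sample individually.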

Thank you for the suggestion. Creating an intermediate file is possible (for example, a pickle of the dataframe), but combining them all later takes the same amount of time and may still fail because it exceeds the capacity of the RAM. It also reduces the overall speed of the software. However, I will keep this in mind and come up with an alternative in the future.

Thank you,
Arun.