memory issues
Yyeserin opened this issue · 6 comments
Hello,
Is there a way to calculate the RAM requirement for the program? I run the program on a cluster, and the jobs are killed because of high memory usage. E.g. my latest job exceeded 256 GB of RAM and was killed. Is that normal?
Thanks,
Yeserin.
Generally, fastq-multx is fairly low memory. It streams the data so that it doesn't accumulate in memory. The only thing that might consume memory is the number of barcodes. We use our job default of 32G memory for every run in our sequencing facility and we've never run out of memory, so I suspect something else is causing your memory consumption.
Output files are built up continually, so if your job ran for any length of time, you should have at least partial output. Do you have outputs with data in them? I'd suggest running some small files to get some outputs and confirm the parameters are correct. Also, if you're running paired end, the two files must have the same number of lines (i.e. not individually quality filtered) and the pairs must be in the same order.
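Those two paired-end requirements (equal line counts, matching read order) are easy to check up front. A minimal bash sketch, assuming standard four-line FASTQ records and that the first whitespace-separated field of each header is the read ID shared by both mates (the function name `check_pairs` is mine, not part of fastq-multx):

```shell
# check_pairs R1 R2: verify two FASTQ mate files agree in line count and read order.
check_pairs() {
  local r1=$1 r2=$2
  local n1 n2
  n1=$(wc -l < "$r1")
  n2=$(wc -l < "$r2")
  if [ "$n1" -ne "$n2" ]; then
    echo "line counts differ: $n1 vs $n2"
    return 1
  fi
  # Every 4th line starting at line 1 is a header; field 1 is the read ID,
  # which is identical for both mates (the "1:N:0:..." part is in field 2).
  if ! paste <(awk 'NR % 4 == 1 {print $1}' "$r1") \
             <(awk 'NR % 4 == 1 {print $1}' "$r2") \
       | awk '$1 != $2 {exit 1}'; then
    echo "read order mismatch"
    return 1
  fi
  echo "ok: $((n1 / 4)) read pairs"
}
# usage: check_pairs input_R1.fastq input_R2.fastq
```

If either check fails, fastq-multx's results can't be trusted, so it's worth running this before re-submitting a large job.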
Hi again,
Thank you for the super quick response.
I have 30 barcodes, and paired-end data. I am using the code below.
./fastq-multx -B barcodes_i7_fastq-multx.tsv -m 1 -H input_R1.fastq input_R2.fastq -o output/%_R1.fastq -o output/%_R2.fastq
Yes, I get plenty of output for each barcode. An example list of the output files is attached. I checked a few of them, and they looked fine to me.
One possible issue is that I concatenated the sequencing data. The sequencing company did the demultiplexing, but not properly, so I concatenated all the undetermined and demultiplexed data to do it myself. I will ask for the raw data and try again. Maybe something happened to the files during demultiplexing or concatenation.
I am trying 512 GB memory meanwhile.
Thank you again for your response. It was very helpful.
Best,
Yeserin.
Examples:
R1:
@A00605:504:H327YDSX5:2:2527:12391:12305 1:N:0:TGAGCTGT
ATAACAACACAAAAATACACTAAACAAAAAACAAACTATAAATACGGCTAAGCTACCGTATGCGTTGGTCATGTCAACCCACAGGACGCACACAAATTGCGGGAATAGAATTACATACATCAGATCACTGGAGGATAAACAAAATGTACGT
+
F,F,F:FF,,,FF:FFFF:FF::FFF:,FF,:,:,F,,:FF,F,:F,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,,FFFFFFFFFFFFFFFFFFFF:FFFFFFFFF:,FFFFFFFFFFFFFF:::FFF,
@A00605:504:H327YDSX5:2:1101:18927:3020 1:N:0:TGAGCTGT
GTAGGAGTCGCAAATAGTCTTACTTTTGTTCCTGAATAAGTTATGCAATCGGATAAATCATTTATG_TAAAATTAGAAATAGTAGGGGGCCTAGGATTGTTCCTTGTAGAACTATCTGAGTTTACAGTTGTATCCATACCAAAAATACCAAA
R2:
@A00605:504:H327YDSX5:2:2527:12391:12305 2:N:0:TGAGCTGT
ATAACCATAATTTGTGCCAATACGTACATTTTGTTTATCCTCCAGTGATCTGATGTATGTAATTCTATTCCCGCAATTTGTGTGCGTCCTGTGGGTTGACATGACCAACGCATACGGTAGCTTAGCCGGACTAACAGTGGCGTTTGTGTTA
+
:,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF,FFFFFFFFFFF,FFFFFFFFFFFF:FF
@A00605:504:H327YDSX5:2:1101:18927:3020 2:N:0:TGAGCTGT
CTAAAAACGCAAAATCATTTTGGTATTTTTGGTATGGATACAACTGTAAACTCAGATAGTTCTACAAGGAACAATCCTAGGCCCCCTACTATTTCTAATTTTACATAAATGATTTATCCGATTGCATAACTTATTCAGGAACAAAAGTAAG
I don't know what else you might be running in your analysis pipeline or how your cluster is set up, but needing in excess of 256G memory to run fastq-multx is not right. If you count the lines in all the output files (including unmatched), is the sum equal to the input? It could be that fastq-multx finished successfully and a subsequent step is consuming the memory. I would suggest putting an echo in the script after the fastq-multx step to see if it finishes. One shot in the dark you could try for a clue is to ssh to the node the job ran on, run dmesg, and look at its tail. 90% of the time there's nothing there associated with the job, but when there is, it can be very useful.
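The line-count comparison above can be sketched as a small bash helper. This assumes all demultiplexed files (including the unmatched ones) land in the output directory given to -o, as in the command earlier in this thread; the function name `check_totals` is mine:

```shell
# check_totals OUTDIR INPUT...: every input line should end up in exactly
# one output file (matched barcode or unmatched), so the totals must agree.
check_totals() {
  local outdir=$1
  shift
  local in_total out_total
  in_total=$(cat "$@" | wc -l)
  out_total=$(cat "$outdir"/*.fastq | wc -l)
  echo "input: $in_total lines, output: $out_total lines"
  [ "$in_total" -eq "$out_total" ]
}
# usage: check_totals output input_R1.fastq input_R2.fastq
```

A nonzero exit status means reads were lost, i.e. fastq-multx was likely killed mid-run.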
Oh wait. If you concatenated the data, you could have a line that joins the last line of one file with the first line of the next file. I always check for that when concatenating. That would definitely screw things up.
So check whether the input files end with a hard return...
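A quick way to check for that before concatenating: if a file's last byte is not a newline, `cat` will fuse its last line with the first line of the next file. A bash sketch (the function name `ends_with_newline` is mine):

```shell
# ends_with_newline FILE: succeed only if FILE is non-empty and its last
# byte is a newline. Command substitution strips a trailing newline, so
# the captured string is empty exactly when the last byte is a newline.
ends_with_newline() {
  [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}
# usage, with hypothetical per-lane file names:
#   for f in lane1_R1.fastq lane2_R1.fastq; do
#     ends_with_newline "$f" || echo "$f: missing final newline"
#   done
```

Any file flagged here needs a newline appended before it is safe to `cat` into a combined FASTQ.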
Hi again,
Thank you for your response and your thoughts.
I don't need to concatenate anymore. I will try again.
Best,
Yeserin.