Questions about Index Size and Short Mode

Question

Opened this issue 2 years ago · 0 comments

Hi @yfukasawa,
First of all, thank you for developing LongQC.
I am currently testing the tool to understand its parameters better and choose the optimal configuration. However, I have several questions about the Index Size and the Short Mode, because my test results are hard to interpret.
I used two public datasets for my tests: flnc.bam (PacBio, Transcriptomic, ~4 Gb) and pb.bam (PacBio, Genomic, ~12 Gb).
These are the results of my tests:

Test 1 - flnc.bam

Command (only the index size changes between runs):

longQC.py sampleqc -o /tmp/results -x pb-hifi -n 10000 -p 8 -m 2 -i 1G -t /data/input/flnc.bam
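For anyone reproducing this sweep, here is a minimal sketch of how the per-run commands could be generated. It only builds and prints the command lines and does not launch LongQC. The helper name `build_commands` is hypothetical, and every flag is copied verbatim from the command above, with only the value passed to `-i` varied.

```python
from typing import List

# Flags copied verbatim from the command above; nothing here is added or inferred.
BASE_ARGS = [
    "longQC.py", "sampleqc",
    "-o", "/tmp/results",
    "-x", "pb-hifi",
    "-n", "10000",
    "-p", "8",
    "-m", "2",
]
TAIL_ARGS = ["-t", "/data/input/flnc.bam"]


def build_commands(index_sizes: List[str]) -> List[List[str]]:
    """Return one argument list per index size (hypothetical helper)."""
    return [BASE_ARGS + ["-i", size] + TAIL_ARGS for size in index_sizes]


if __name__ == "__main__":
    # Print the commands for inspection instead of executing them.
    for argv in build_commands(["1G", "8G"]):
        print(" ".join(argv))
```

Keeping the commands as argument lists means they can later be passed to `subprocess.run` without shell quoting. Every generated command writes to the same `/tmp/results` path, as in the original command, so the output directory may need to be changed or cleared between runs.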

Results

Metrics table

CPU and memory use over time (plots for Index Size = 1G and Index Size = 8G)

Test 2 - pb.bam

Command (this time varying both the index size and the short mode):

longQC.py sampleqc -o /tmp/results -x pb-sequel -n 10000 -p 8 -m 2 -i 1G -b /data/input/pb.bam

Results

Metrics table

CPU and memory use over time (plots for Index Size = 1G and Index Size = 8G)
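For anyone who wants to collect comparable time and memory figures, the following is a minimal sketch and not the method used to produce the plots above. It assumes a Unix-like system, since it relies on Python's standard `resource` module, and the helper name `run_and_measure` is hypothetical. It reports wall-clock time and the peak resident set size of the largest single child process, which is coarser than a usage-over-time trace.

```python
import resource
import subprocess
import sys
import time
from typing import List, Tuple


def run_and_measure(argv: List[str]) -> Tuple[float, int]:
    """Run argv and return (wall-clock seconds, peak RSS of child processes).

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    It is the high-water mark of the largest single child waited for so far
    by this Python process, not the sum over parallel workers.
    """
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_rss


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; swap in the longQC.py call.
    elapsed, peak = run_and_measure([sys.executable, "-c", "pass"])
    print(f"wall time: {elapsed:.2f} s, peak RSS: {peak}")
```

Because the memory value is a running maximum for the calling process, each configuration is best measured from a fresh Python process so that an earlier, larger run does not mask a later, smaller one.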

Conclusions

The results vary significantly across index sizes.
However, the non-sense read fraction is quite similar across runs.
I have not found a linear correlation between index size and execution time.
Enabling the Short Mode tends to reduce the non-sense read fraction slightly.
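The point about linear correlation can be checked numerically with the Pearson coefficient. The sketch below is one plain-Python way to compute it and is not necessarily how the conclusion above was reached. The function name `pearson_r` is hypothetical, and no measured values are included, so the inputs must come from your own runs.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products and squares (un-normalized covariance/variance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

It would be called as `pearson_r(index_sizes_gb, runtimes_s)`, where both argument names are placeholders for the measured index sizes and execution times. With only two index sizes the coefficient is always +1 or -1, so at least three distinct sizes are needed for the value to be informative.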

Based on these results, my questions are the following:

Since bigger index sizes are much more costly in terms of memory and time, what are the advantages of selecting a big index size (e.g., 8G) compared with a smaller one (e.g., 1G)?
In your opinion, which index size should provide the most accurate results?
In which cases is it advisable to activate the short mode?

Thanks!
Adolfo