Long run time on small file.

Question

Closed this issue 7 months ago · 8 comments

Hi.

I am running FADU to get gene-count data for a bacterial organism, which fits the purpose of the program. There are 3 samples with 2 replicates each: 2 in lysogeny, 2 in soil condition, and 2 from in vivo infection in mice. I adopted a general RNA-Seq pipeline and created the BAM files along with their index files. The size of each file can be seen in the image below. I mapped back the reads of my organism of interest, hence the lysogeny and soil BAM files are bigger than the in vivo infection files. However, I am having issues with the infection dataset: it has been running for more than 37 days (and is still running), whereas the others took 5-7 days, except for one lysogeny file, which took roughly 20-25 days.

The commands for the runs are as follows:
julia --threads 12 fadu.jl -g "ncbi_annogesic_sorfcomplemented.alldataset.gff" -b "bpd286.infection/FCHNKJTBBXX_L3_HKRDBACujvPBABRAAPEI-infection_207_1.samtools.sorted.bycoordinates.bam" -o "bpd286.infection.fadu_10em.out" -s "no" -f "gene" -a "Name" --em_iterations 10

julia --threads 12 fadu.jl -g "ncbi_annogesic_sorfcomplemented.alldataset.gff" -b "bpd286.infection/FCHWJJHBBXX_combined_infection_1.samtools.sorted.bycoordinates.bam" -o "bpd286.infection.fadu_10em.out" -s "no" -f "gene" -a "Name" --em_iterations 10

Computational details:
I am running the latest FADU version in a conda environment on Ubuntu 22.04 with 128 GB RAM and 32 processors.
Despite setting 12 threads, only 1 is being used for each run when I check with htop.

Answer 1 · 2024-01-24T19:51:29.000Z

Hello @EthanKhew,

Are you able to check the step that your infection dataset run is stalling on? While I believe the multithreading works on the "compute and process feature overlaps" step, it currently does not work on the EM-iterations step. Also, the time it takes to run is typically tied to the number of feature overlaps (BAM record to GFF feature), and for the EM-iteration step, the number of overlaps involving multimapped reads (NH tag in BAM record present).
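One way to gauge the second factor before committing to a long run is to count how many alignment records carry an NH tag. The sketch below is not part of FADU: `nh_tag_summary` and `_demo_record` are hypothetical helpers, and the input is assumed to be SAM-formatted text lines (for example, the output of `samtools view`). Since the thread does not spell out whether the criterion is the presence of NH or NH > 1, the sketch reports both.

```python
def nh_tag_summary(sam_lines):
    """Count alignment records, records with an NH tag, and records with NH > 1."""
    total = with_nh = nh_gt1 = 0
    for line in sam_lines:
        line = line.rstrip("\n")
        # Skip blank lines and SAM header lines (headers start with "@").
        if not line or line.startswith("@"):
            continue
        total += 1
        # Optional tags start at the 12th tab-separated column.
        for tag in line.split("\t")[11:]:
            if tag.startswith("NH:i:"):
                with_nh += 1
                if int(tag[5:]) > 1:
                    nh_gt1 += 1
                break
    return {"records": total, "with_nh": with_nh, "nh_gt1": nh_gt1}


def _demo_record(name, *tags):
    """Build a minimal SAM line for illustration only (made-up values)."""
    core = [name, "0", "chr1", "100", "60", "50M", "*", "0", "0", "*", "*"]
    return "\t".join(core + list(tags))


demo = [
    "@HD\tVN:1.6",                              # header line, ignored
    _demo_record("r1", "NH:i:1"),               # uniquely mapped
    _demo_record("r2", "NM:i:0", "NH:i:4"),     # reported at 4 locations
    _demo_record("r3"),                         # no NH tag at all
]
summary = nh_tag_summary(demo)
print(summary)
```

On a real file, the same function could be given `sys.stdin` with `samtools view` output piped in, so the BAM never has to be loaded into memory at once.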

Answer 2 · 2024-01-29T16:27:36.000Z

I was chatting with the PI on this project, and we are seeing a similar runtime issue on some recent datasets. In our particular case, the PI thinks that ORFs overlapping regions of high rRNA concentration are causing problems with her current runs, though she says this may not be relevant to yours. However, she suggested that we should have an option for supplying the coordinates of regions to exclude.

In any case, I am going to look into this.
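To make the suggested exclusion option concrete, here is a minimal sketch of how an alignment could be tested against user-supplied regions. Nothing here is an existing FADU option: `overlaps_excluded`, the `excluded` mapping, and the contig name and coordinates are all hypothetical, and coordinates are assumed to be 1-based and inclusive as in GFF.

```python
def overlaps_excluded(contig, start, end, excluded):
    """Return True if [start, end] on `contig` overlaps any excluded region.

    `excluded` maps a contig name to a list of (start, end) tuples.
    Two closed intervals overlap when each starts before the other ends.
    """
    return any(
        start <= ex_end and ex_start <= end
        for ex_start, ex_end in excluded.get(contig, [])
    )


# Hypothetical rRNA-dense region to skip (made-up coordinates).
excluded = {"chr1": [(5000, 6500)]}

# Alignments as (contig, start, end); only those outside the region are kept.
alignments = [("chr1", 4900, 5000), ("chr1", 6501, 6600), ("chr2", 5100, 5200)]
kept = [a for a in alignments if not overlaps_excluded(*a, excluded)]
print(kept)
```

A linear scan per alignment is fine for a handful of excluded regions; a long exclusion list would call for a sorted or indexed structure instead.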

Answer 3 · 2024-01-30T05:34:51.000Z

Hi @adkinsrs

To answer your first question, it is stuck at the "Now finding overlaps between alignment and annotation records..." step. I've tried with both 1 and 10 EM iterations, but both runs still take more than 30 days.

Regarding the second point, I agree, and I don't think the rRNA elements are causing the bottleneck, since other runs with bigger files completed in 5-7 days with 10 EM iterations. I've checked my BAM files and they don't seem to be faulty either.

I will upload the files used to run my analysis for you in a bit.

Answer 4 · 2024-01-30T12:27:08.000Z

Thanks for the info.

By the way, the "finding overlaps" step occurs upstream of the EM-iterations step, so its runtime is completely independent of the number of iterations.

Answer 5 · 2024-02-01T07:19:34.000Z

Yes, I am aware of that. Still, it would be interesting to look into the issue that's causing such a long run at the overlaps finding step.

Answer 6 · 2024-02-01T21:04:32.000Z

I just came across this issue post from 2021 (BioJulia/XAM.jl#44 (comment)) saying that multithreading is not quite working in XAM.jl (the BAM/SAM reader). A while back, FADU was using BioAlignments.jl, which had thread support, but the SAM/BAM functionality moved from there to XAM.jl, which inadvertently broke the multithreading capability in FADU. This explains why the --threads option does not seem to work as intended. It definitely needs to be resolved. I'm going to try to contact the XAM.jl authors (or the right BioJulia people) to see whether any progress has been made on this or can be made.

By the way, this thread might end up having some notes to myself that are relevant to the overarching topic at hand, as I discover things.

Answer 7 · 2024-02-02T07:01:33.000Z

Ah yesss. I came across this post last year too, but I thought it had been resolved. If that's the case, then I will conclude that the issue is due to a bug in one of the packages and proceed to close this issue. Thanks for the clarification @adkinsrs

Answer 8 · 2024-02-13T05:05:50.000Z

@EthanKhew I pushed some updates to the "master" branch that I believe substantially speed up FADU. I observed that when a feature has hundreds of thousands of overlaps, the processing slowdown is noticeable. I took a run that was being reported as having no progress after a month (on our grid), and it ran on my MacBook in ~90 minutes. So feel free to pull the latest code if you want to try it, or wait for a tagged release, which I plan to make in a day or so.
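For anyone wanting to check whether their own annotation contains such heavily covered features, the sketch below counts overlaps per feature on a single contig. It is not FADU's algorithm and ignores strand; `overlap_counts` and the feature names and coordinates are made up for illustration. It relies on the fact that the alignments overlapping a feature are those that start at or before the feature's end, minus those that end before the feature's start.

```python
from bisect import bisect_left, bisect_right


def overlap_counts(features, alignments):
    """Count alignments overlapping each feature (closed intervals, one contig).

    features:   iterable of (name, start, end)
    alignments: iterable of (start, end)
    """
    alignments = list(alignments)
    starts = sorted(s for s, _ in alignments)
    ends = sorted(e for _, e in alignments)
    counts = {}
    for name, f_start, f_end in features:
        begun = bisect_right(starts, f_end)    # alignments starting at or before feature end
        finished = bisect_left(ends, f_start)  # alignments ending before feature start
        counts[name] = begun - finished
    return counts


# Made-up example: one feature under a pile of reads, one sparsely covered.
features = [("rrsA", 100, 200), ("geneB", 500, 600)]
alignments = [(90, 110), (150, 160), (200, 250), (201, 300), (10, 99), (550, 560)]
counts = overlap_counts(features, alignments)
print(counts)
```

Sorting the per-feature counts in descending order would surface the features most likely to dominate the "finding overlaps" step.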