merenlab/anvio

[FEATURE REQUEST] Profiling of secondary mapping and read with no sequence in bam files

Opened this issue · 17 comments

The need

I was profiling a bam file when I noticed that a massive amount of my reads were not used:

WARNING
====================================================
There were 553 reads present in the BAM file that did not end up being used by
the profiler, which corresponds to about 30.317982456140346% of all reads. Since
you don't seem to have used parameters such as `--contigs-of-interest` or
`--fetch-filter` anvi'o is Jon Snow here. There are other reasons why some of
your reads may end up not being considered by the profiler. For instance, if
pysam encounteres reads that it can't deal with (i.e., mapped reads in the BAM
file with no defined sequences, or read with defined sequences without mapping).
Another reason can be that you have used a new paramter in the profiler that
removes contigs or reads from consideration. Regardless, if you think this is
worth your attention, now you know.

That number of reads corresponded to reads with the secondary mapping flag and minimap2 does not report the sequence for secondary mapping by default (to save some space I assume). BUT I WANT MY SECONDARY MAPPING TO BE IN MY ANVI'O PROFILE.

The solution

First, I found this in bamops.py:

    def _fetch_iterator(self, bam, contig_name, start, end):
        """Uses standard pysam fetch iterator from AlignmentFile objects, ignores unmapped reads"""

        for read in bam.fetch_only(contig_name, start, end):
            if read.cigartuples is None or read.query_sequence is None:
                # This read either has no associated cigar string or no query sequence. If cigar
                # string is None, this means it did not align but is in the BAM file anyways, or the
                # mapping software decided it did not want to include a cigar string for this read.
                # The origin of why query sequence would be None is unkown.
                continue

            yield read

And removing the check for read.query_sequence was not a problem for the profiler. But I don't know if there are unforeseen consequences for removing that condition: does snv calling need the read sequence? (apparently not since it just worked, but some expert can comment on it).

The solution could be to profile everything in the bam that is given and does take a decision on behalf of the user, and then we would have a fetch filter for only primary to skip the secondary mapping.

Beneficiaries

Often, you probably don't want to keep secondary mapping, i.e. reads mapping a second (or more) time with a alignment quality as good or within a threshold of the primary mapping. They are going to inflate coverage with potentially non-specific read recruitment in the case where the secondary mapping has a lower alignment score to the primary. BUT, the user should be able to choose to keep them or not. They may not realize that there bam contains such secondary mapping of various quality to the primary alignment (looking at you minimap2 - default threshold is 80% of primary alignment score), but they should be able to keep them or not.

Well, the solution cannot simply be to remove the above condition about query_sequence. Rather we should use pysam to retrieve the read's sequence from the primary alignment.

I made a test case with a short contig that I duplicated and introduced some minor changes (to get SNVs and INDEL). Reads should map multiple times and we will have secondary mapping in the resulting bam.

The test case consists of a CONTIGS.db and the mapping.bam (and bam.bai). Mapping done with minimap2 default parameters.
secondary_mapping.tar.gz

meren commented

Florian, based on how you designed the test dataset, is this what you would expect to see for coverage values in anvi'o default:

image
image

meren commented

Well. I found a way to track primary alignments (it is very ugly and memory intensive), but when I do that, this is what I see from the profiling results (you will notice the change if you look very carefully):

image

image

What do you see in IGV for this?

meren commented

Just for posterity, this is what does this:

diff --git a/anvio/bamops.py b/anvio/bamops.py
index d70d046ae..39de3ac64 100644
--- a/anvio/bamops.py
+++ b/anvio/bamops.py
@@ -225,6 +225,13 @@ class BAMFileObject(pysam.AlignmentFile):

         pysam.AlignmentFile.__init__(self)

+        # Fetch sequences for all primary alignments in the BAM file
+        self.primary_alignments = {}
+        for read in self.fetch():
+            # If this read is a primary alignment, store it in the dictionary
+            if not read.is_secondary:
+                self.primary_alignments[read.query_name] = read.query_sequence
+

     def fetch_only(self, contig_name, start=None, end=None, *args, **kwargs):
         """A wrapper function for `bam.fetch()`.
@@ -249,6 +256,9 @@ class BAMFileObject(pysam.AlignmentFile):
         """

         for read in self.fetch_only(contig_name, start, end, *args, **kwargs):
+            if read.is_secondary:
+                read.query_sequence = self.primary_alignments[read.query_name]
+
             if read.is_unmapped or read.cigartuples is None or read.query_sequence is None:
                 # This read either has no associated cigar string or no query sequence. If cigar
                 # string is None, this means it did not align but is in the BAM file anyways, or the

I can see that is the only solution and it will be very intensive. Especially for deeply sequenced short-reads (for which this is useless, we don't allow for secondary in short-reads mapping, usually).

Your solution worked for the SNV ( I don't see the INDEL though, should be the dip around 13kbp in the second contig).
But it didn't work for the coverage, still looks like only the primary.

With the secondary, you should expect a 100x mean coverage for both contigs, here in IGV:
image
image
(gray reads are read with only a primary mapping; white one are primary mapping but read was also mapped somewhere else; and black are secondary mapping)(is it actually interesting to see any grey read, all primary have a secondary in this test, but not all primary are flagged as having a secondary)

Wild guess: should we modify fetch_only and not fetch_and_trim?

Also, functions likes get_short_reads_dict will contains duplicated reads if secondary mapping was present during the profiling. Will impact anvi-get-short-reads* type of command line.

meren commented

We can add a new flag to anvi-profile and other relevant programs to --track-secondary-alignments, and only then engage all the intensive stuff. But I still don't get why secondary alignments didn't go into the right place in when we could map them with my temporary solution :/

meren commented

Never mind. It needed just a little more work. Here it is with my latest changes:

image image
meren commented

There is a memory happy way to solve this, and I will do that along with a flag. but now I want to commit these changes and see HOW MUCH LONGER do they take as you try to re-run the profiler for your figures, @FlorianTrigodet?

meren commented

Because the memory happy way will take even longer than that.

I don't see the INDEL in contig 2 ~13kbp, any idea why?
Otherwise yes, commit these changes and I'll run my profiles and give any feedbacks.

meren commented

I've been trying to figure that out, but I think it will require me much more time than what I have at the moment. At least COVs / SNVs / SCVs / SAAVs will work.

I will start a branch and will ping you.

There is an issue when I run it with a large bam file:

Number of reads in the BAM file ...................: 2,047,324
Number of sequences in the contigs DB .............: 700
Number of contigs to be conisdered (after -M) .....: 699
Number of splits ..................................: 3,688
Number of nucleotides .............................: 74,239,530

✖ anvi-profile encountered an error after 0:04:02.911939
Traceback (most recent call last):
  File "/user/patz5242/github/anvio/bin/anvi-profile", line 130, in <module>
    main(args)
  File "/user/patz5242/github/anvio/anvio/terminal.py", line 926, in wrapper
    program_method(*args, **kwargs)
  File "/user/patz5242/github/anvio/bin/anvi-profile", line 38, in main
    profiler.BAMProfiler(args)._run()
  File "/user/patz5242/github/anvio/anvio/profiler.py", line 591, in _run
    self.profile_multi_thread()
  File "/user/patz5242/github/anvio/anvio/profiler.py", line 1241, in profile_multi_thread
    raise worker_error
  File "/user/patz5242/github/anvio/anvio/profiler.py", line 1200, in profile_multi_thread
    raise contig
ValueError: cannot assign slice from input of different size

real	4m52.695s
user	23m5.452s
sys	2m42.201s
slurmstepd: error: Detected 15 oom_kill events in StepId=6456947.batch. Some of the step tasks have been OOM Killed.

And here is the memory usage for that job:

State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 20
CPU Utilized: 00:25:48
CPU Efficiency: 26.42% of 01:37:40 core-walltime
Job Wall-clock time: 00:04:53
Memory Utilized: 468.19 GB
Memory Efficiency: 398.52% of 117.48 GB

While I COULD increase the memory, I don't think that's the solution here.

meren commented

I didn't have a chance to work on my solution to make it memory friendly yet. It will be there hopefully today perhaps.

meren commented

OK. A new PR at #2371 solves the memory issues. But it comes with the following note:

In its current state, the PR is partially complete as there is a known bug in it that in some cases leads to the following error when profiling SNVs:

Traceback (most recent call last):
  File "/Users/meren/github/anvio/bin/anvi-profile", line 130, in <module>
    main(args)
  File "/Users/meren/github/anvio/anvio/terminal.py", line 926, in wrapper
    program_method(*args, **kwargs)
  File "/Users/meren/github/anvio/bin/anvi-profile", line 38, in main
    profiler.BAMProfiler(args)._run()
  File "/Users/meren/github/anvio/anvio/profiler.py", line 593, in _run
    self.profile_single_thread()
  File "/Users/meren/github/anvio/anvio/profiler.py", line 1066, in profile_single_thread
    contig = self.process_contig(bam_file, contig_name, contig_length)
  File "/Users/meren/github/anvio/anvio/profiler.py", line 1019, in process_contig
    split.auxiliary.process(bam_file)
  File "/Users/meren/github/anvio/anvio/contigops.py", line 205, in process
    self.run_SNVs_and_indels(bam)
  File "/Users/meren/github/anvio/anvio/contigops.py", line 485, in run_SNVs_and_indels
    aligned_sequence_as_ord, reference_positions = read.get_aligned_sequence_and_reference_positions()
  File "/Users/meren/github/anvio/anvio/bamops.py", line 557, in get_aligned_sequence_and_reference_positions
    return _get_aligned_sequence_and_reference_positions(
ValueError: cannot assign slice from input of different size

Solution for this requires insights into whether we should carry over more information into the read instance for the secondary alignment for the primary alignment when we update the 'sequence' information at these locations:

        (...)
        for read in self.fetch(contig_name, start, end):
            if self.secondary_alignments_tracked and read.is_secondary:
                read.query_sequence = self.primary_alignments[read.query_name]
        (...)

We update query sequence, but probably we need to update one or more things to address the SNVs error. Until then, using the flag --skip-SNV-profiling along with --track-secondary-alignments is the only option to avoid catastrophic failures during the profiling of secondary alignments.