suhrig/arriba

Error in merging adjacent breakpoints?

karlestira opened this issue · 2 comments

Hello.

I run arriba in my sample and found it gives me results like this(I turned off the "select_best" filter because I need to know all possible fusion in a pair of gene, and the min support read is changed to 1. The cmd is like " -S 1 -f select_best". ):

#gene1 gene2 strand1(gene/fusion) strand2(gene/fusion) breakpoint1 breakpoint2 site1 site2 type split_reads1 split_reads2 discordant_mates
SQSTM1 ALK +/+ -/- chr5:179252226 chr2:29446394 CDS/splice-site CDS/splice-site translocation 9 25 28
SQSTM1 ALK +/+ -/- chr5:179253291 chr2:29446394 intron CDS/splice-site translocation 2 9 29
SQSTM1 ALK +/+ -/- chr5:179252225 chr2:29446394 CDS/splice-site CDS/splice-site translocation 3 1 28
SQSTM1 ALK +/+ -/- chr5:179252225 chr2:29446392 CDS/splice-site CDS/splice-site translocation 2 0 24
SQSTM1 ALK +/+ -/- chr5:179252226 chr2:29446392 CDS/splice-site CDS/splice-site translocation 0 1 24
SQSTM1 ALK +/+ -/- chr5:179251323 chr2:29446394 CDS/splice-site CDS/splice-site translocation 0 1 5

In this result, line 1, 3, 4, 5 is close to each other within 5bp, however the merge_adjacent filter cannot merge them correctly.

I check the code and found some confusing things in merge_adjacent_fusions.cpp:line 48 and line 65. It says:
((**previous_fusion).breakpoint2 == (**fusion).breakpoint2 + ((**fusion).breakpoint1 - (**previous_fusion).breakpoint1) * (((**fusion).direction1 == (**fusion).direction2) ? +1 : -1)

But what hell is this mean? Bkps will be merged only when distance between bkp2 and bkp1 are equal?
In STAR's result, this bkp's random shift within 1-5bp is very common, caused by seqence error, mismatches or so on, but I think we should not require the distance between bkp1 and bkp2 to be equally when merging adjacent breakpoints.
I also use STAR-FUSION to call fusions in the same bam file, and it says these 4 fusions should be merged together. However, STAR-FUSION cannot tell me which transcipt the fusion at, so I must use arriba.

I add follow code into this if:
// fusion in same direction and bkp1 close to each other and bkp2 close to each other (((**fusion).direction1 == (**fusion).direction2 && abs((**fusion).breakpoint2 - (**previous_fusion).breakpoint2) <= max_distance && abs((**fusion).breakpoint1 - (**previous_fusion).breakpoint1) <= max_distance) ||
// fusion in diff direction and this bkp1 close to that bkp2, this bkp2 close to that bkp1 ((**fusion).direction1 != (**fusion).direction2 && abs((**fusion).breakpoint2 - (**previous_fusion).breakpoint1) <= max_distance && abs((**fusion).breakpoint1 - (**previous_fusion).breakpoint2) <= max_distance)
And then run arriba in the same bam file, it gives result like this:

#gene1 gene2 strand1(gene/fusion) strand2(gene/fusion) breakpoint1 breakpoint2 site1 site2 type split_reads1 split_reads2 discordant_mates
SQSTM1 ALK +/+ -/- chr5:179252226 chr2:29446394 CDS/splice-site CDS/splice-site translocation 14 27 28
SQSTM1 ALK +/+ -/- chr5:179253291 chr2:29446394 intron CDS/splice-site translocation 2 9 29
SQSTM1 ALK +/+ -/- chr5:179251323 chr2:29446394 CDS/splice-site CDS/splice-site translocation 0 1 5

Line 1 is the merge result of the old line 1, 3, 4, 5, it becomes correctly.

However, I think I have not understand the the principle of code(is this change correctly?), so I need to report this question.

suhrig commented

But what hell is this mean? Bkps will be merged only when distance between bkp2 and bkp1 are equal?

It means that breakpoints will only be merged if they are shifted by the same number of bases. For example, let's say we have two pairs of breakpoints:

breakpoint1: chr1:1000, breakpoint2: chr2:1000
breakpoint1: chr1:1002, breakpoint2: chr2:998

In this case, the breakpoints will be merged, because the two extra bases that are aligned at breakpoint1 of the second breakpoint pair are subtracted from breakpoint2. This is a frequent occurrence, because when breakpoints have sequence homology multiple alternative alignments are possible all of which have the same alignment score. For example, when fusion breakpoints have two bases of sequence homology, the aligner (STAR) can place the two bases on either end of the breakpoints and both will have the same alignment score. In practice, STAR will choose them at random, giving rise to 50:50 ratio of one or the other alignment.

In contrast, the following example will not be merged:

breakpoint1: chr1:1000, breakpoint2: chr2:1000
breakpoint1: chr1:1002, breakpoint2: chr2:1000

This is because they are not equally scoring alternative alignments. These alignments must arise from different sequences and are not considered to originate from the same event by Arriba, hence.

In the past, the code used to be written as you suggest - merging anything within a given distance - regardless of whether the breakpoints arise from alternative alignments or whether they are unrelated alignments. In my experience, this often led to breakpoints being merged which do not belong to each other, which in turn led to a higher false positive rate, because the read numbers were inflated. This is why I changed the merging procedure to be as stringent as it is now.

Admittedly, this will occasionally lead to breakpoints not being merged even though they should as in your case. If you have the time, can you kindly extract the reads overlapping your fusion breakpoints and send them to me? I can gladly take a look how STAR chose to align these reads and why they are not alternative alignments. My guess is there are sequencing errors right at the fusion junction.

is this change correctly?

I think your adaptation is more complex than it has to be. I haven't tested this, but it should suffice to only compare breakpoint1 to breakpoint1 and breakpoint2 to breakpoint2 - regardless of the fusion direction, i.e.:

abs((**fusion).breakpoint2 - (**previous_fusion).breakpoint2) <= max_distance && abs((**fusion).breakpoint1 - (**previous_fusion).breakpoint1) <= max_distance)

breakpoint1 always contains the smaller coordinate and breakpoint2 always contains the bigger coordinate.

OK, thanks, but my demander says fusions too close to each other always need to be merged(to be more precise as they said, "filtered"). They think "the code used to be" is right.

My sequencing data was captured using specific probes and has UMI, resulting in a very high dup rate. And the markdup program is specially designed to meet some requirements(such as making consensus sequence for PCR duplications, very important for mutation recognition). All these things may lead to some additional problems and performed differently from conventional RNA sequencing data.

Anyway, thank you for your answer, maybe I need to merge STAR-FUSION's and arriba's result.

I will close this issues.