arq5x/lumpy-sv

only use discordant reads

Yiguan opened this issue · 2 comments

Hi,

When a region has many discordant reads but only a few split reads, I find that the inferred breakpoints rely too heavily on the split reads.
For example,

3L      500033  2397    N       <INV>   550.60  .      SVTYPE=INV;SVLEN=1343;END=501376;STRANDS=++:38,--:30;IMPRECISE;CIPOS=-7,9;CIEND=-10,9;CIPOS95=-1,5;CIEND95=-3,1;SU=68;PE=67;SR=1

The region has 67 discordant reads and 1 split read (PE=67, SR=1). The inferred breakpoint at 500033 relies heavily on that single split read, but I suspect the split read is a mapping error or misalignment.

Is there a way to exclude the split reads and use only the discordant reads to infer inversions? Alternatively, can I set a threshold for the minimum number of split reads, e.g. use the split reads only when there are more than 5 of them?

Thanks,
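To make the thresholding idea above concrete, here is a minimal post-hoc sketch in Python. It does not change how LUMPY places breakpoints; it only flags records in the output VCF where a small number of split reads sits alongside paired-end evidence. The function names `parse_info` and `split_read_dominated` and the `min_sr` parameter are hypothetical, chosen for illustration.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flag-only keys (e.g. IMPRECISE) map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def split_read_dominated(info, min_sr=5):
    """Return True when a call has some split reads, but fewer than min_sr,
    next to paired-end support. min_sr is an illustrative threshold."""
    fields = parse_info(info)
    sr = int(fields.get("SR", 0))
    pe = int(fields.get("PE", 0))
    return 0 < sr < min_sr and pe > 0


# INFO column of the record quoted above
info = ("SVTYPE=INV;SVLEN=1343;END=501376;STRANDS=++:38,--:30;IMPRECISE;"
        "CIPOS=-7,9;CIEND=-10,9;CIPOS95=-1,5;CIEND95=-3,1;SU=68;PE=67;SR=1")
print(split_read_dominated(info))  # True: 1 split read next to 67 discordant reads
```

Records flagged this way could then be inspected by hand or re-called without the split-read input.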

The mapping quality of the split read is 60 and its CIGAR string is "49H41M".

Here is the IGV screenshot for the read (viewed as pairs):

https://github.com/Yiguan/miscellaneous/blob/main/aa.png

The screenshot shows four tracks from three samples:

sample_95 discordant track
sample_95 split track
sample_88 discordant track
sample_62 discordant track

 sample_95  3L      500033  2397    N       <INV>   .       .       SVTYPE=INV;STRANDS=++:38,--:30;SVLEN=1343;END=501376;CIPOS=-7,9;CIEND=-10,9;CIPOS95=-1,5;CIEND95=-3,1;IMPRECISE;SU=68;PE=67;SR=1        GT:SU:PE:SR     ./.:68:67:1

 sample_88 3L      500334  2221    N       <INV>   .       .       SVTYPE=INV;STRANDS=++:23,--:40;SVLEN=771;END=501105;CIPOS=-174,9;CIEND=-40,9;CIPOS95=-122,0;CIEND95=-30,0;IMPRECISE;SU=63;PE=63;SR=0    GT:SU:PE:SR     ./.:63:63:0

 sample_62  3L      500306  2541    N       <INV>   .       .       SVTYPE=INV;STRANDS=++:21,--:22;SVLEN=838;END=501144;CIPOS=-178,9;CIEND=-143,11;CIPOS95=-126,0;CIEND95=-128,2;IMPRECISE;SU=43;PE=43;SR=0 GT:SU:PE:SR     ./.:43:43:0

The first breakpoint of the inversion in sample_95 was inferred in "region1" (in the screenshot), whereas in sample_88 and sample_62 the first breakpoint was in "region2".
As the three samples are siblings from the same family, I believe they should carry the same inversion, so "region1" is unlikely to be a true breakpoint.
I am confused about why sample_95 has a breakpoint in "region1". I suspect it is caused by the split read (second track in the screenshot), since that read falls within "region1".
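The disagreement between the siblings can be checked numerically from the three records above by turning each POS and CIPOS into an absolute interval and testing for overlap. This is only a rough sketch: the helper names `cipos_interval` and `overlaps` are hypothetical, and interval overlap is used here as a simple consistency check, not as LUMPY's own merging logic.

```python
def cipos_interval(pos, cipos):
    """Convert POS plus a 'lo,hi' CIPOS string into an absolute (start, end) interval."""
    lo, hi = (int(x) for x in cipos.split(","))
    return pos + lo, pos + hi


def overlaps(a, b):
    """True when two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


# POS and CIPOS copied from the three per-sample records above
calls = {
    "sample_95": cipos_interval(500033, "-7,9"),
    "sample_88": cipos_interval(500334, "-174,9"),
    "sample_62": cipos_interval(500306, "-178,9"),
}

for name, interval in calls.items():
    print(name, interval)

print(overlaps(calls["sample_88"], calls["sample_62"]))  # True
print(overlaps(calls["sample_95"], calls["sample_88"]))  # False
print(overlaps(calls["sample_95"], calls["sample_62"]))  # False
```

The intervals for sample_88 and sample_62 overlap each other, while the much narrower interval for sample_95 (the only call with a split read) overlaps neither.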