suhrig/arriba

Arriba and STARfuison align

Closed this issue · 7 comments

Hi,
The picture below illustrates the differences between the alignment results of Arriba (v2.4.0, STAR2.7.8a) and Starfusion (v1.10.0, STAR2.7.8a). In Starfusion's BAM file, there are numerous softclip reads that support the TIMM23--LINC00843 fusion events (chr10:51606988--chr10:51732772). It appears that the parameter settings used by Arriba's STAR alignment discarded many of these reads.
Could you provide me with some suggestions?

arriba code(use the raw run_arriba.sh script)

arriba_v2.4.0//run_arriba.sh ~/arriba_v2.4.0//database//hg19_GENCODE37lift37/STAR_index_hg19_GENCODE37lift37/ ~/arriba_v2.4.0//database//hg19_GENCODE37lift37/GENCODE37lift37.gtf ~/arriba_v2.4.0//database//hg19_GENCODE37lift37/hg19.fa ~/arriba_v2.4.0//database//blacklist_hg19_hs37d5_GRCh37_v2.4.0.tsv ~/database//known_fusions_hg19_hs37d5_GRCh37_v2.4.0.tsv ~/arriba_v2.4.0//database//protein_domains_hg19_hs37d5_GRCh37_v2.4.0.gff3 6 ~/trim/A2_1.trim.fastq.gz ~/trim/A2_2.trim.fastq.gz

STARfusion code

~/STAR-Fusion-v1.10.0/STAR-Fusion --left_fq ~/trim/A2_1.trim.fastq.gz --right_fq ~/trim/A2_2.trim.fastq.gz --genome_lib_dir ~/hg19_GENCODE37lift37/STARFusion/GRCh37_gencode_v19_CTAT_lib_Mar012021.source/ctat_genome_lib_build_dir --CPU 6 --STAR_PATH ~/STAR-2.7.8a/bin/Linux_x86_64_static/STAR --examine_coding_effect --tmpdir  ~/tmp/ --output_dir STARfusion

image

Best,
xiucz

Hi, could you kindly pick one read which is aligned by STAR-Fusion but not by Arriba and extract the paired-end reads from the Arriba BAM file using for example samtools view arriba_alignments.bam | grep -w READ_NAME? I'm curious if the reads are aligned at all. Moreover, if I have the read sequence, then I can try to figure out which parameters are responsible. Thanks!

Thanks for your reply. I grep one read here.
arriba

$ samtools view Aligned.sortedByCoord.out.bam|grep  V350121616L4C005R0660558367
V350121616L4C005R0660558367	99	chr10	51381825	255	22M2452N59M3295N69M	=	51387704	351037	GGTCCAAACCAGGAAATGTACAGATTTTGAATATGGTGACTAGGCAAGGGGCACTTTGGGCTAATACTCTAGGTTCTCTGGCGTTGCTCTATAGTGCATTTGGTGTCATCATTGAGAAAACACGAGGTGCAGAAGATGACCTTAACACAG	FFFFGGFF>FFF<FFFGFFBFDFFEFFFFFGG??FFFFGFFGFFFGFFDDFDFFFGFGFFFFGFGFFFGFCFFFGFG<GGFFEFFEFFGFAGFGFFFFGFGFFFGFFCFFDFFFFFFFFFFFFFEGFFFFGFFFFA=FF?FGGGFFFFGF	NH:i:1	HI:i:1	AS:i:295	nM:i:2	NM:i:2
V350121616L4C005R0660558367	147	chr10	51387704	255	60M345008N90M	=	51381825	-351037	AAGATGACCTTAACACAGTAGCAGCTGGAACCATGACAGGCATGTTGTATAAATGTACAGTGTCAGAGATGGCTTTGGATTCCCCGTTCTGTGTGCTGCTGTCTGGCTCCTGAACCCAGCTGTAGAGGTGTGTGTCAATCCCAACTGGTG	9@GGFGGFFBAFFFGFFEAFDEFFGGGGCFGCGGCFFFGFEFFFFFGGGCGGBGGEFEGFGG7FGFCGGDFFGDFEGFGGF<GFFFG>FFGGEEGGFGFFGEGFFFGFFGGGFFGFCGFGGGBGGGGGEFFGGGAFFFFFDGFFGGFEGF	NH:i:1	HI:i:1	AS:i:295	nM:i:2	NM:i:0

STARfusion

 $ samtools view Aligned.out.sort.bam |grep  V350121616L4C005R0660558367
V350121616L4C005R0660558367	355	chr10	51381825	3	22M2452N59M3295N69M	=	51387704	5939	GGTCCAAACCAGGAAATGTACAGATTTTGAATATGGTGACTAGGCAAGGGGCACTTTGGGCTAATACTCTAGGTTCTCTGGCGTTGCTCTATAGTGCATTTGGTGTCATCATTGAGAAAACACGAGGTGCAGAAGATGACCTTAACACAG	FFFFGGFF>FFF<FFFGFFBFDFFEFFFFFGG??FFFFGFFGFFFGFFDDFDFFFGFGFFFFGFGFFFGFCFFFGFG<GGFFEFFEFFGFAGFGFFFFGFGFFFGFFCFFDFFFFFFFFFFFFFEGFFFFGFFFFA=FF?FGGGFFFFGF	NH:i:2	HI:i:2	AS:i:207	nM:i:2	RG:Z:GRPundef	XS:A:+
V350121616L4C005R0660558367	403	chr10	51387704	3	60M90S	=	51381825	-5939	AAGATGACCTTAACACAGTAGCAGCTGGAACCATGACAGGCATGTTGTATAAATGTACAGTGTCAGAGATGGCTTTGGATTCCCCGTTCTGTGTGCTGCTGTCTGGCTCCTGAACCCAGCTGTAGAGGTGTGTGTCAATCCCAACTGGTG	9@GGFGGFFBAFFFGFFEAFDEFFGGGGCFGCGGCFFFGFEFFFFFGGGCGGBGGEFEGFGG7FGFCGGDFFGDFEGFGGF<GFFFG>FFGGEEGGFGFFGEGFFFGFFGGGFFGFCGFGGGBGGGGGEFFGGGAFFFFFDGFFGGFEGF	NH:i:2	HI:i:2	AS:i:207	nM:i:2	RG:Z:GRPundef	XS:A:+
V350121616L4C005R0660558367	163	chr10	51606988	3	90S60M	=	51607030	5938	CACCAGTTGGGATTGACACACACCTCTACAGCTGGGTTCAGGAGCCAGACAGCAGCACACAGAACGGGGAATCCAAAGCCATCTCTGACACTGTACATTTATACAACATGCCTGTCATGGTTCCAGCTGCTACTGTGTTAAGGTCATCTT	FGEFGGFFGDFFFFFAGGGFFEGGGGGBGGGFGCFGFFGGGFFGFFFGEGFFGFGGEEGGFF>GFFFG<FGGFGEFDGFFDGGCFGF7GGFGEFEGGBGGCGGGFFFFFEFGFFFCGGCGFCGGGGFFEDFAEFFGFFFABFFGGFGG@9	NH:i:2	HI:i:1	AS:i:207	nM:i:2	RG:Z:GRPundef	XS:A:-
V350121616L4C005R0660558367	83	chr10	51607030	3	69M3294N59M2452N22M	=	51606988	-5938	CTGTGTTAAGGTCATCTTCTGCACCTCGTGTTTTCTCAATGATGACACCAAATGCACTATAGAGCAACGCCAGAGAACCTAGAGTATTAGCCCAAAGTGCCCCTTGCCTAGTCACCATATTCAAAATCTGTACATTTCCTGGTTTGGACC	FGFFFFGGGF?FF=AFFFFGFFFFGEFFFFFFFFFFFFFDFFCFFGFFFGFGFFFFGFGAFGFFEFFEFFGG<GFGFFFCFGFFFGFGFFFFGFGFFFDFDDFFGFFFGFFGFFFF??GGFFFFFEFFDFBFFGFFF<FFF>FFGGFFFF	NH:i:2	HI:i:1	AS:i:207	nM:i:2	RG:Z:GRPundef	XS:A:-

And here is a zoom-in screenshot,
image
Best,
xiucz

Thanks for your reply, I grep one soft-clip read in STARfusion.
arriba

$samtools view Aligned.sortedByCoord.out.bam|grep  V350121616L4C005R0660558367
V350121616L4C005R0660558367	99	chr10	51381825	255	22M2452N59M3295N69M	=	51387704	351037	GGTCCAAACCAGGAAATGTACAGATTTTGAATATGGTGACTAGGCAAGGGGCACTTTGGGCTAATACTCTAGGTTCTCTGGCGTTGCTCTATAGTGCATTTGGTGTCATCATTGAGAAAACACGAGGTGCAGAAGATGACCTTAACACAG	FFFFGGFF>FFF<FFFGFFBFDFFEFFFFFGG??FFFFGFFGFFFGFFDDFDFFFGFGFFFFGFGFFFGFCFFFGFG<GGFFEFFEFFGFAGFGFFFFGFGFFFGFFCFFDFFFFFFFFFFFFFEGFFFFGFFFFA=FF?FGGGFFFFGF	NH:i:1	HI:i:1	AS:i:295	nM:i:2	NM:i:2
V350121616L4C005R0660558367	147	chr10	51387704	255	60M345008N90M	=	51381825	-351037	AAGATGACCTTAACACAGTAGCAGCTGGAACCATGACAGGCATGTTGTATAAATGTACAGTGTCAGAGATGGCTTTGGATTCCCCGTTCTGTGTGCTGCTGTCTGGCTCCTGAACCCAGCTGTAGAGGTGTGTGTCAATCCCAACTGGTG	9@GGFGGFFBAFFFGFFEAFDEFFGGGGCFGCGGCFFFGFEFFFFFGGGCGGBGGEFEGFGG7FGFCGGDFFGDFEGFGGF<GFFFG>FFGGEEGGFGFFGEGFFFGFFGGGFFGFCGFGGGBGGGGGEFFGGGAFFFFFDGFFGGFEGF	NH:i:1	HI:i:1	AS:i:295	nM:i:2	NM:i:0

STARfusion

$ samtools view Aligned.out.sort.bam |grep  V350121616L4C005R0660558367
V350121616L4C005R0660558367	355	chr10	51381825	3	22M2452N59M3295N69M	=	51387704	5939	GGTCCAAACCAGGAAATGTACAGATTTTGAATATGGTGACTAGGCAAGGGGCACTTTGGGCTAATACTCTAGGTTCTCTGGCGTTGCTCTATAGTGCATTTGGTGTCATCATTGAGAAAACACGAGGTGCAGAAGATGACCTTAACACAG	FFFFGGFF>FFF<FFFGFFBFDFFEFFFFFGG??FFFFGFFGFFFGFFDDFDFFFGFGFFFFGFGFFFGFCFFFGFG<GGFFEFFEFFGFAGFGFFFFGFGFFFGFFCFFDFFFFFFFFFFFFFEGFFFFGFFFFA=FF?FGGGFFFFGF	NH:i:2	HI:i:2	AS:i:207	nM:i:2	RG:Z:GRPundef	XS:A:+
V350121616L4C005R0660558367	403	chr10	51387704	3	60M90S	=	51381825	-5939	AAGATGACCTTAACACAGTAGCAGCTGGAACCATGACAGGCATGTTGTATAAATGTACAGTGTCAGAGATGGCTTTGGATTCCCCGTTCTGTGTGCTGCTGTCTGGCTCCTGAACCCAGCTGTAGAGGTGTGTGTCAATCCCAACTGGTG	9@GGFGGFFBAFFFGFFEAFDEFFGGGGCFGCGGCFFFGFEFFFFFGGGCGGBGGEFEGFGG7FGFCGGDFFGDFEGFGGF<GFFFG>FFGGEEGGFGFFGEGFFFGFFGGGFFGFCGFGGGBGGGGGEFFGGGAFFFFFDGFFGGFEGF	NH:i:2	HI:i:2	AS:i:207	nM:i:2	RG:Z:GRPundef	XS:A:+
V350121616L4C005R0660558367	163	chr10	51606988	3	90S60M	=	51607030	5938	CACCAGTTGGGATTGACACACACCTCTACAGCTGGGTTCAGGAGCCAGACAGCAGCACACAGAACGGGGAATCCAAAGCCATCTCTGACACTGTACATTTATACAACATGCCTGTCATGGTTCCAGCTGCTACTGTGTTAAGGTCATCTT	FGEFGGFFGDFFFFFAGGGFFEGGGGGBGGGFGCFGFFGGGFFGFFFGEGFFGFGGEEGGFF>GFFFG<FGGFGEFDGFFDGGCFGF7GGFGEFEGGBGGCGGGFFFFFEFGFFFCGGCGFCGGGGFFEDFAEFFGFFFABFFGGFGG@9	NH:i:2	HI:i:1	AS:i:207	nM:i:2	RG:Z:GRPundef	XS:A:-
V350121616L4C005R0660558367	83	chr10	51607030	3	69M3294N59M2452N22M	=	51606988	-5938	CTGTGTTAAGGTCATCTTCTGCACCTCGTGTTTTCTCAATGATGACACCAAATGCACTATAGAGCAACGCCAGAGAACCTAGAGTATTAGCCCAAAGTGCCCCTTGCCTAGTCACCATATTCAAAATCTGTACATTTCCTGGTTTGGACC	FGFFFFGGGF?FF=AFFFFGFFFFGEFFFFFFFFFFFFFDFFCFFGFFFGFGFFFFGFGAFGFFEFFEFFGG<GFGFFFCFGFFFGFGFFFFGFGFFFDFDDFFGFFFGFFGFFFF??GGFFFFFEFFDFBFFGFFF<FFF>FFGGFFFF	NH:i:2	HI:i:1	AS:i:207	nM:i:2	RG:Z:GRPundef	XS:A:-

Alignment stat:
image
And here is zoom-in screenshot for the select read:
image
Best,
xiucz

Thanks for providing the details.

Without any further evidence, I would argue that Arriba is right here. It finds a better alignment than STAR-Fusion. The alignment proposed by Arriba is linear and even matches an annotated transcript of a gene perfectly (namely, NR_158654.1 of gene TIMM23B-AGAP6). The alignment proposed by STAR-Fusion has the same number of mismatches, but it is chimeric and therefore has a poorer alignment score (which is why Arriba discards it).

The reason why STAR-Fusion prefers the chimeric alignment over the linear one is simply because it uses a very low value for --alignIntronMax (if I remember correctly 100kb). This value is not biologically meaningful, since many splicing events are larger than this. STAR-Fusion has to use such a low value, because otherwise it would not be able to detect fusions from focal deletions. Arriba does not have this limitation and can use larger values for --alignIntronMax. For this reason, it finds the linear alignment, which is most likely an effect of normal splicing.

Do you have any reason to believe that this fusion is real? For example, do you have structural variants from whole-genome sequencing data to confirm this fusion? If not, then I find it likely that the fusion prediction made by STAR-Fusion is wrong in this case.

Thank you, and I will provide further information if I have any.

In the end, I could not further validate this sample. However, the same fusion event was detected in the WBC of the patient, leading me to believe that STARfusion's fusion event is erroneous. Thank you for your assistance.

Thanks for your feedback!