Function of STAR parameter outSAMtype BAM Unsorted in Arriba run? Which GTF is suitable? How to remove gene fusion from the benign samples?
Hi,
Thank you for developing such an amazing tool for identifying gene fusions from RNA-Seq data. I have total RNA-Seq data from 30 cancer samples and 10 benign samples. My goal is to identify DEGs and gene fusions from these samples.
I have a few questions:
Q1: In the documentation it is mentioned that in the Arriba direct run, STAR will run using the following two parameters:
--outStd BAM_Unsorted --outSAMtype BAM Unsorted
My question is: what will happen if I run STAR separately with the parameters --outSAMtype BAM SortedByCoordinate and --quantMode TranscriptomeSAM?
And if I run STAR with --outStd BAM_Unsorted --outSAMtype BAM Unsorted, will there be any issues in my downstream DEG analysis?
Q2: I am using the UCSC hg38 genome to align my data. In that case, which GTF file should I use: the refGene GTF (UCSC) or the GENCODE GTF? I ask because the Arriba documentation mentions the GENCODE GTF file.
Q3: What should my pipeline look like if I want to identify only tumor-specific fusions (e.g., using the benign samples as controls to remove common fusions)?
Thank you.
Regards,
Tanay
Q1: It makes no difference whether you use --outStd BAM_Unsorted --outSAMtype BAM Unsorted or --outSAMtype BAM SortedByCoordinate. Arriba doesn't care if the alignments are sorted or not. The documentation doesn't sort the alignments, because it's not needed for fusion calling and would thus be a waste of CPU. (To be precise, there may be very minor differences in the fusion calls when you use sorting, but they are not any more or less meaningful than when not sorting. So do whatever is best in your workflow.)
When you use TranscriptomeSAM, make sure you pass the regular BAM file (containing genomic coordinates) to Arriba. Do not pass the file *.toTranscriptome.out.bam to Arriba! Arriba needs genomic coordinates.
Q2: Feel free to use a UCSC GTF file if that is what you usually use. GENCODE works a bit better for fusion calling in my experience, because it has more detailed annotation. But if you normally use UCSC, then it would be complicated to compare Arriba's output based on GENCODE to your other results based on UCSC.
There is one very important exception: If your cancer samples are from hematologic malignancies, then you should use GENCODE, because UCSC does not annotate the T-cell receptor loci. Arriba can only call fusions for annotated genes.
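Since Arriba can only call fusions for annotated genes, it can be worth checking whether the genes you care about appear in your GTF at all. The following is a minimal sketch, not part of Arriba. It assumes the GTF carries a gene_name attribute in its ninth column; the function names are made up for illustration, and TRAC and TRBC1 are used only as example T-cell receptor gene symbols.

```python
import re

# Assumption: the GTF stores gene symbols as gene_name "SYMBOL"; in column 9.
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def annotated_gene_names(gtf_lines):
    """Collect every gene_name found in the attribute column of GTF lines."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = GENE_NAME.search(fields[8])
        if match:
            names.add(match.group(1))
    return names


def missing_genes(gtf_lines, genes_of_interest):
    """Return the genes of interest that the GTF does not annotate, sorted."""
    return sorted(set(genes_of_interest) - annotated_gene_names(gtf_lines))
```

To use it on a real file, pass an open file handle, for example missing_genes(open("annotation.gtf"), ["TRAC", "TRBC1"]), where annotation.gtf is a placeholder for your own GTF path. An empty list means all listed genes are annotated.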
Q3: Run Arriba separately on both normal and malignant samples. Then, take all fusions found in the main output file (-o fusions.tsv) and in the discarded output file (-O discarded_fusions.tsv) of the control samples. Subtract these fusions from the malignant samples. The subtraction should be done using breakpoint coordinates (as listed in the columns breakpoint1/2). Do not subtract fusions by gene name.
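The subtraction described above could be sketched as follows. This is an illustration under stated assumptions, not Arriba's own tooling: it assumes the output files are tab-separated with a header line that names the columns breakpoint1 and breakpoint2 (possibly with a leading # on the first column name), it matches breakpoints by exact coordinates, and it treats a breakpoint pair as the same fusion regardless of which partner is listed first. All function names are hypothetical.

```python
import csv


def read_fusion_rows(path):
    """Read a tab-separated fusion file into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        # Assumption: the first line is a header; drop a leading '#' if present.
        header = [name.lstrip("#") for name in next(reader)]
        return [dict(zip(header, fields)) for fields in reader if fields]


def breakpoint_key(row):
    """Order-independent key built from the two breakpoint coordinates."""
    return tuple(sorted((row["breakpoint1"], row["breakpoint2"])))


def subtract_by_breakpoints(tumor_rows, control_rows):
    """Keep tumor fusions whose breakpoint pair never occurs in the controls."""
    seen_in_controls = {breakpoint_key(row) for row in control_rows}
    return [row for row in tumor_rows if breakpoint_key(row) not in seen_in_controls]
```

For control_rows, concatenate the rows read from both fusions.tsv and discarded_fusions.tsv of every control sample, then call subtract_by_breakpoints once per malignant sample. Gene names are deliberately not part of the key.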
Let me know if anything is unclear.
Thank you for the response. My samples are not from any hematologic malignancies. So I should either use the refGene file or, if I really want to use the GENCODE GTF, use the Ensembl genome FASTA file instead of the UCSC hg38 FASTA.
Thanks again!