Is_annotated always 3 on negative strand
perllb opened this issue · 5 comments
- spladder version:v3.0.4
- Python version:v3.8
- Operating System:Ubuntu
Description
I observe in several samples that all exon-skip events on negative strand are "is_annotated" = 3.
While events on positive strand has all "is_annotated" values.
This seems a bit suspicious?
- A general request - there is no description of what the "is_annotated" value actually means in the documentation (at least I cannot find it).
What I Did
- Run spladder
spladder build \
--bams ${bam} \
--annotation ref.gtf \
--outdir \$(pwd) \
--verbose \
--parallel ${task.cpus} \
--readlen 150 \
--confidence 3 \
--event-types exon_skip,intron_retention,alt_3prime,alt_5prime,mult_exon_skip,mutex_exons \
--ignore-mismatches
- Analyse is_annotation for pr strand
> spladder = pd.read_csv(f"merge_graphs_exon_skip_C3.confirmed.txt.gz",sep="\t")
> Counter(list(spladder[spladder["strand"] == "+"]["is_annotated"]))
Counter({1: 1245, 3: 5038, 2: 2195, 0: 727})
> Counter(list(spladder[spladder["strand"] == "-"]["is_annotated"]))
Counter({3: 3192})
Same pattern for 5 different samples.
Was wondering about the same thing. Have you found any explanation for this?
I have not.
Taking a look at the code.
spladder/classes/event.py
:
def set_annotation_flag(self, anno_introns):
self.sort_exons()
### check annotation status of isoform 1
self.annotated = 3
for i in range(self.exons1.shape[0] - 1):
if not (self.exons1[i, 1], self.exons1[i + 1, 0]) in anno_introns:
self.annotated -= 1
break
### check annotation status of isoform 2
if self.exons2.shape[0] < 2:
return
for i in range(self.exons2.shape[0] - 1):
if not (self.exons2[i, 1], self.exons2[i + 1, 0]) in anno_introns:
self.annotated -= 2
break
So it sets is_annotated = 3
by default.
Then if the given intron (defined by the neighbouring exon-junctions) is NOT in "anno_introns", then it will alter the is_annotated accordingly. That is, each exon-exon junction of the isoform must be in the annotation gtf for the is_annotated to stay "3".
I do not see how this can fail for only genes on negative strand?
Other options:
- Could the
self.exons1/2.shape[0]
return 0 ? This will basically skip the for-loop and thus not alter the is_annotated value.
I suspect the issue is "more serious" than the assignment of the is_annotated
value itself.
The total number of exon-skip events detected on negative strand is much lower than on positive, suggesting that the problem is not in assigning the is_annotated
value, but rather that the actual detection of negative strand events have issues:
> spladder = pd.read_csv(f"merge_graphs_exon_skip_C3.confirmed.txt.gz",sep="\t")
> Counter(list(spladder[spladder["strand"] == "+"]["is_annotated"]))
Counter({1: 1245, 3: 5038, 2: 2195, 0: 727})
> Counter(list(spladder[spladder["strand"] == "-"]["is_annotated"]))
Counter({3: 3192})
Seems like running with STAR-mapped bam causes this issue.
While running with HISAT2-mapped bam does not.
This issue seems to be solved by adding the following parameters to STAR:
--outSAMstrandField intronMotif \
--outSAMattributes NH HI NM MD AS XS \
STAR manual: https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf
For unstranded RNA-seq data, Cufflinks/Cuffdiff require spliced alignments with XS strand attribute,
which STAR will generate with --outSAMstrandField intronMotif option. As required, the XS
strand attribute will be generated for all alignments that contain splice junctions. The spliced
alignments that have undefined strand (i.e. containing only non-canonical unannotated junctions)
will be suppressed.