ratschlab/spladder

Is_annotated always 3 on negative strand

perllb opened this issue · 5 comments

perllb commented
  • spladder version:v3.0.4
  • Python version:v3.8
  • Operating System:Ubuntu

Description

I observe in several samples that all exon-skip events on negative strand are "is_annotated" = 3.
While events on positive strand has all "is_annotated" values.
This seems a bit suspicious?

  • A general request - there is no description of what the "is_annotated" value actually means in the documentation (at least I cannot find it).

What I Did

  • Run spladder
spladder build \
      --bams ${bam} \
      --annotation  ref.gtf \
      --outdir \$(pwd) \
      --verbose \
      --parallel ${task.cpus} \
      --readlen 150 \
      --confidence 3  \
      --event-types exon_skip,intron_retention,alt_3prime,alt_5prime,mult_exon_skip,mutex_exons \
      --ignore-mismatches
  • Analyse is_annotation for pr strand
> spladder = pd.read_csv(f"merge_graphs_exon_skip_C3.confirmed.txt.gz",sep="\t")
> Counter(list(spladder[spladder["strand"] == "+"]["is_annotated"]))
Counter({1: 1245, 3: 5038, 2: 2195, 0: 727})

> Counter(list(spladder[spladder["strand"] == "-"]["is_annotated"]))
Counter({3: 3192})

Same pattern for 5 different samples.

riasc commented

Was wondering about the same thing. Have you found any explanation for this?

perllb commented

I have not.

Taking a look at the code.

spladder/classes/event.py:

def set_annotation_flag(self, anno_introns):

        self.sort_exons()
        
        ### check annotation status of isoform 1
        self.annotated = 3
        for i in range(self.exons1.shape[0] - 1):
            if not (self.exons1[i, 1], self.exons1[i + 1, 0]) in anno_introns:
                self.annotated -= 1
                break

        ### check annotation status of isoform 2
        if self.exons2.shape[0] < 2:
            return
        for i in range(self.exons2.shape[0] - 1):
            if not (self.exons2[i, 1], self.exons2[i + 1, 0]) in anno_introns:
                self.annotated -= 2
                break

So it sets is_annotated = 3 by default.
Then if the given intron (defined by the neighbouring exon-junctions) is NOT in "anno_introns", then it will alter the is_annotated accordingly. That is, each exon-exon junction of the isoform must be in the annotation gtf for the is_annotated to stay "3".

I do not see how this can fail for only genes on negative strand?

Other options:

  • Could the self.exons1/2.shape[0] return 0 ? This will basically skip the for-loop and thus not alter the is_annotated value.
perllb commented

I suspect the issue is "more serious" than the assignment of the is_annotated value itself.

The total number of exon-skip events detected on negative strand is much lower than on positive, suggesting that the problem is not in assigning the is_annotated value, but rather that the actual detection of negative strand events have issues:

> spladder = pd.read_csv(f"merge_graphs_exon_skip_C3.confirmed.txt.gz",sep="\t")
> Counter(list(spladder[spladder["strand"] == "+"]["is_annotated"]))
Counter({1: 1245, 3: 5038, 2: 2195, 0: 727})

> Counter(list(spladder[spladder["strand"] == "-"]["is_annotated"]))
Counter({3: 3192})
perllb commented

Seems like running with STAR-mapped bam causes this issue.
While running with HISAT2-mapped bam does not.

perllb commented

This issue seems to be solved by adding the following parameters to STAR:

   --outSAMstrandField intronMotif \
   --outSAMattributes NH HI NM MD AS XS \

STAR manual: https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf

For unstranded RNA-seq data, Cufflinks/Cuffdiff require spliced alignments with XS strand attribute,
which STAR will generate with --outSAMstrandField intronMotif option. As required, the XS
strand attribute will be generated for all alignments that contain splice junctions. The spliced
alignments that have undefined strand (i.e. containing only non-canonical unannotated junctions)
will be suppressed.