COMBINE-lab/salmon

Salmon quant reports transcripts in BAM not found in FASTA, despite GFFREAD correction

majdabdul opened this issue · 2 comments

Hello, I hope you're well!

Context

  1. I performed short-read (Illumina, 150bp PE) and long-read (ONT, cDNA+PCR) RNA-sequencing on 12 human cell line samples (4 conditions, 3 biological replicates each).
  2. I aligned the short-reads to chm13v2.0_RefSeq_Liftoff_v5.1.gff3 using HISAT2, and the long reads to chm13v2.0.fa using Minimap2.
  3. I used StringTie hybrid assembly mode (--mix option) to generate GTFs using both technologies (to take advantage of structural information from long reads and accuracy from short reads).
  4. I then ran stringtie --merge to generate the transcriptome assembly GTF from the sample GTFs.
  5. I took this merged transcriptome and further annotated it using SQANTI3 (with chm13v2.0_RefSeq_Liftoff_v5.1.gtf and chm13v2.0.fa).
  6. I converted both the SQANTI-annotated and the StringTie GTFs into fasta as follows:
    gffread -w merged_transcriptome.fa -g chm13v2.0.fa merged_transcripts.gtf

Now I'm trying to quantify with Salmon.

Firstly, I'm not sure whether it's better to use the short or long reads here as input to Salmon, given that my goal is to identify short peptides (I have peptidomics data) derived from specific splicing events. You may or may not be able to help me with that, but if you have thoughts, I'd appreciate them! Anyway, I decided arbitrarily to use the long-read BAMs as input to Salmon.

Bug description

Secondly, as discussed a little in #104 , I keep running into:
Transcript NM_032515.5 appears in the reference but did not appear in the BAM
and
Transcript chr19 appeared in the BAM header, but was not in the provided FASTA file
(note here that it's an entire chromosome??? And these are the only "transcripts" that don't appear in the fasta- they're all just the chromosome names.)
This happened regardless of whether I used the Stringtie fasta or the SQNATI-annotated fasta.

This is the salmon command I had run:
$salmon quant --ont -t $transcriptome -l SF -a $bam -o $outdir/$name

As suggested, I used gffread to generate a new transcriptome fasta as follows:
gffread -w salmon_fix.fa -g chm13v2.0.fa chm13v2.0_RefSeq_Liftoff_v5.1.gff3

I reran the above salmon command using this new fasta file, but got the same error and warnings: there are transcripts in the BAM not in the fasta and vice versa. Again, the "transcripts" not in the fasta were chromosome names.

I also tried with the short read SAM files, and still got the same error.

I'm not sure how to fix either the warnings or the errors and would really appreciate your help.

Software information

I used salmon v1.10.0 (though I also tried v1.10.1). In the case of v1.10.0, I had downloaded the pre-compiled binary, and in the case of v1.10.1, the admins of the HPC cluster I used installed it- not sure how. Regardless, all the runs were on HPC clusters, which run on Linux CentOS (I use two different HPC clusters, depending on their availability).

Cluster1:

$ uname -a
Linux rescomp1.hpc.in.bmrc.ox.ac.uk 3.10.0-1160.92.1.el7.x86_64 #1 SMP Tue Jun 20 11:48:01 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
-bash: lsb_release: command not found

Cluster2:

$ uname -a
Linux htc-login02 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 8.5.2111
Release:        8.5.2111
Codename:       n/a

Please let me know if I need to provide any more information. Thank you so much!