Doubts about running salmon after using STAR

Question

Opened this issue 5 months ago · 0 comments

desmodus1984 commented 5 months ago

Hi,

I am trying to get TPM values with salmon, but I feel lost and am not sure whether what I am doing is right.
I have RNA-seq data and obtained read counts with STAR, so the reads were mapped to the genome.
From what I have read, I can't get TPM from those counts directly, so I decided to try salmon.
Salmon needs a transcripts.fa file, and I am not sure how to obtain it properly.
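
For context on why counts alone are not enough: TPM divides each feature's count by its length before rescaling the total to one million, so a per-feature length is required. Below is a minimal sketch of that arithmetic in awk. It assumes a hypothetical tab-separated table with the columns name, length in bp, and count; the function name counts_to_tpm and the table layout are illustrative, not something STAR or salmon produces. It also uses the plain annotated length, whereas salmon works with an effective length, so the numbers would not match salmon's output exactly.

```shell
# counts_to_tpm FILE
# FILE (hypothetical layout): tab-separated columns name, length_bp, count.
# Prints name<TAB>TPM, where TPM_i = (count_i / length_i) / sum_j(count_j / length_j) * 1e6.
counts_to_tpm() {
  awk -F'\t' '
    {
      name[NR] = $1
      # Length-normalized rate; features with no usable length contribute 0.
      rate[NR] = ($2 > 0 ? $3 / $2 : 0)
      total += rate[NR]
    }
    END {
      for (i = 1; i <= NR; i++)
        printf "%s\t%.2f\n", name[i], (total > 0 ? rate[i] / total * 1e6 : 0)
    }
  ' "$1"
}
```

Calling counts_to_tpm on such a table prints one feature name and its TPM value per line.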

The documentation says:
If you have reads that have already been aligned to the genome, there are currently 3 options for converting them for use with Salmon. First, you could convert the SAM/BAM file to a FAST{A/Q} file and then use the lightweight-alignment-based mode of Salmon described below.
But there is nothing below it.

Then, it states:
Second, given the converted FASTA{A/Q} file, you could re-align these converted reads directly to the transcripts with your favorite aligner and run Salmon in alignment-based mode as described above.
I can get the FASTQ from the BAM created by STAR, so my question is: which mapper do you recommend? Would bwa-mem2 work well?
Either way, I will still need a transcriptome file, and I will need to index it.
I downloaded the genome from NCBI: https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/207650/100/GCF_011952255.1_Bvos_JDL3184-5_v1.1/
and I don't know which file I should use as the transcriptome; there are cds_from_genomic, protein, and rna files. In one forum, someone mentioned downloading the human transcriptome from Ensembl, but my species is not there.

Also, I found that one can generate it with gffread from the GTF file, so I wanted to confirm whether the output file will work:
gffread -w Bvostranscripts.fa -g BosVos.fasta GCF_011952255.1_Bvos_JDL3184-5_v1.1_genomic.gtf
I ask because I remember reading somewhere that the file must include the transcript length, but it doesn't:
head Bvostranscripts.fa

>XM_033511117.1 CDS=100-1053
ctCCAACGAGAAAGACGTGCTCAGAAGCGAGTCCCGCGTAGCGCCATGTAGAGGAGCAGGAAGAAGCGCA
GTCTTCCTCAGGCATTCCAGCGTCGAGTAATGCTGAAGCATATCCTGACGGGGCAACCGTCGTCAGCGGG
CTCCAGGCACCAGCACAATGACTACCAGGCGAGCTCGGGCTCGTTGCACGGcggccaccaccaccaccat
cattCTGGTCACCAACAGGATCAACAGCAACACCATCATTACAACCAAAACAGCAACGTTCGTGGAACGA
CAGAGGGGAGCCAAGGACGGGGAGTCGACGTTTACGACTCGGTGGTCCATCAAAGAGCCAACGTGCATCA
GGCTGCCACCACGCCATCTTCCACTCACAGACACCCTTTGAATCATTCTCAGTTGAGCGTGAACAATCTT
TCTCAACGATTGAATCACTCGCACGCTCTTAATCTGTCCACGTTGTCCACGTCGAAGCATTCTGTGAACA
GCGTCAGTCCTGTTGCCGGTgggaataacaataacaataataataatctgtcgACTACATTGGGGGTGAT
ATCCCCGGCGCCGCTGCACCAGGACAGCAGACCTAAAGCGAATGGAGGCTTTGATATCTCGAGACTGTCC
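
On the length concern, salmon should derive each transcript's length from the sequence itself, so the FASTA header does not need to carry it. As a sanity check, a minimal awk sketch that lists the name and sequence length of every record is shown here; the function name fasta_lengths is hypothetical, and the name is taken as the header text up to the first whitespace.

```shell
# fasta_lengths FILE
# Prints name<TAB>length for every record in a FASTA file.
# The name is the header up to the first whitespace, without the leading ">".
fasta_lengths() {
  awk '
    /^>/ {
      # A new header: emit the previous record before starting a new one.
      if (name != "") print name "\t" len
      name = substr($1, 2)
      len = 0
      next
    }
    { len += length($0) }   # sequence lines may be wrapped, so accumulate
    END { if (name != "") print name "\t" len }
  ' "$1"
}
```

For example, `fasta_lengths Bvostranscripts.fa | wc -l` gives the number of transcripts, which can be compared with the count reported by NCBI.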

Lastly, I would appreciate it if you could explain the purpose of the decoy sequences.
I tried running the indexing, and it worked, but I am not sure whether indexing with decoys would be better, in addition to my doubt about using the Bvostranscripts.fa file created above.
salmon index -t Bvostranscripts.fa -i Bvos -p 10
Version Server Response: Not Found
index ["Bvos"] did not previously exist . . . creating it
[2024-08-08 13:22:24.073] [jLog] [warning] The salmon index is being built without any decoy sequences. It is recommended that decoy sequence (either computed auxiliary decoy sequence or the genome of the organism) be provided during indexing. Further details can be found at https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode.
....
[2024-08-08 13:22:40.806] [puff::index::jointLog] [info] chunk 4 = [13,537,944, 16,922,430)
[2024-08-08 13:22:40.806] [puff::index::jointLog] [info] chunk 5 = [16,922,430, 20,306,916)
[2024-08-08 13:22:40.806] [puff::index::jointLog] [info] chunk 6 = [20,306,916, 23,691,402)
[2024-08-08 13:22:40.806] [puff::index::jointLog] [info] chunk 7 = [23,691,402, 27,075,888)
[2024-08-08 13:22:40.806] [puff::index::jointLog] [info] chunk 8 = [27,075,888, 30,460,374)
[2024-08-08 13:22:40.806] [puff::index::jointLog] [info] chunk 9 = [30,460,374, 33,844,827)
[2024-08-08 13:22:41.347] [puff::index::jointLog] [info] finished populating pos vector
[2024-08-08 13:22:41.347] [puff::index::jointLog] [info] writing index components
[2024-08-08 13:22:41.425] [puff::index::jointLog] [info] finished writing dense pufferfish index
[2024-08-08 13:22:41.446] [jLog] [info] done building index
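
The warning in the log above refers to decoy-aware indexing. As described in the salmon documentation, the genome can be appended to the transcriptome as decoy sequence, so that reads originating from unannotated genomic regions are less likely to be assigned to a similar-looking transcript. The sketch below prepares the two inputs for that: a list of decoy names and a combined FASTA with the transcripts first and the genome last. The function name make_gentrome and the output directory layout are hypothetical; the salmon command in the trailing comment uses the -d (decoys) option and is left commented out because it needs salmon installed.

```shell
# make_gentrome TRANSCRIPTS_FA GENOME_FA OUTDIR
# Writes OUTDIR/decoys.txt (one genome sequence name per line) and
# OUTDIR/gentrome.fa (transcripts followed by the genome).
make_gentrome() {
  mg_transcripts=$1
  mg_genome=$2
  mg_outdir=$3
  mkdir -p "$mg_outdir"
  # Decoy names: genome header lines without ">" and without the description.
  grep '^>' "$mg_genome" | cut -d ' ' -f 1 | sed 's/^>//' > "$mg_outdir/decoys.txt"
  # Order matters: transcripts first, decoy (genome) sequences last.
  cat "$mg_transcripts" "$mg_genome" > "$mg_outdir/gentrome.fa"
}

# Possible usage with the files from this post (directory name is an assumption):
# make_gentrome Bvostranscripts.fa BosVos.fasta Bvos_decoy
# salmon index -t Bvos_decoy/gentrome.fa -d Bvos_decoy/decoys.txt -i Bvos_decoy/index -p 10
```

Building the index this way takes more time and memory than the transcript-only index, since the whole genome is indexed as well.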

I checked the Bvostranscripts.fa for number of sequences, and it matched the number of "all transcripts" from NCBI.

Looking forward to your response and suggestions to get the TPM properly estimated.

Thank you very much.