salmon quantmerge skipped the nucleotide IDs that have multiple sequences - Metagenome dataset
Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
The issue existed in both bulk and single-cell mode
Describe the bug
When using Salmon to quantify non-redundant (NR) genes in metagenomic datasets, the generated output is missing a summary for nucleotide IDs that correspond to multiple sequences.
To Reproduce
Steps and data to reproduce the behavior:
- Merging quantifications with Salmon:
salmon quantmerge \
--quants temp/salmon/L1EHI0900465--Q_S1_N6.quant \
-o result/salmon/gene_L1EHI0900465--Q_S1_N6.TPM
- Searching for a specific gene ID in the quantification file:
grep "k141_1346622_1" temp/salmon/L1EHI0900465--Q_S1_N6.quant/quant.sf
Multiple lines are found for this gene ID
- Searching for the same gene ID in the resulting TPM file:
grep "k141_1346622_1" result/salmon/gene_L1EHI0900465--Q_S1_N6.TPM
#No results are found, which is unexpected
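To see how many IDs are affected rather than grepping for one at a time, a small Python sketch like the following could list every Name that occurs on more than one row of quant.sf. This is not part of salmon; the function name `duplicated_names` is hypothetical, and it assumes the usual tab-separated quant.sf layout with a header row.

```python
from collections import Counter


def duplicated_names(quant_lines):
    """Return {name: count} for Name values that appear on more than one row.

    Assumes tab-separated quant.sf lines with a header row first.
    """
    rows = iter(quant_lines)
    next(rows, None)  # skip the header row (Name, Length, ...)
    counts = Counter(
        line.rstrip("\n").split("\t")[0] for line in rows if line.strip()
    )
    return {name: n for name, n in counts.items() if n > 1}


# Usage (path taken from the commands above):
# with open("temp/salmon/L1EHI0900465--Q_S1_N6.quant/quant.sf") as fh:
#     for name, n in duplicated_names(fh).items():
#         print(name, n)
```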
Specifically, please provide at least the following information:
- Which version of salmon was used? salmon 1.4.0
- How was salmon installed (compiled, downloaded executable, through bioconda)? conda install salmon -y
- Which reference (e.g. transcriptome) was used? metagenome data
- Which read files were used? L1EHI0900465--Q_S1_N6.quant/
- Which program options were used?
salmon quantmerge \
--quants temp/salmon/L1EHI0900465--Q_S1_N6.quant \
-o result/salmon/gene_L1EHI0900465--Q_S1_N6.TPM
Expected behavior
I would like to keep all gene IDs and, for IDs that appear on more than one line, take the average of the values for each gene ID.
Additional context
Updated Expected behavior:
I aim to retain all gene IDs, and for those represented by multiple lines, I intend to calculate the sum of values for each unique gene ID.
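As a workaround outside of salmon, the summing described above could be done directly on quant.sf before any merging. The following is only a sketch under assumptions: the function name `sum_by_name` is hypothetical, the input is the tab-separated quant.sf with its standard header, and only the TPM and NumReads columns are summed (Length and EffectiveLength are not meaningful to add up, so they are dropped).

```python
import csv
from collections import defaultdict


def sum_by_name(quant_lines, columns=("TPM", "NumReads")):
    """Sum the given numeric columns over all rows sharing the same Name.

    quant_lines: iterable of tab-separated quant.sf lines, header included.
    Returns {name: {column: summed_value}}.
    """
    reader = csv.DictReader(quant_lines, delimiter="\t")
    totals = defaultdict(lambda: dict.fromkeys(columns, 0.0))
    for row in reader:
        for col in columns:
            totals[row["Name"]][col] += float(row[col])
    return {name: dict(vals) for name, vals in totals.items()}


# Usage (path taken from the commands above):
# with open("temp/salmon/L1EHI0900465--Q_S1_N6.quant/quant.sf") as fh:
#     per_gene = sum_by_name(fh)
```

The per-gene totals can then be written out one column per sample, which gives a table similar in shape to the quantmerge output while keeping every gene ID.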
I came across a few posts regarding this issue, but have not found a good solution for salmon quantmerge yet.
In 2018, issue #214 suggested --keepDuplicates for dealing with duplicate transcripts. The FAQ at https://combine-lab.github.io/salmon/faq/ also says: "If you really want to go through with quantification of sequence duplicates. You can pass --keepDuplicates to the salmon indexing command. This will tell salmon not to discard these duplicates, and they will appear in the output quantifications." As I understand it, though, that option is for sequence-identical duplicates. In our case the sequences and their full annotations differ, but the shortened gene ID (the part before the first "#") can be identical for multiple sequences.
e.g.,
k97_3_1 # 1 # 534 # 1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.672
k97_3_1
- After the salmon quant step, the gene ID is shortened, but all entries are kept, even when the same gene ID appears with different lengths, etc.
Name Length EffectiveLength TPM NumReads
k97_3_1 534 216.520 0.000000 0.000
k97_5_1 384 99.234 0.000000 0.000
k97_6_1 333 73.044 0.000000 0.000
k97_9_1 387 101.041 0.000000 0.000
- However, at the salmon quantmerge step, the gene IDs with multiple sequences are removed.
Name NP1.clean.quant
k141_743617_3 0
k141_742060_5 0
k141_910930_3 0.015907
k141_1078715_3 0
k141_527785_4 0
This causes the whole dataset to lose much of its gene information, since genes with multiple sequences may play important biological roles. So I think I need to take some action to keep all the genes, by relabeling those that have multiple sequences according to their order. I am not sure whether this is something I can do through salmon quantmerge.
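One way the relabeling could be done is on the FASTA headers before running salmon index, so that every sequence gets a unique ID. The sketch below is an assumption-laden illustration, not a salmon feature: the function name `relabel_fasta_headers` and the `__dup` suffix are hypothetical, the ID is taken to be the header text up to the first whitespace, and it assumes no existing ID already ends in such a suffix.

```python
from collections import Counter


def relabel_fasta_headers(fasta_lines, sep="__dup"):
    """Yield FASTA lines with repeated IDs made unique by an ordinal suffix.

    The first occurrence of an ID is left unchanged; the second becomes
    ID__dup2, the third ID__dup3, and so on. The rest of the header
    (everything after the first whitespace) is preserved.
    """
    seen = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\n")
            parts = header.split(None, 1)
            name = parts[0] if parts else ""
            rest = parts[1] if len(parts) > 1 else ""
            seen[name] += 1
            if seen[name] > 1:
                name = f"{name}{sep}{seen[name]}"
            yield ">" + name + (" " + rest if rest else "") + "\n"
        else:
            # Sequence lines pass through untouched.
            yield line


# Usage (file names here are placeholders, not from the original commands):
# with open("genes.fa") as src, open("genes.unique.fa", "w") as dst:
#     dst.writelines(relabel_fasta_headers(src))
```

After re-indexing and re-quantifying against the relabeled FASTA, each sequence would carry its own name in quant.sf, and the suffix can be stripped afterwards if per-gene sums are wanted.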