gpertea/stringtie

one StringTie gene/loci (MSTRG.#) matching multiple reference gene names

Closed this issue · 2 comments

Hi:
I am trying to analyse differential gene expression in Zea mays, which has a well-annotated genome, following the protocol described by Pertea et al., 2016 (doi:10.1038/nprot.2016.095). One issue is that a single StringTie gene/locus (MSTRG.#) can match multiple reference gene names (official ref_gene_id), as shown here:
[screenshot: qq 20180315135358]
Because the downstream analysis will be based on the official ref_gene_id, a StringTie internal gene ID associated with several ref_gene_ids causes confusion.
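To see how many loci are affected, the merged annotation can be scanned for gene IDs whose transcripts carry more than one ref_gene_id. The following is only a minimal sketch: it assumes the GTF has `gene_id` and `ref_gene_id` attributes on its `transcript` lines, and the function name and the example records (MSTRG.1, GeneA, and so on) are made up for illustration.

```python
import re
from collections import defaultdict

# Matches key "value"; pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def loci_with_multiple_ref_genes(gtf_lines):
    """Return {gene_id: sorted ref_gene_ids} for loci linked to more than one ref_gene_id."""
    refs = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript records are inspected
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" in attrs and "ref_gene_id" in attrs:
            refs[attrs["gene_id"]].add(attrs["ref_gene_id"])
    return {gene: sorted(ids) for gene, ids in refs.items() if len(ids) > 1}


# Hypothetical records, not taken from the issue's data.
example_lines = [
    'chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "T1"; ref_gene_id "GeneA";',
    'chr1\tStringTie\ttranscript\t800\t1500\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "T2"; ref_gene_id "GeneB";',
    'chr1\tStringTie\ttranscript\t5000\t6000\t1000\t-\t.\tgene_id "MSTRG.2"; transcript_id "T3"; ref_gene_id "GeneC";',
    'chr1\tStringTie\ttranscript\t7000\t8000\t1000\t-\t.\tgene_id "MSTRG.3"; transcript_id "MSTRG.3.1";',
]

if __name__ == "__main__":
    for gene, ref_ids in loci_with_multiple_ref_genes(example_lines).items():
        print(gene, ",".join(ref_ids))
```

On a real file, the same function could be called with an open file handle, since it only iterates over lines.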
As is known, StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. The flowchart of the protocol described by Pertea et al., 2016 (doi:10.1038/nprot.2016.095) is as follows:

[screenshot: qq 20180315134000]
Does this flowchart include the optional de novo assembly step, and is that what leads to one StringTie gene/locus (MSTRG.#) matching multiple reference gene names? If so, what if I analyse differential gene expression following the alternate protocol instead: "An alternate, faster differential expression analysis workflow can be pursued if there is no interest in novel isoforms (i.e. assembled transcripts present in the samples but missing from the reference annotation), or if only a well known set of transcripts of interest are targeted by the analysis. This simplified protocol has only 3 steps (depicted below) as it bypasses the individual assembly of each RNA-Seq sample and the "transcript merge" step."? Would that avoid the case where one internally generated locus ID such as MSTRG.18605 has several gene names?
[screenshot: qq 20180315143804]

Not sure if I understood the question completely but I think the answer is yes, if you run the alternate, faster protocol which does not attempt to assemble any novel genes, the output will only have expression level values for the known transcripts given in the reference annotation, and no merge step is needed, so there will be no internally generated IDs like MSTRG.# or STRG.# that could possibly join multiple reference gene IDs.
So if you really do not care about any novel transcripts/isoforms and trust the completeness and correctness of your reference annotation, give this alternate protocol a try and see if the output is now better suited for your project. You could then use the prepDE script on the GTF files generated by these stringtie runs (if you prefer using DESeq2 or other DE analysis tools instead of Ballgown).
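As a rough sketch of what the alternate protocol might look like per sample, the snippet below only builds the command lines without running them. It assumes StringTie's documented `-e` (quantify reference transcripts only), `-B` (write Ballgown tables), `-G`, `-o` and `-p` options, and the `-i` sample-list input of prepDE.py. The helper names, sample IDs and file paths are hypothetical, and the exact prepDE invocation may differ between StringTie releases.

```python
def stringtie_quant_cmd(bam, ref_gtf, out_gtf, threads=1):
    """Build a StringTie command that quantifies reference transcripts only.

    -e restricts output to transcripts in the -G annotation, so no novel
    MSTRG.# loci are assembled and no merge step is needed.
    """
    return ["stringtie", "-e", "-B", "-p", str(threads),
            "-G", ref_gtf, "-o", out_gtf, bam]


def prepde_sample_list(samples):
    """Format (sample_id, gtf_path) pairs as one whitespace-separated pair per line."""
    return "".join(f"{sample_id} {gtf_path}\n" for sample_id, gtf_path in samples)


if __name__ == "__main__":
    # Hypothetical sample names and paths, for illustration only.
    samples = [("sample1", "ballgown/sample1/sample1.gtf"),
               ("sample2", "ballgown/sample2/sample2.gtf")]
    for sample_id, gtf_path in samples:
        cmd = stringtie_quant_cmd(f"{sample_id}.bam", "reference.gtf", gtf_path, threads=4)
        print(" ".join(cmd))
    # The list below would be saved to a file and passed to prepDE with -i.
    print(prepde_sample_list(samples), end="")
    print("prepDE.py -i sample_lst.txt")
```

The printed commands could then be run in a shell or passed to `subprocess.run`; the resulting count matrices would be keyed by the reference gene and transcript IDs rather than by internally generated ones.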

Thank you for your timely reply, it works.