hputnam/Express_Compare

Stringtie outputs mainly STRG ids rather than gene ids

Opened this issue · 5 comments

Previously, our team was not able to successfully output a gene count matrix from Stringtie. I was able to run Stringtie on Andromeda and output a gene count matrix but many of the rownames are designated as STRG rather than a gene id name. This becomes difficult to perform GOSeq to determine gene function in a downstream analysis as GOSeq can only use gene ids from an annotation to assign functionality. This has been particularly problematic for the Pocillipora acuta dataset.

We have determined that STRG ids are generated by Stringtie in the case of novel transcripts or transcripts not aligning with the genome. This should not be occurring because we are using the following commands:

Reference annotation transcripts (-G)
A reference annotation file in GTF or GFF3 format can be provided to StringTie using the -G option which can be used as 'guides' for the assembly process and help improve the transcript structure recovery for those transcripts.

NOTE: we highly recommend that you provide annotation if you are analyzing a genome that is well annotated, such as human, mouse, or other model organisms.

Note that when a reference transcript is fully covered by input read alignments, the original transcript ID from the reference annotation file will be shown in StringTie's output file in the reference_id GTF attribute for that assembled transcript. Output transcripts lacking the reference_id attribute can be considered "novel" transcript structures with respect to the provided reference annotation.

Expression estimation mode (-e)
When the -e option is used, the reference annotation file -G is a required input and StringTie will not attempt to assemble the input read alignments but instead it will only estimate the expression levels of the "reference" transcripts provided in the -G file.

With this option, no "novel" transcript assemblies (isoforms) will be produced, and read alignments not overlapping any of the given reference transcripts will be ignored, which may provide a considerable speed boost when the given set of reference transcripts is limited to a set of target genes for example.

@laurenzane will rerun Stringtie with the -A <gene_abund.tab> option to better understand how many gene ids are being recovered from the RNA-seq dataset.

@laurenzane please add your code here that generated the output with STRG

@laurenzane also check out the following if you want non-redundant transcripts:

  1. Merge mode: Merged GTF
    If StringTie is run with the --merge option, it takes as input a list of GTF/GFF files and merges/assembles these transcripts into a non-redundant set of transcripts. This step creates a uniform set of transcripts for all samples to facilitate the downstream calculation of differentially expressed levels for all transcripts among the different experimental conditions. Output is a merged GTF file with all merged gene models, but without any numeric results on coverage, FPKM, and TPM. Then, with this merged GTF, StringTie can re-estimate abundances by running it again with the -e option on the original set of alignment files, as illustrated in the figure below.

This may also be of interest:

The R package IsoformSwitchAnalyzeR can be used to assign gene names to transcripts assembled by StringTie, which can be particularly helpful in cases where StringTie could not perform this assignment unambigiously.
The importIsoformExpression() + importRdata() function of the package can be used to import the expression and annotation data into R. During this import the package will attempt to clean up and recover isoform annotations where possible. From the resulting switchAnalyzeRlist object, IsoformSwitchAnalyzeR can detect isoform switches along with predicted functional consequences. The extractGeneExpression() function can be used to get a gene expression (read count) matrix for analysis with other tools.
More information and code examples can be found here.

@laurenzane please also include here or link to your counts matrix output so we can see both the code and the output that has already been generated

I am pretty sure you are just at the step before merge as shown here: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#output
Screen Shot 2022-07-27 at 11 32 24 AM

My thoughts:

If you want transcript level information and isoforms, you will need to get the parent information from the gtf, or blast again to obtain the GO information

If you want gene level information, you need to run the --merge option and you will only have gene names.