Understanding novel mode outputs
Closed this issue · 2 comments
I'm trying to better understand the output from the novel
mode. The description from the usage states:
The output is a GTF file with a list of genes/isoforms and their relative abundance.
However, while it appears to somewhat follow GTF conventions, I'm not sure how to interpret it.
Example output:
head HCC1395-Pt01-ar-711.liqa_novel_isos
GeneName IsoformName ReadPerGene_corrected RelativeAbundance infor_ratio
AC090197.1 chr8 + 23336170 23366125 ENST00000519692.1,ENST00000517420.1,
AC090197.1 chr8 + 23336170 23336206 0,1,
AC090197.1 chr8 + 23336207 23337457 1,1,
AC090197.1 chr8 + 23341025 23341149 1,1,
AC090197.1 chr8 + 23341150 23341536 0,1,
AC090197.1 chr8 + 23363125 23366125 1,0,
LINC02774 chr1 + 243917401 244047317 ENST00000440494.1,ENST00000652928.1,
LINC02774 chr1 + 243917401 243917449 1,0,
LINC02774 chr1 + 243917450 243917526 1,1,
Let's consider the AC090197.1
portion specifically -- it's clear the first line is a "header" line for the gene which lists the gene, chromosome, strand, the start position of all possible splice variants, the end position of all possible splice variants, and the canonical transcripts that are observed with these splice variants.
So, for AC090197.1 chr8 + 23336170 23336206 0,1,
, I believe it's stating that there is an included region at chr8:23336170-23336206
that is observed in ENST00000517420.1
, but not ENST00000519692.1
.
My questions:
- Am I interpreting the output correctly?
- How would situations of missing sequence (e.g. skipped exons) be represented in this format?
Actually, it appears the contents of the novel
mode are the contents of the refgene
reference with the header from the quantify
mode:
[spvensko@c6145-2-8 d93dc5a2a20de44d772037ba700622]$ head *refgene*
AC090197.1 chr8 + 23336170 23366125 ENST00000519692.1,ENST00000517420.1,
AC090197.1 chr8 + 23336170 23336206 0,1,
AC090197.1 chr8 + 23336207 23337457 1,1,
AC090197.1 chr8 + 23341025 23341149 1,1,
AC090197.1 chr8 + 23341150 23341536 0,1,
AC090197.1 chr8 + 23363125 23366125 1,0,
LINC02774 chr1 + 243917401 244047317 ENST00000440494.1,ENST00000652928.1,
LINC02774 chr1 + 243917401 243917449 1,0,
LINC02774 chr1 + 243917450 243917526 1,1,
LINC02774 chr1 + 243978409 243978519 1,1,
[spvensko@c6145-2-8 d93dc5a2a20de44d772037ba700622]$ head *novel*
GeneName IsoformName ReadPerGene_corrected RelativeAbundance infor_ratio
AC090197.1 chr8 + 23336170 23366125 ENST00000519692.1,ENST00000517420.1,
AC090197.1 chr8 + 23336170 23336206 0,1,
AC090197.1 chr8 + 23336207 23337457 1,1,
AC090197.1 chr8 + 23341025 23341149 1,1,
AC090197.1 chr8 + 23341150 23341536 0,1,
AC090197.1 chr8 + 23363125 23366125 1,0,
LINC02774 chr1 + 243917401 244047317 ENST00000440494.1,ENST00000652928.1,
LINC02774 chr1 + 243917401 243917449 1,0,
LINC02774 chr1 + 243917450 243917526 1,1,
The tool is still running, so perhaps the file will be further modified prior to completion?
Here's how I'm running novel
mode:
liqa -task novel -refgene gencode.v37.annotation.with.hervs.gtf.refgene -bam HCC1395-Pt01-ar-711.sorted.bam -out HCC1395-Pt01-ar-711.liqa_novel_isos -num_cover 20 -num_support_read 10
Yes, this is the correct interpretation of the results, and will address the issue with the header from issue #42 - thanks for pointing that out! The output should be different from the refgene file based on coverage/read parameters that you have provided. Are you still having an issue with that?