Extracting genome structure [from Darth, or not]
rchikhi opened this issue · 18 comments
I think the file [accession].darth/[accession]/[accession].vadr.pass.tbl
gives the genome structure:
E.g.
>Feature ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39
227 21516 gene
gene ORF1ab
227 13429 CDS
13429 21516
product ORF1ab polyprotein
exception ribosomal slippage
protein_id ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39_1
227 13444 CDS
product ORF1a polyprotein
protein_id ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39_2
227 766 mat_peptide
product leader protein
protein_id ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39_1
227 766 mat_peptide
product leader protein
protein_id ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39_2
767 2680 mat_peptide
product nsp2
protein_id ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39_1
[..]
13429 16197
product RNA-dependent RNA polymerase
protein_id ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39_1
13403 13441 mat_peptide
product nsp11
protein_id ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39_2
[..]
This is what we'd like to plot, right?
This information is present in another format in the .vadr.ftr
files. For RdRP's, over all accessions, it looks like:
$ find | grep ftr$ | xargs grep RNA-dep
[..]
ERR3013320.coronaspades.NODE_1_length_28002_cluster_1_candidate_1_domains_38 28002 PASS NC_003436 mat_peptide RNA-dependent_RNA_polymerase:GBSEP:putative_coronavirus_nsp9_(RdRp) 2781 21 + 12571 15350 - no - - - - 2 0 12571..12594:+,12594..15350:+ 12597..12620:+,12620..15376:+ -
ERR3013321.coronaspades.NODE_1_length_28028_cluster_1_candidate_1_domains_37 28028 PASS NC_003436 mat_peptide RNA-dependent_RNA_polymerase:GBSEP:putative_coronavirus_nsp9_(RdRp) 2781 21 + 12592 15371 - no - - - - 2 0 12592..12615:+,12615..15371:+ 12597..12620:+,12620..15376:+ -
ERR3013323.coronaspades.NODE_1_length_28159_cluster_1_candidate_1_domains_38 28159 FAIL NC_003436 mat_peptide RNA-dependent_RNA_polymerase:GBSEP:putative_coronavirus_nsp9_(RdRp) 2781 21 + 12723 15502 - no - - - - 2 0 12723..12746:+,12746..15502:+ 12597..12620:+,12620..15376:+ -
ERR3013322.coronaspades.NODE_1_length_28076_cluster_1_candidate_1_domains_37 28076 FAIL NC_003436 mat_peptide RNA-dependent_RNA_polymerase:GBSEP:putative_coronavirus_nsp9_(RdRp) 2781 21 + 12636 15415 - no - - - - 2 0 12636..12659:+,12659..15415:+ 12597..12620:+,12620..15376:+ -
[..]
Emphasis on the PASS/FAIL column. @rcedgar , how many RdRP's did you end up considering? I get the following numbers:
$ find | grep ftr$ | xargs grep RNA-dep | grep PASS |wc -l
1288
$ find | grep ftr$ | xargs grep RNA-dep | grep FAIL |wc -l
2424
(I believe this includes multiple RdRP's per accession)
I got 2875 RdRPs in the last round where multiple RdRP's per assembly were discarded.
To @rchikhi's question "This is what we'd like to plot, right?". Yes, this is the type of information we need. However, that looks like VADR output, which we we do not want to use for the genome structure annotation. The concern here is that VADR annotates by comparison with the nearest RefSeq, and it seems likely it will degrade and perhaps fail to identify genes as we get into the more novel, and hence more interesting, genomes. I believe more robust approach for this task is to use a generic ORF-finder and run HMMs on those ORFs directly, bypassing VADR. It's a complementary approach, and IMO better for genome structure annotation, especially of the more distant genomes we're trying to figure out like the Espy's.
Actually, if I got more RdRP's (2876) from the PFAM alignments than VADR is reporting (2424 if I understand correctly), then this seems to confirm my point that PFAM on naive ORFs is more robust.
OK thanks for the extensive explanation.
ORF-finding + HMM would be a full-fledged viral annotation pipeline, no? looking e.g. at VGAS (which detects and BLASTs ORFs, see pipeline). There's also VIGOR (ORF+BLAST too) which seems to be doing well on CoV's, compared to GeneMarkS and ZCURVE. So I'm concerned we're reinventing the wheel. Also, if anyone knows (@asl @rcedgar?): why are all these state of the art gene finding tools using ORF finding +BLAST and not ORF finding + HMM?
Looks like I can answer my own question partially, by quoting the VIGA article:
"For a more accurate protein function prediction, HMMER [48] can be executed to predict functions according to Hidden Markov Models"
Sounds like VIGA implements what @rcedgar suggested in #226 (comment)
No, on the contrary we should use very simple naive ORF-finder like emboss getorf. We don't need to worry about frameshifts because they don't occur inside the conserved regions of the domains, and we don't need to extend the ORFs to the boundaries, so a very naive algorithm is exactly what we need because no understanding of the genome is required. The term "ORF" is overloaded -- usually "open reading frame" means a maximal contiguous set of codons from START to STOP in one frame; this is the meaning I'm intending here. Unfortunately, in viral genomes "ORF" is abused to mean something else, not quite sure how to define it, I think it's the entire coding region for the polyprotein before it is cleaved, but I'm still confused on this point.
Oops, our comments crossed. VIGA might be fine, don't know it.
Yep crossed comments! I think we both agree on the meaning of ORFs. (didn't know virologists changed its meaning - might be worth clarifying with a specialist. A few joined #serratus a while back!)
For the sake of minimizing efforts, I'll start with VIGA, and roll my own pipeline only if needed.
"why are all these state of the art gene finding tools using ORF finding +BLAST and not ORF finding + HMM?" It's been a while since I looked closely at homology detection and I don't know what is considered state of the art for gene finding. BLAST is faster and there are many more known proteins than there are PFAM HMMs. HMMs are better for finding very diverged domains, but for gene finding, you generally don't want very deep homology detection because you only find highly conserved domains (another overloaded term...) and finding a very diverged domain does not identify a gene in the usual sense, only a fold or shorter structural motif; the function may be very different in different genes sharing the same domain or motif. SH2 domain is a classic example.
In our case, we need HMMs because viruses sequences evolve very fast while maintaining function, even within Cov I've seen pairs of RdRP's at ~20% a.a. identity which is getting to the limit of blastp detection. With typical genes in bacteria or eukaryotes, the function has often changed at this divergence, in viruses very often not.
ah, great to know!!
OK Viga was a pain to install and had too many other unwanted features. I'm rolling my own using getorf. Two questions:
- Which HMM database to use?
Darth uses: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_2.0/Pfam-A.SARS-CoV-2.hmm
Viga uses: http://dmk-brain.ecn.uiowa.edu/pVOGs/ - @taltman and @asl do
transeq -frame 6
beforehmmsearch
instead of getorf. This doesn't find ORFs but is more sensitive, right?
Answered on slack: 1. Pfam_SARS not pVOGs (which is phage). 2. both would work, transeq seems indeed the more sensitive option.
I don't understand what is going on here. What is the problem? Can someone catch me up?
So the initial motivation for this issue is to do a figure for the paper. For each single-contig assembly, plot where the genes are. Like this:
(image taken from Internet)
But according to #226 (comment) we can't reuse VADR output as it doesn't use a HMM. So I ran hmmsearch
on each contig and parsed HMMER3 output to get the coordinates of domains on contigs. Pretty much like what you did in Darth (in canonicalize_contigs.sh
,and in darth.sh
for ORF1ab but this was only to get the sequence not its coordinates on the original contig).
In the end I got an output like that for each contig for all assemblies:
start_coord_on_contig, end_coord_on_contig, domain_name, contig_name, is_complete_domain
[..]
(13443, 14499, 'CoV_RPol_N', 'SRR11939968.coronaspades.NODE_1_length_29527_cluster_1_candidate_1_domains_40_1', True),
(14598, 16065, 'RdRP_1', 'SRR11939968.coronaspades.NODE_1_length_29527_cluster_1_candidate_1_domains_40_1', True),
(17238, 17913, 'Viral_helicase1', 'SRR11939968.coronaspades.NODE_1_length_29527_cluster_1_candidate_1_domains_40_1', True),
[..]
So with this info now in hand, I'm on track for plotting the structure for all the genomes of interest (OTUs representatives, etc)
That's right, VADR doesn't use a HMM. It uses a much more sensitive CM, and it uses BlastX (or, optionally, a HMM) to confirm in amino-acid space the hit in nucleotide space.
By using that approach, VADR was able to find an RdRP in SRR057858 whereas the Pfam search missed it (as mentioned last night on Slack). So it is erroneous to assume that the VADR output is not of use.
Are you using the ordering and orientation of the contigs as canonicalize_contigs.sh does? Otherewise, for fragmented genomes, the domain ordering can be all over the place relative to the "expected" CoV domain organization.
I was planning on also including the UTR annotations, and possibly other Rfam findings.
Note that in the example above, it is showing the entire polyproteins, and not the maturation protein domains.
By using that approach, VADR was able to find an RdRP in SRR057858 whereas the Pfam search missed it (as mentioned last night on Slack). So it is erroneous to assume that the VADR output is not of use.
@taltman Something is not right here. coronaSPAdes was able to find RdRP on the edges of the graph using HMM alignments. We basically did transeq -6
on the assembly result and again there was am HMM match. So, looks like something is not correct inside the annotation pipeline, so the RdRP alignment got missed?