ababaian/serratus

Extracting genome structure [from Darth, or not]

rchikhi opened this issue · 18 comments

I think the file [accession].darth/[accession]/[accession].vadr.pass.tblgives the genome structure:

E.g.

>Feature ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39
227     21516   gene
                        gene    ORF1ab
227     13429   CDS
13429   21516
                        product ORF1ab polyprotein
                        exception       ribosomal slippage
                        protein_id      ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39_1
227     13444   CDS
                        product ORF1a polyprotein
                        protein_id      ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39_2
227     766     mat_peptide
                        product leader protein
                        protein_id      ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39_1
227     766     mat_peptide
                        product leader protein
                        protein_id      ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39_2
767     2680    mat_peptide
                        product nsp2
                        protein_id      ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39_1
[..]
13429   16197
                        product RNA-dependent RNA polymerase
                        protein_id      ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39_1
13403   13441   mat_peptide
                        product nsp11
                        protein_id      ERR4238643.coronaspades.NODE_1_length_29820_cluster_1_candidate_1_domains_39_2
[..]

This is what we'd like to plot, right?

This information is present in another format in the .vadr.ftr files. For RdRP's, over all accessions, it looks like:

$ find | grep ftr$ | xargs grep RNA-dep
[..]
ERR3013320.coronaspades.NODE_1_length_28002_cluster_1_candidate_1_domains_38  28002  PASS  NC_003436  mat_peptide  RNA-dependent_RNA_polymerase:GBSEP:putative_coronavirus_nsp9_(RdRp)            2781   21    +   12571  15350        -  no        -      -        -      -    2    0  12571..12594:+,12594..15350:+  12597..12620:+,12620..15376:+  -
ERR3013321.coronaspades.NODE_1_length_28028_cluster_1_candidate_1_domains_37  28028  PASS  NC_003436  mat_peptide  RNA-dependent_RNA_polymerase:GBSEP:putative_coronavirus_nsp9_(RdRp)            2781   21    +   12592  15371        -  no        -      -        -      -    2    0  12592..12615:+,12615..15371:+  12597..12620:+,12620..15376:+  -
ERR3013323.coronaspades.NODE_1_length_28159_cluster_1_candidate_1_domains_38  28159  FAIL  NC_003436  mat_peptide  RNA-dependent_RNA_polymerase:GBSEP:putative_coronavirus_nsp9_(RdRp)            2781   21    +   12723  15502        -  no        -      -        -      -    2    0  12723..12746:+,12746..15502:+  12597..12620:+,12620..15376:+  -
ERR3013322.coronaspades.NODE_1_length_28076_cluster_1_candidate_1_domains_37  28076  FAIL  NC_003436  mat_peptide  RNA-dependent_RNA_polymerase:GBSEP:putative_coronavirus_nsp9_(RdRp)            2781   21    +   12636  15415        -  no        -      -        -      -    2    0  12636..12659:+,12659..15415:+  12597..12620:+,12620..15376:+  -
[..]

Emphasis on the PASS/FAIL column. @rcedgar , how many RdRP's did you end up considering? I get the following numbers:

$ find | grep ftr$ | xargs grep RNA-dep | grep PASS |wc -l
1288
$ find | grep ftr$ | xargs grep RNA-dep | grep FAIL |wc -l
2424

(I believe this includes multiple RdRP's per accession)

I got 2875 RdRPs in the last round where multiple RdRP's per assembly were discarded.

To @rchikhi's question "This is what we'd like to plot, right?". Yes, this is the type of information we need. However, that looks like VADR output, which we we do not want to use for the genome structure annotation. The concern here is that VADR annotates by comparison with the nearest RefSeq, and it seems likely it will degrade and perhaps fail to identify genes as we get into the more novel, and hence more interesting, genomes. I believe more robust approach for this task is to use a generic ORF-finder and run HMMs on those ORFs directly, bypassing VADR. It's a complementary approach, and IMO better for genome structure annotation, especially of the more distant genomes we're trying to figure out like the Espy's.

Actually, if I got more RdRP's (2876) from the PFAM alignments than VADR is reporting (2424 if I understand correctly), then this seems to confirm my point that PFAM on naive ORFs is more robust.

OK thanks for the extensive explanation.
ORF-finding + HMM would be a full-fledged viral annotation pipeline, no? looking e.g. at VGAS (which detects and BLASTs ORFs, see pipeline). There's also VIGOR (ORF+BLAST too) which seems to be doing well on CoV's, compared to GeneMarkS and ZCURVE. So I'm concerned we're reinventing the wheel. Also, if anyone knows (@asl @rcedgar?): why are all these state of the art gene finding tools using ORF finding +BLAST and not ORF finding + HMM?

Looks like I can answer my own question partially, by quoting the VIGA article:

"For a more accurate protein function prediction, HMMER [48] can be executed to predict functions according to Hidden Markov Models"

Sounds like VIGA implements what @rcedgar suggested in #226 (comment)

No, on the contrary we should use very simple naive ORF-finder like emboss getorf. We don't need to worry about frameshifts because they don't occur inside the conserved regions of the domains, and we don't need to extend the ORFs to the boundaries, so a very naive algorithm is exactly what we need because no understanding of the genome is required. The term "ORF" is overloaded -- usually "open reading frame" means a maximal contiguous set of codons from START to STOP in one frame; this is the meaning I'm intending here. Unfortunately, in viral genomes "ORF" is abused to mean something else, not quite sure how to define it, I think it's the entire coding region for the polyprotein before it is cleaved, but I'm still confused on this point.

Oops, our comments crossed. VIGA might be fine, don't know it.

Yep crossed comments! I think we both agree on the meaning of ORFs. (didn't know virologists changed its meaning - might be worth clarifying with a specialist. A few joined #serratus a while back!)
For the sake of minimizing efforts, I'll start with VIGA, and roll my own pipeline only if needed.

"why are all these state of the art gene finding tools using ORF finding +BLAST and not ORF finding + HMM?" It's been a while since I looked closely at homology detection and I don't know what is considered state of the art for gene finding. BLAST is faster and there are many more known proteins than there are PFAM HMMs. HMMs are better for finding very diverged domains, but for gene finding, you generally don't want very deep homology detection because you only find highly conserved domains (another overloaded term...) and finding a very diverged domain does not identify a gene in the usual sense, only a fold or shorter structural motif; the function may be very different in different genes sharing the same domain or motif. SH2 domain is a classic example.

In our case, we need HMMs because viruses sequences evolve very fast while maintaining function, even within Cov I've seen pairs of RdRP's at ~20% a.a. identity which is getting to the limit of blastp detection. With typical genes in bacteria or eukaryotes, the function has often changed at this divergence, in viruses very often not.

ah, great to know!!

OK Viga was a pain to install and had too many other unwanted features. I'm rolling my own using getorf. Two questions:

  1. Which HMM database to use?
    Darth uses: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_2.0/Pfam-A.SARS-CoV-2.hmm
    Viga uses: http://dmk-brain.ecn.uiowa.edu/pVOGs/
  2. @taltman and @asl do transeq -frame 6 before hmmsearch instead of getorf. This doesn't find ORFs but is more sensitive, right?

Answered on slack: 1. Pfam_SARS not pVOGs (which is phage). 2. both would work, transeq seems indeed the more sensitive option.

I don't understand what is going on here. What is the problem? Can someone catch me up?

So the initial motivation for this issue is to do a figure for the paper. For each single-contig assembly, plot where the genes are. Like this:
image
(image taken from Internet)

But according to #226 (comment) we can't reuse VADR output as it doesn't use a HMM. So I ran hmmsearch on each contig and parsed HMMER3 output to get the coordinates of domains on contigs. Pretty much like what you did in Darth (in canonicalize_contigs.sh ,and in darth.sh for ORF1ab but this was only to get the sequence not its coordinates on the original contig).
In the end I got an output like that for each contig for all assemblies:

start_coord_on_contig,   end_coord_on_contig,  domain_name,  contig_name,    is_complete_domain
[..]
(13443,  14499,  'CoV_RPol_N',       'SRR11939968.coronaspades.NODE_1_length_29527_cluster_1_candidate_1_domains_40_1',  True),
(14598,  16065,  'RdRP_1',           'SRR11939968.coronaspades.NODE_1_length_29527_cluster_1_candidate_1_domains_40_1',  True),
(17238,  17913,  'Viral_helicase1',  'SRR11939968.coronaspades.NODE_1_length_29527_cluster_1_candidate_1_domains_40_1',  True),
[..]

So with this info now in hand, I'm on track for plotting the structure for all the genomes of interest (OTUs representatives, etc)

That's right, VADR doesn't use a HMM. It uses a much more sensitive CM, and it uses BlastX (or, optionally, a HMM) to confirm in amino-acid space the hit in nucleotide space.

By using that approach, VADR was able to find an RdRP in SRR057858 whereas the Pfam search missed it (as mentioned last night on Slack). So it is erroneous to assume that the VADR output is not of use.

Are you using the ordering and orientation of the contigs as canonicalize_contigs.sh does? Otherewise, for fragmented genomes, the domain ordering can be all over the place relative to the "expected" CoV domain organization.

I was planning on also including the UTR annotations, and possibly other Rfam findings.

Note that in the example above, it is showing the entire polyproteins, and not the maturation protein domains.

asl commented

By using that approach, VADR was able to find an RdRP in SRR057858 whereas the Pfam search missed it (as mentioned last night on Slack). So it is erroneous to assume that the VADR output is not of use.

@taltman Something is not right here. coronaSPAdes was able to find RdRP on the edges of the graph using HMM alignments. We basically did transeq -6 on the assembly result and again there was am HMM match. So, looks like something is not correct inside the annotation pipeline, so the RdRP alignment got missed?