Refound Genes Exhibit In-Frame Stops and Frameshifts: Seeking User-Friendly Controls and Improved Understanding for Downstream Analysis
NickJD opened this issue · 8 comments
This issue post is for prosperity and has already been reported.
I have been using Panaroo for a while now but I am currently trying to take the DNA/Protein sequences reported in the combined_DNA/protein_CDS.fasta and pan_genome_reference.fa files and do some further downstream analysis with them using tools such as CDHIT. When doing this I noticed quite a high number of the refound genes had inframe stops (*). I am aware of the basics of how the refound system works (using BLAST to find unannotated genes present in most genomes but not in some - possibly due to annotation/assembly error etc).
I know from reading some of the issues on GitHub that this has partly already been raised (see below image and #65) but I have had a couple of runs which have resulted in some output I can’t quite figure out.
Below is an example of 5 refound genes which exhibit 2 different types of frameshift mutations. The first 4 have a removal of a ‘G’ and the last one has the addition of 2 ‘G’s. Both of these ‘mutations’ result in different gene lengths from the canonical genes (1,116 - canonical, 1,115 - missing ‘G’, 1,118 - additional ‘G’s). This is not a surprise as BLAST is gap/frame/codon agnostic.
This has resulted in the pan_genome_reference.fa file reporting the 24_refound_7533 sequence as the reference for this group/cluster but due to the frameshift, the amino acid sequence has multiple inframe stops and a very different sequence compared to the other sequences in this cluster.
As can be see in the above image that where the frameshifts occur, it completely breaks the continuity of the refound sequence in the amino acid space. To get these sequences, I got the IDs from the final_graph.gml file for the group/cluster, extracted the DNA sequences from ‘combined_DNA_CDS.fasta’ and converted to amino acid using table 11.
I ran (default parameters) panaroo on another set of genomes to see if this issue persisted - below are some examples.
>0_refound_0
IDLTFLWVLFRMSYKSSTLAWFLVNQKDV*X
>0_refound_1
MQNNRLTFMERLSERLTSVEAILKKLDPIESLLERIALLEKNIYX
Not every refound gene is like this but many are. Why do the refound genes end with an X? - Is this the uncertainty codon -> amino acid because of the frameshift making the sequence no longer modular 3? Most of the stops were like the first one *X at the end of the predicted proteins but a few had them scattered throughout the protein sequence.
To me, there is no obvious pattern.
>0_refound_22
MHKAPTFTSRGFVRKRKKGCILTHLQRIGTGHVKTASGYVSENARYRVVRREGCPLRCRCFKAKGNRTIELNHRLRKYRQNAKELLCSEEGLKHRGQSCIEPEAVFEQMKYNMNYKRFRHFGKDKVFMDFAFFAVAFNIKKMCAKMAKEGIDWLIKRFYELTVAIFRCCEHISKYCSLKX <-- X at the end no *
>0_refound_23
MDALFQTQSNELSTTTQKEKLFKDAERQWNTLSTEKGAPGRKVDFEIYKSILYNLEIRKVYLDSKLSLEKFSSIV <-- no X at the end
>0_refound_24
CVGKMPDMEGLRKETVL*SVMDL*WRKCSRRNLWG*SLLLRIKSERTIVATGEYTQEVRYYVTSLDNTRPEKIASAIRQHWSIGNNLHWQLDVTFREDYSKKVKNAAGNFSVATKMALTTLKNEKTTKGSMNLKRLKAGWDENYLSQLLQDNNF*X <-- X at the end with an , no internal X's next to the *
I have a few concerns regarding this. Not only do these frameshifted/pseudogenes inflate the number of core/soft-core etc gene families, but I do not see this being reported to the user.
panaroo -i GC*/GC*.gff3 --clean-mode strict -a core --aligner clustal --core_threshold 0.98 --remove-invalid-genes -t 30 -o xxx_Panaroo>
Core genes (99% <= strains <= 100%) 2501
Soft core genes (95% <= strains < 99%) 143
Shell genes (15% <= strains < 95%) 2887
Cloud genes (0% <= strains < 15%) 7414
Total genes (0% <= strains <= 100%) 12945
panaroo -i GC*/GC*.gff3 --clean-mode strict -a core --aligner clustal --core_threshold 0.98 --remove-invalid-genes --search_radius 0 -t 30 -o xxx_Panaroo2>
Core genes (99% <= strains <= 100%) 2366
Soft core genes (95% <= strains < 99%) 213
Shell genes (15% <= strains < 95%) 2954
Cloud genes (0% <= strains < 15%) 7639
Total genes (0% <= strains <= 100%) 13172
– I think there should be a way to formally and completely turn off the Refound gene option for those who only want to rely on the annotations they provide. The current options below do not allow for this and still return Refound genes.
Refind:
--search_radius SEARCH_RADIUS
the distance in nucleotides surronding (**<-Spelling Error**) the neighbour of an accessory gene in which to search for it
--refind_prop_match REFIND_PROP_MATCH
the proportion of an accessory gene that must be found in order to consider it a match
Lastly, those who use these gene sequences, from each group/cluster for a phylogenetic analysis (or any other analysis) would potentially get very different results if they were to use the DNA/Amino Acid sequences without being aware. Additionally, I think refound genes with insertions are likely to be chosen as the group representatives as they are the longest of the sequences and often the longest are chosen by clustering algorithms.
Other points:
-
I see that Table 11 is used as default to convert the DNA sequences to their protein counterparts, but the protein sequences you report can sometimes have another amino acid other than ‘M’ at their start. Even if a non-ATG codon is used as a start codon, the amino acid sequence always starts with ‘M’ (see - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3311535/).
-
The Gene Families which contain Refound genes should be labelled as such, especially if they are pushing these families into the soft-core and core categories. As many refound genes may be pseudogenes, likely, many users do not want them to be part of the 'core' etc. I don't think all users know that the gene collections they are providing Panaroo in the form of the PROKKA annotations are not the only genes forming their pangenome.
This has now been addressed in v1.4.*
Hi gtonkinhill,
Thanks for including this in v1.4.0. Could I please clarify, is this addressed by including the flag --search_radius 0
to avoid the inclusion of re-found genes?
Thanks!
Hi Fiona,
It's enabled using the new option --refind_strict
.
Thanks for seeking clarification. I will update the documentation to include this option.
Thank you!
Hi gtonkinhill,
I gave this a try but there are still refound genes included in my output; have I missed something?
panaroo -t 20 -i *.gff -o pan_out/ --clean-mode strict --refind_strict
spotted in stdout:
...
Number of searches to perform: 2379297
Searching...
209it [07:15, 2.08s/it]
translating hits...
removing by consensus...
Updating output...
Number of refound genes: 2341
collapse gene families with refound genes...
Processing depth: 1
...
grep "refound" pan_out/gene_presence_absence.csv | wc -l
1376
panaroo --version
panaroo 1.4.2
Thanks for your help!
Hi Fiona,
The --refind_strict
option just excludes refound genes that are not 'valid'. That is, it excludes genes with incorrect lengths or premature stop codons which may be the result of misassemblies.
As you're the second person who is interested in having the option to completely turn off the refinding step. I will look at adding this to the next release.
Hi Greg,
Ah I see- that makes sense! Thanks for considering adding this as an option.
The latest release (v1.5.0) includes an option to turn off re-finding completely. Thanks for the suggestion!