Low single copy OGs
francicco opened this issue · 5 comments
Hi
I'm running a fairly large study on 63 species of butterflies, mainly Heliconiini. The phylogenetic framework spans ~70 My, but most of it falls within 30 My.
It seems I retrieve a very low number of single-copy orthologous groups (scOGs), ~400. This number varies depending on how I treat missing genes, but it always stays very low. My guess is that many OGs with multiple paralogs contain false paralogs. This is an example:
Two clearly different proteins.
How can I fix this? Is there a parameter controlling the granularity, like the inflation parameter in MCL? Any advice is welcome.
Seeing these results makes me panic a bit.
Cheers
F
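For reference, the way the scOG count depends on the treatment of missing genes can be sketched as below. This is only an illustration under assumptions: the `ogs` dictionary, the species labels, and the `min_fraction` tolerance are all hypothetical and do not reflect the tool's actual output format or options.

```python
from collections import Counter

def is_single_copy(og_species, n_species, min_fraction=1.0):
    """og_species: one species label per gene in the OG (hypothetical format)."""
    counts = Counter(og_species)
    # No species may contribute more than one gene.
    no_duplicates = all(c == 1 for c in counts.values())
    # At least min_fraction of all species must be present.
    enough_species = len(counts) >= min_fraction * n_species
    return no_duplicates and enough_species

def count_scogs(ogs, n_species, min_fraction=1.0):
    return sum(is_single_copy(sp, n_species, min_fraction) for sp in ogs.values())

# Toy data with 4 species in total (made up for illustration).
ogs = {
    "OG_1": ["sp_A", "sp_B", "sp_C", "sp_D"],          # complete, single copy
    "OG_2": ["sp_A", "sp_B", "sp_C"],                  # single copy, sp_D missing
    "OG_3": ["sp_A", "sp_A", "sp_B", "sp_C", "sp_D"],  # sp_A duplicated
}

strict = count_scogs(ogs, n_species=4, min_fraction=1.0)    # no missing species allowed
relaxed = count_scogs(ogs, n_species=4, min_fraction=0.75)  # one missing species allowed
print(strict, relaxed)
```

Relaxing `min_fraction` only recovers OGs with missing species. It does nothing for OGs such as `OG_3`, where a species carries an extra copy, which is the case suspected here.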
Hi,
Two things:
- Indeed, some spurious hits exist. They will be eliminated in the next version, which I'm currently working on. For the time being, you could tighten the e-value threshold of the Diamond search (e.g. 1e-5).
- Low number of scOGs: this is very strange. You might have one or two species in your dataset with high levels of duplicates. Are these proteomes from 'gold genomes' or your own?
These are my genomes.
1e-5 seems pretty high. At the moment I'm using 1e-20, and I was thinking of decreasing it further. These are closely related species, so I expect the e-values to be even lower.
How do I check the level of duplication among species?
F
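One rough way to check the level of duplication per species is to compute, for each species, the fraction of OGs in which it has more than one gene. The sketch below assumes the OGs have already been parsed into a dictionary of species labels (one label per gene); that input format and all names are hypothetical, not the tool's own.

```python
from collections import Counter

def duplication_rate_per_species(ogs):
    """Fraction of OGs containing a species in which it has more than one gene."""
    present = Counter()     # number of OGs in which each species appears
    duplicated = Counter()  # number of OGs in which each species has >1 gene
    for genes in ogs.values():
        for species, n_copies in Counter(genes).items():
            present[species] += 1
            if n_copies > 1:
                duplicated[species] += 1
    return {sp: duplicated[sp] / present[sp] for sp in present}

# Toy data (made up for illustration).
ogs = {
    "OG_1": ["sp_A", "sp_A", "sp_B"],
    "OG_2": ["sp_A", "sp_A", "sp_A", "sp_B"],
    "OG_3": ["sp_A", "sp_B", "sp_B"],
    "OG_4": ["sp_A", "sp_B"],
}

rates = duplication_rate_per_species(ogs)
# List species from most to least duplicated.
for species, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(species, round(rate, 2))
```

A species whose rate stands far above the others would be a candidate for the "high levels of duplicates" scenario, for example because of uncollapsed haplotypes or fragmented gene models in its annotation.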
Hi,
Actually, my previous answer was not very good. Here is a better one.
If the goal is to get scOGs, I would (i) lower the species overlap parameter (from 0.5 to 0.3-0.4) and (ii) increase the number of species required to validate a chimeric protein (from 3 to 40, for instance).
Finally, since some of your species are likely to be closely related, you should increase the k-mer size parameter at step 1 (from 100 to 200-300).
best,
Romain
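To illustrate why lowering a species overlap threshold can yield more scOGs, here is a minimal sketch of a generic species-overlap test at a tree node. It rests on assumptions: the overlap is taken as the number of shared species divided by the size of the smaller child set, and a node is called a duplication when the overlap exceeds the threshold. The tool's actual definition may differ, and all names below are hypothetical.

```python
def species_overlap(left_species, right_species):
    """Shared species divided by the smaller child set (assumed formulation)."""
    left, right = set(left_species), set(right_species)
    if not left or not right:
        return 0.0
    return len(left & right) / min(len(left), len(right))

def is_duplication(left_species, right_species, threshold):
    # Assumption: a node whose overlap exceeds the threshold is treated as a
    # duplication, so its two subtrees become separate orthologous groups.
    return species_overlap(left_species, right_species) > threshold

# Toy node: the two subtrees share 2 of 5 species (made up for illustration).
left = ["sp_A", "sp_B", "sp_C", "sp_D", "sp_E"]
right = ["sp_A", "sp_B", "sp_F", "sp_G", "sp_H"]

overlap = species_overlap(left, right)
split_at_05 = is_duplication(left, right, threshold=0.5)
split_at_03 = is_duplication(left, right, threshold=0.3)
print(overlap, split_at_05, split_at_03)
```

Under these assumptions, the node stays merged at a threshold of 0.5 but is split at 0.3, so a lower threshold produces smaller groups with fewer apparent paralogs.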
Hi Romain,
Thanks a lot for the clarification. I actually lowered the overlap to 0.1, and it works better. I also tightened the e-value to 1e-40 and set -chimeric_nb_sp to 10. I didn't want to touch the k-mer parameter, but I'll try 200 and see.
Best,
F
So, the results improve with a larger k-mer size. I tried 100, 200, 300, and 400. There is very little difference between 300 and 400, so I'm going to use 300.
Thanks a lot
F