heche-psb/wgd

Differentiation between true wgd and artifact

Closed this issue · 19 comments

Hi, Thanks for the tool and its proper documentation!

I successfully run the analysis for my paralogs and I got the results attached from wgd ksd, wgd syn. However, I am confused about the peak consideration. As you can see in the file with the paranome ks distribution, I am getting two peaks at 0.04 and 0.5. and I am also attaching the anchor pair file and wgd peak result of these anchor pairs. I think 0.5 ks value is more reliable as compared 0.05, but I am not able to figure this out how should I negiate the peak at 0.05 as false positive as it can also be due to transposon activity and subgenome divergence. Please help!
WGD_paralogs.pdf
final_true_cds_WH_single.fasta.tsv.ksd.pdf
AnchorKs_PeakCI_final_true_cds_WH_single.fasta.tsv.ks.tsv_node_weighted.pdf

Thanks
Manohar

Hi, how does the dotplot look like? If this species really experiences a very recent WGD, we expect to see a bulk of long collinear segments throughout most of its chromosomes.

Hi, thanks for the suggestion. I am attaching both the dot plot and the ks value dot plot here. However, the genome is not at the chromosomal level. can you able to figure out anything from this? and also, is there any way we can reduce the noise in these dot plots?
final_true_cds_WH_single.fasta-vs-final_true_cds_WH_single.fasta.dot.pdf

final_true_cds_WH_single.fasta-vs-final_true_cds_WH_single.fasta_Ks.dot.pdf
Thanks
Manohar

Hi, a quick operational note, that you may set the option --minseglen 100000 and --mingenenum 50 to filter out short collinear segments in terms of the length in base and the number of residing genes, so as to drop those noisy dots, as well as setting --minlen 1000000 to skip those short scaffolds at the very beginning.

Hi, sure, I will use the --mingle and --mingenenum parameters. However --minlen is already set to 10% of the maximum scaffold length, right? so according to that it will be more than 1Mb, in my case. I will revert with new dot plots. Thanks!

Yes, the default --minlen is the 10% of the longest scaffold.

So I rerun wgd syn, and I am getting almost a similar plot. Can we identify any wgd event from this? (PS: the species is tetraploid)
final_true_cds_WH_single.fasta-vs-final_true_cds_WH_single.fasta.dot.pdf

The Ks peak around Ks ~ 0 is often observed in transcriptome and fragmentary genome assemblies. The former is usually casued by the redundancy introduced from alternative splicing while the latter is usually caused by the misassembly that two fragments representing the same scaffold failed to be merged but remained as two highly similar fragments instead. One thing you may check is the location of those anchor pairs with Ks < 0.1. They might all come from small fragments instead of those longer ones, which will be indicative of problematic assembly. A second thing you may check is to annotate the transposable elements (TEs) among anchor pairs, by simply marking those anchor pairs identified as TEs with another color, so as to rule out the possibility of TEs-derived peak. Another practice you may try, is to first divide your input cds file as sub-genome A and sub-genome B, and then construct their respective Ks distributions and inspect the presence of Ks peak ~ 0. If at sub-genome level the Ks peak ~ 0 is gone, you may conclude that the incipient Ks peak ~ 0 actually marked the divergence of sub-genomes.

Thanks for the quick and constructive reply. I will check the origin of Anchor pairs. However, about the sub-genome part, my assembly is not at the chromosome level, but can we divide our input CDS or proteins at the Sub-genome level without dividing the genome, which is not possible for scaffold-level genome assembly?

As to the division of sub-genomes, you may refer to the latest paper https://doi.org/10.1016/j.tig.2024.03.008. I think k-mer based methods are possible for your assembly.

Thank you for the reply. However, most of the tool require a chromosome level genome assembly including K-mer based tool subPhaser. Further, to check the reason for the peak at Ks value 0.05, I think that could also be due to the presence of high segmental duplication (collinearity), which is around 60% in my genome using MCscanx and also evident from the anchor pair graph (attached above). And the high colleniarty among the subgenome is quit logical also?

Just my opinion, when it comes to tetraploid, if the subgenome can not be properly parted or accommodated, it should be claimed with caution for high segmental duplication observed at Ks peak close to zero, which is for sure a possibility but more a speculation to validate after ruling out other alternative scenarios.

The Ks peak around Ks ~ 0 is often observed in transcriptome and fragmentary genome assemblies. The former is usually casued by the redundancy introduced from alternative splicing while the latter is usually caused by the misassembly that two fragments representing the same scaffold failed to be merged but remained as two highly similar fragments instead. One thing you may check is the location of those anchor pairs with Ks < 0.1. They might all come from small fragments instead of those longer ones, which will be indicative of problematic assembly. A second thing you may check is to annotate the transposable elements (TEs) among anchor pairs, by simply marking those anchor pairs identified as TEs with another color, so as to rule out the possibility of TEs-derived peak. Another practice you may try, is to first divide your input cds file as sub-genome A and sub-genome B, and then construct their respective Ks distributions and inspect the presence of Ks peak ~ 0. If at sub-genome level the Ks peak ~ 0 is gone, you may conclude that the incipient Ks peak ~ 0 actually marked the divergence of sub-genomes.

Thanks for the suggestion, What output file from wgd syn, I should use to check the Ks value < 0.1?

Hi, you may use the *.anchors.ks.tsv file to inspect those anchor pairs with Ks < 0.1

Hi,
So,
I am checking for the mixture model clustering for anchor ks pairs, and I got the following result. I am confused about which component to consider for a true wgd
Original_AnchorKs_GMM_Component4_node_weighted_Lognormal.pdf

Hi, the component 1 warrants your further inspect for TEs and residing segments while the component 0 is more clearly an evidence of WGD and component 3 is more likely the result of over-parameterization as to the excessive number of components.

Thank you. what would be the best way to inspect TE at component 1?

Hi, I suggest that first performing an extra TE annotation on those anchor pairs and second marking those detected TEs distinctly in the Ks distribution such that you could have a basic idea of the TE impact.

But my question was how TE annotation can be done in gene pairs (CDS sequences)?

Hi, I think you can resort to softwares that are specifically designed for TE annoation.