thibautjombart/adegenet

Compoplot of DAPC assigns every individual complete membership in a single cluster

mattbareno opened this issue · 4 comments

I'm attempting to run a DAPC and visualize the result with compoplot. I have a VCF file, which I'm converting to a genlight object with vcfR. I then followed the steps laid out here. The pipeline looks like this:

#reads filtered vcf file in as vcf object
vcf <- read.vcfR(file = "[file name].vcf")

#converts to genlight object (usable for adegenet)
x <- vcfR2genlight(vcf)

#variable that stores the number of clusters (K)
y <- 13

#PCA, 100 PCs and 3 DCs, max K=y
grp <- find.clusters(x, max.n.clust=y, n.pca= 100, n.clust=y)

#DAPC
dapc1 <- dapc(x, grp$grp, n.pca = 100, n.da = 3)

compoplot(dapc1, posi="bottomright",
txt.leg=paste("Cluster", 1:y), lab=, ncol=1,
xlab="Posterior Probability of membership to new sub-population",
horiz = TRUE, space = 0, show.lab = TRUE, col=funky(y))

When I set K = 3 and K = 13, I get these results:
[two compoplot images, for K = 3 and K = 13]

Each individual is assigned complete membership in exactly one cluster. I know this is wrong, because I am replicating a past population structure analysis, and it is also biologically implausible.

If you need any other information, please ask. Thanks!

The error is in this part of your code:

#PCA, 100 PCs and 3 DCs, max K=y
grp <- find.clusters(x, max.n.clust=y, n.pca= 100, n.clust=y)

#DAPC
dapc1 <- dapc(x, grp$grp, n.pca = 100, n.da = 3)

You are overfitting the model. You are using 100 PCs to find your clusters and then you are trying to differentiate those clusters using 100 PCs, which will give you a perfect fit. The tutorial explains this.
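One way to see the effect of n.pca for yourself is to fit the DAPC twice and compare the posterior membership probabilities. The sketch below is only an illustration, not your pipeline: it uses glSim() to simulate an unstructured stand-in dataset, and the sizes, the seed, K = 3, and the choice of 10 PCs are all arbitrary assumptions rather than recommendations. On data with no real structure, near-certain assignments for every individual would point to overfitting rather than signal.

```r
library(adegenet)

set.seed(1)

# Stand-in for the genlight object `x` from the post (assumption: 120 diploid
# individuals and 1000 SNPs with no population structure at all).
x <- glSim(120, 1000, n.snp.struc = 0, ploidy = 2, LD = FALSE)

# Infer K = 3 groups from 100 PCs, as in the original call.
grp <- find.clusters(x, n.pca = 100, n.clust = 3)

# DAPC with 100 PCs for 120 individuals (the setting from the post).
dapc_over <- dapc(x, grp$grp, n.pca = 100, n.da = 2)

# DAPC with far fewer PCs (10 is an arbitrary value for illustration).
dapc_small <- dapc(x, grp$grp, n.pca = 10, n.da = 2)

# Highest posterior membership probability per individual under each fit.
# Values pinned at 1 for everyone mean the plot will show solid bars.
summary(apply(dapc_over$posterior, 1, max))
summary(apply(dapc_small$posterior, 1, max))
```

With your own data you would skip the glSim() line and reuse the x produced by vcfR2genlight().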

zkamvar, I assumed it was redundant. Thank you for the clarification!

I understand the issue, but in that case, what is the general guideline for the number of PCs to retain in a DAPC analysis? Keep in mind that this dataset has 120 isolates/individuals.

This is determined through exploratory analysis. Section 4 of the DAPC tutorial discusses this (here is a more up-to-date link: https://github.com/thibautjombart/adegenet/raw/master/tutorials/tutorial-dapc.pdf). Use xvalDapc() to find the optimal number of PCs. We also wrote up a short section on this here: https://grunwaldlab.github.io/Population_Genetics_in_R/DAPC.html#cross-validation-dapc-analysis-of-phytophthora-ramorum-from-forests-and-nurseries
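As a rough sketch of what the cross-validation step can look like, the example below runs xvalDapc() on a simulated stand-in dataset. Everything about the data is an assumption made for illustration (glSim() with 120 diploid individuals, 50 structured SNPs, two groups), as are the seed and the n.pca.max, training.set, and n.rep values. It also assumes no missing genotypes; real data with NAs would need imputing before being passed in as a matrix.

```r
library(adegenet)

set.seed(42)

# Stand-in for the genlight object from the post (assumption: 120 diploid
# individuals, 500 SNPs, 50 of which differ between two groups).
x <- glSim(120, 450, n.snp.struc = 50, ploidy = 2, LD = FALSE)

# Groups to cross-validate against, inferred as in the original pipeline.
grp <- find.clusters(x, n.pca = 100, n.clust = 2)

# Pass the genotypes as a plain matrix of allele counts.
mat <- as.matrix(x)

# Repeatedly hold out 10% of individuals, fit DAPC on the rest for a range
# of PC counts, and score how well the held-out individuals are assigned.
xval <- xvalDapc(mat, grp$grp,
                 n.pca.max = 60,        # upper bound on PCs to try (illustrative)
                 training.set = 0.9,
                 result = "groupMean",
                 n.rep = 30,
                 xval.plot = TRUE)

# Number of PCs with the lowest root mean squared error across replicates.
xval$`Number of PCs Achieving Lowest MSE`

# DAPC refitted on the full data with that number of PCs.
dapc_cv <- xval$DAPC
compoplot(dapc_cv)
```

The groups here come from find.clusters() on the same data, so the assignment success rates are best read as a guide for choosing n.pca rather than as an independent measure of accuracy.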