Which method is more suitable for dimensionality reduction and visualization of the presence/absence matrix
Hello, thank you for providing these useful tools for studying bacterial genomes, which have helped me a lot.
While using panstripe, I noticed that you used the t-SNE dimensionality reduction algorithm to visualize the gene presence/absence matrix. I also noticed that some previous papers have used PCoA or PCA to visualize their results.
In my research, I want to visualize the differences in the pan-genome between two groups. Because my background in this area is limited, I do not know which method is more suitable for reducing the dimensionality of this 0/1 matrix. I compared the results of the three algorithms on my data and they differed significantly. I have reviewed all the issues and the papers mentioned in the documentation, but have not found any discussion of this topic. Therefore, I would like to ask for your advice here.
Hi,
I would usually use these techniques as exploratory tools to investigate the structure present within the presence/absence matrix. It's usually best to use them as an initial step and follow up with additional analyses. As PCA is a linear technique and t-SNE is non-linear, I would not expect them to necessarily give the same result.
As a non-linear technique, t-SNE can be useful for exploring data across various scales within a single plot. However, because t-SNE focuses on preserving local distances, the distances between well-separated clusters in the plot are not reliable, so the approach requires careful interpretation. A helpful description is available here.
In contrast, PCA preserves larger distances between points in the plot. Multiple Correspondence Analysis (MCA) is an alternative to PCA designed for categorical data. I usually find PCA to be sufficient, but you could also investigate whether MCA works better for your matrix.
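To make the comparison concrete, here is a minimal sketch (not panstripe's own code) that runs PCA and PCoA on a 0/1 matrix using only NumPy. The toy matrix, the group labels, the function names and the choice of Jaccard distance for PCoA are all assumptions made for illustration; substitute your own presence/absence matrix with genomes as rows and genes as columns.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """PCA via SVD: project genomes (rows) onto the top principal components."""
    Xc = X - X.mean(axis=0)  # centre each gene column
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)  # fraction of variance per component
    return scores, explained[:n_components]


def jaccard_distances(X):
    """Pairwise Jaccard distance between the rows of a 0/1 matrix."""
    Xi = X.astype(int)
    inter = Xi @ Xi.T  # genes present in both genomes
    counts = Xi.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter
    # Two genomes with no genes at all are treated as identical (similarity 1).
    sim = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    return 1.0 - sim


def pcoa_scores(D, n_components=2):
    """PCoA (classical MDS) from a symmetric distance matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # centring matrix
    B = -0.5 * J @ (D ** 2) @ J  # double-centred Gram matrix
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:n_components]
    # Jaccard distances are not guaranteed Euclidean, so clip negative eigenvalues.
    return evecs[:, order] * np.sqrt(np.clip(evals[order], 0, None))


# Hypothetical toy data: two groups of 10 genomes and 200 genes, where the
# first 40 genes have flipped presence frequencies in group B.
rng = np.random.default_rng(0)
n_per_group, n_genes = 10, 200
p_a = rng.uniform(0.1, 0.9, n_genes)
p_b = p_a.copy()
p_b[:40] = 1.0 - p_b[:40]
X = np.vstack([
    rng.random((n_per_group, n_genes)) < p_a,
    rng.random((n_per_group, n_genes)) < p_b,
]).astype(int)
groups = np.array(["A"] * n_per_group + ["B"] * n_per_group)

pca_xy, pca_var = pca_scores(X)
D = jaccard_distances(X)
pcoa_xy = pcoa_scores(D)

for name, xy in [("PCA", pca_xy), ("PCoA (Jaccard)", pcoa_xy)]:
    for g in ("A", "B"):
        print(name, g, "mean of axis 1:", round(float(xy[groups == g, 0].mean()), 3))
```

The two sets of coordinates can then be plotted side by side and coloured by group. PCA here works on centred gene columns (Euclidean geometry), whereas PCoA depends entirely on the distance you choose, which is one reason the methods can give different pictures of the same matrix. As above, any apparent separation is a starting point for follow-up analyses rather than a result in itself.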