pavlin-policar/openTSNE

Question on initialization

sbembenek18 opened this issue · 4 comments

The default initialization is PCA -- is that correct? So, is it using the top 50 PCs for the TSNE embedding?
If I wanted to just run my data as is -- what initialization would allow for this?

thanks!

That's right -- the default initialization is PCA. However, t-SNE embeds the data into 2D, so we take only the top 2 principal components of the data matrix and use them as the initialization for the embedding. This only sets the starting positions of the points in the 2D embedding; it is not the input to the t-SNE algorithm. openTSNE uses the full data matrix as input, so if you want to do any preprocessing, e.g., taking only the top 50 PCs and using those, you'll have to do this yourself.

So, to answer your question, if you want to construct a t-SNE embedding for your data as is, openTSNE does this by default.

OK. So, given a data matrix with features, openTSNE, as its default initialization, calculates the PCs, then takes only the top 2 PCs for initialization. After initialization, the full data matrix with the original (non-PC) features is used to perform the embedding.

If I actually wanted to use, e.g., the first 50 PCs as my input features for the embedding, I would simply calculate them ahead of time and pass them to openTSNE. And to avoid having openTSNE calculate the PCs again, I would (as you showed in '04_large_data_sets') initialize with:

init = openTSNE.initialization.rescale(X[:, :2])

and then use:

openTSNE.TSNE(initialization=init, ...)

To be sure, the parameter n_components is the dimension of the embedding space for t-SNE, and your PCA initialization has to use this same number of PCs as well.

Is this correct?

Thanks!

That's all spot on!
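
In case it helps, here is a rough sketch of the workflow described above. It is only an illustration under a few assumptions: the data is random placeholder data, the PCs are computed with a plain NumPy SVD (any PCA routine you prefer would do), and the helper names `pca_scores` and `embed` are made up for this example and are not part of openTSNE. The only openTSNE calls used are the two already mentioned in this thread, plus `fit`.

```python
import numpy as np


def pca_scores(X, n_pcs):
    """Project X onto its top n_pcs principal components (plain NumPy SVD).

    Hypothetical helper for this sketch; not part of openTSNE.
    """
    X_centered = X - X.mean(axis=0)
    # Singular values come back in decreasing order, so the first columns
    # of U * S are the PCs with the largest variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return U[:, :n_pcs] * S[:n_pcs]


def embed(X_pca, n_components=2):
    """Run t-SNE on precomputed PCs, reusing the first PCs as initialization.

    Hypothetical helper for this sketch; requires openTSNE to be installed.
    """
    import openTSNE  # imported here so the NumPy-only part runs without it

    # Starting positions: the first n_components PCs, rescaled as in the thread.
    init = openTSNE.initialization.rescale(X_pca[:, :n_components])
    # Input to t-SNE: all precomputed PCs, not just the ones used for init.
    return openTSNE.TSNE(n_components=n_components, initialization=init).fit(X_pca)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))  # placeholder data: 200 samples, 100 features

X_pca = pca_scores(X, n_pcs=50)  # preprocessing done by hand: top 50 PCs
print(X_pca.shape)

# embedding = embed(X_pca)  # uncomment with openTSNE installed
```

The point of the split is that the 50 PCs are the input features, while only the first `n_components` of them serve as starting positions, so openTSNE does not need to compute a PCA of its own.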

Thanks!