Ameobea/sprout

Using UMAP instead of t-SNE

Opened this issue · 2 comments

Currently, Sprout uses t-SNE, which makes it harder to see the similarity of niche anime to others. UMAP might be better at preserving distances and at discovering clusters. I'm also not sure whether PaCMAP, LargeVis, or Isomap could do the same thing (probs not TriMAP or ForceAtlas2).
P.S. For reference, other "MAL maps": https://github.com/igfod13/MALmap and https://github.com/platers/MAL-Map

I did try out UMAP for building the atlas visualization. However, for me the results weren't that good or interpretable compared to t-SNE.

Most of the embeddings I wound up with looked like this one: https://github.com/igfod13/MALmap

A big, dense, homogeneous blob without much structure or interest. Although t-SNE does seem to trade off some accuracy, I found that it resulted in a much better-looking embedding that was more interesting to browse around in.
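One way to put a number on that accuracy trade-off, rather than judging the embeddings by eye, is to check how many of each point's nearest neighbors in the original space are still its neighbors in the 2D layout. The sketch below is only an illustration under assumed inputs: `high_dim` and `low_dim` are hypothetical lists of coordinate tuples (one per anime, in the same order), and `neighbor_preservation` is a made-up helper name, not anything from Sprout. It is brute force, so it is only practical for a sample of points.

```python
import math


def knn_indices(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors (brute force)."""
    result = []
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance; ties are broken by index.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        result.append({j for _, j in dists[:k]})
    return result


def neighbor_preservation(high_dim, low_dim, k=5):
    """Average fraction of each point's k nearest neighbors that survive the projection."""
    high = knn_indices(high_dim, k)
    low = knn_indices(low_dim, k)
    overlaps = [len(h & l) / k for h, l in zip(high, low)]
    return sum(overlaps) / len(overlaps)
```

A score near 1.0 would mean the layout keeps local neighborhoods mostly intact; running it on both a t-SNE and a UMAP layout of the same sample would give a rough side-by-side comparison.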

> which makes it harder to see the similarity of niche anime to others

This is true, but idk if it's t-SNE's fault per se. I set up my node sizing code back when I first built the atlas a couple of years ago. Since then, I've collected much more data but the sizing code hasn't been changed. This causes the most popular anime to be even larger on the visualization and drown out others.
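As a rough illustration of what a less popularity-dominated sizing rule could look like, here is a minimal sketch that maps rating counts to node radii on a log scale and clamps the result. This is not the actual Sprout sizing code; the function name, the radius bounds, and the `max_count` cap are all assumed values picked for the example.

```python
import math


def node_radius(rating_count, min_radius=2.0, max_radius=30.0, max_count=1_000_000):
    """Map a rating count to a node radius on a log scale (all defaults are assumptions)."""
    # Clamp so that zero/negative counts and counts above the cap stay in range.
    count = max(1, min(rating_count, max_count))
    # t is 0.0 for a single rating and 1.0 at max_count.
    t = math.log(count) / math.log(max_count)
    return min_radius + t * (max_radius - min_radius)
```

With a log scale, a title with 1,000 ratings lands about halfway between the smallest and largest node instead of being a thousand times smaller than one with 1,000,000 ratings, so more data does not keep inflating the biggest nodes.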

I'll look into the scaling of the vis and the sizing of individual nodes when I deploy the next version of the site. I actually trained up a whole new model with data up to a couple of months ago, but I was having some issues with the quality of the model compared to what's currently live.

If you want to try it out, it's here: https://anime-preview.ameo.dev/

I'd be interested to hear your thoughts on recommendations for your own profile compared to live if you had time to try it out.

Here is one: suppose a person doesn't have a MAL account and picks the first few anime that come to mind (especially the niche ones) to see what else would be recommended. Those anime would be specifically "good" or "amazing" to them, but they should be able to rate the first algorithmically recommended anime at the top of the list as "meh" or "bad", since the recommendations often do have a bias toward popular middle-of-the-road stuff.
I think a partial solution for the recommendation engine, other than just adding a negative weight based on popularity, is probably to distill factors from the anime. Amplifying the differences between genres would help with niche genres and topics. I may be biased toward throwing around buzzwords like ICA and sparse matrices, since niche anime often gets lumped with other niche anime, but clusters can still have a general dimension.
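For concreteness, the "negative weight based on popularity" baseline could be sketched roughly like this. It is only an assumed illustration, not how Sprout ranks anything: the `(title, score, rating_count)` tuple layout, the function name, and the `penalty` value are all hypothetical.

```python
import math


def rerank_with_popularity_penalty(candidates, penalty=0.1):
    """Re-rank (title, score, rating_count) tuples, subtracting a log-popularity penalty.

    The tuple layout and the default penalty are assumptions for this sketch.
    """
    adjusted = []
    for title, score, rating_count in candidates:
        # The penalty grows with the order of magnitude of the rating count.
        adjusted_score = score - penalty * math.log10(max(rating_count, 1))
        adjusted.append((adjusted_score, title))
    adjusted.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in adjusted]


# Made-up example data: a popular title with a slightly higher raw score.
candidates = [("Popular Show", 0.90, 1_000_000), ("Niche Show", 0.80, 1_000)]
print(rerank_with_popularity_penalty(candidates, penalty=0.1))
```

With `penalty=0.0` the popular title stays on top; with `penalty=0.1` the niche one moves ahead. A single global penalty like this is blunt, which is part of why factor-based approaches might be worth trying on top of it.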
P.S. Have you tried other visualization techniques to find the sweet spot and emphasize differences, rather than just lumping the most popular stuff in the center?